The use of the parsing method for optimal code generation and common subexpression elimination

Anatoly Bezdushny and Vladimir Serebriakov

Computing Centre of the Academy of Sciences,

Vavilova 40, Moscow, 117333, USSR.


Abstract. Research on methods to automatically generate compiler back-ends has shown that the performance of code generators combining a fast tree-pattern matching algorithm with a dynamic programming algorithm is comparable to that of very fast production compilers. We propose to use parsing with ambiguous attributed tree grammars for code generation, resorting to classical terminology from formal language theory. An attribute grammar is supplemented with cost-attributes, their evaluation rules and cost criteria to select the best possible derivation tree. The algorithm of Cocke, Younger and Kasami (CYK), modified to accept tree grammars, is used for the parsing of an intermediate program. Code selection and register allocation are studied when common subexpressions are taken into account.

1. Introduction

Most automatically generated compiler back-ends combine a fast tree-pattern matching algorithm with a dynamic programming algorithm for high-quality code generation. We propose to return to the grammatical approach to code selection. An ambiguous attributed tree grammar can be used to describe the instruction set of the target CISC machine. The right-hand side of each production models a machine instruction. The left-hand side of the production describes where the result of the instruction will be available (registers, memory locations). Grammar productions are supplied with cost-attribute rules (estimating the amount of memory and machine cycles the instruction needs) to compute cost values which allow the code generator to select a minimal-cost derivation tree according to specified criteria. Instruction selection is done by a parsing algorithm combined with a dynamic programming algorithm that guarantees that the best possible code is selected. We modify the CYK algorithm for the parsing of an intermediate program. A general parsing method avoids the problems that occur with code generators based on LR parsing. The modification of the CYK algorithm for tree grammars leads to algorithms similar to known pattern-matching techniques. First we investigate the CYK algorithm and modify it to accept only tree grammars. Then an algorithm which determines minimal-cost derivation trees is given. In the second part of the paper we combine common subexpression detection, register allocation and cost evaluation, and study several approaches.

2. The use of the parsing method for code generation

2.1. Introduction

Wasilew [17] suggested treating code generation as a tree-rewriting process. Machine instructions are described by tree-rewriting rules, each consisting of a replacement node, a tree template and code generation actions. Code selection is the process of finding subtrees of the input tree that match tree templates and rewriting the matched subtrees by the corresponding replacement nodes. The process reduces the input tree to a single node. The machine code is generated by the actions associated with the tree-rewriting rules applied in the process. The sequence of matched subtrees is called a cover of the input tree. The process is usually nondeterministic, since more than one template can match a subtree or a portion of it. Cost values are associated with the tree-rewriting rules to indicate the quality of the generated code. The cost of a cover of the input tree is the sum of the costs of the covering rules. The cost values are used to resolve ambiguities in the selection of a cover.

Graham and Glanville [10,11,12] used LR parsing techniques to find the cover of the input tree. The replacement node of a tree-rewriting rule together with its tree template written in prefix order is interpreted as a production of a context-free grammar (a tree grammar). The set of such productions is a context-free grammar that is used to build an LR parser. The cover can be found by parsing a linearized prefix form of the input tree. The machine code is generated during the reductions of the parsing process. Several difficulties have to be overcome with the LR parsing approach. An LR machine grammar can have many thousands of productions, so the parser tables can become too large. The grammars are normally ambiguous, so specialized techniques have to be built into the parser generator to resolve ambiguities.

Weisgerber and Wilhelm [24] presented a bottom-up and a top-down pattern matcher for code selection. Both pattern matchers first compute all possible covers of the input tree; then a cheapest cover is selected by dynamic programming. The bottom-up automaton may grow exponentially with the size of the machine description, but its analysis-time performance is much better than that of the top-down automaton.

Ganapathi and Fischer [7,8] extended the LR approach by using an attribute grammar to describe the target machine. Grammar productions represent generalized machine instructions, and attributes and predicates specify target machine architectural restrictions on instructions and addressing modes. Instruction selection is done by attributed parsing. In an attribute grammar the number of productions is smaller than in the LR parsing approach. However, a user who wants to write a code generator specification (CGS) has to know the internal algorithms of the attributed parsing process to write a good specification.

To avoid this problem and to have a strictly declarative CGS, several implementations of code-generator generating systems [3,6] combine a fast tree-pattern matching algorithm with an efficient dynamic programming algorithm for generating high-quality code. These systems generate fast and efficient compiler back-ends. The BEG system [6] uses a direct pattern matching algorithm that is incorporated into the generated code generator. The tree-pattern matching algorithm of the TWIG system [3] is a generalization of Aho and Corasick's keyword-matching algorithm [1,5]. Both systems use a dynamic programming algorithm that is a simplification of Aho and Johnson's optimal code-generation algorithm [2]. Each subtree of the input tree is characterized by a single scalar cost. These costs are computed using the cost expressions of the tree-rewriting rules.

Recent approaches apply the formalism of algebraic equational specifications and term rewriting systems to the code generation problem [9,16]. R. Geigerich and K. Schmal [9] reformulated the code selection task as the problem of inverting a hierarchic derivor. Such approaches have led to several interesting theoretical results. A new design strategy for code generation based on several levels of abstraction is presented in [18]. The machine model is described using many-sorted signatures.

We use a grammatical approach to code generation [19]. The instruction set of the target CISC machine is described by an ambiguous attributed tree grammar. The attribute grammar is supplemented with cost-attributes, their evaluation rules and cost criteria. Using the cost-attribute rules, a parsing algorithm combined with a dynamic programming algorithm computes cost values which allow the code generator to select a minimal-cost derivation tree according to the specified criteria. Following this, the derivation-tree attributes are evaluated through a top-down left-to-right walk of the minimal-cost derivation tree, and the corresponding instructions are emitted. For the parsing of an intermediate program we use the well-known algorithm of Cocke, Younger and Kasami [13,14], specially modified to accept only tree grammars. The modified CYK algorithm takes O(n) time. The modification of the CYK algorithm for tree grammars results in algorithms that are similar to known pattern-matching techniques.

2.2. Code generation by attributed tree grammars

A compiler front-end analyzes the input program and translates it into an intermediate representation (IR). The task of a back-end is to transform the IR into machine code. Many factors influence the choice of an appropriate IR. We assume that the IR is a sequence of prefix expressions. Each operator has a fixed arity and may have attributes whose values are calculated in an earlier compiler phase.

The prefix expression in which the operator a0 is applied to expressions e1,e2,...,ek is a0e1e2...ek, where Arity(a0)=k. If a0...an is a prefix expression and the operator a0 has arity k, then there is only one decoding of a0...an as a0e1e2...ek, where e1,e2,...,ek are prefix expressions. So for each position i in the prefix expression a0...an we can compute the position i+j such that ai...am=aie1e2...el, where m=i+j-1 and l=Arity(ai). The string ai...am is called an ai-expression, and j is the length of the ai-expression.
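Since arities are fixed, the lengths of all ai-expressions can be pre-computed in a single right-to-left scan: the length of an ai-expression is one plus the sum of the lengths of its Arity(ai) argument expressions. A minimal sketch in Python (the function name and the list encoding of the expression are ours, for illustration; the example is the prefix form of a:=a+5 used below in Figure 2.2.2):

def expr_lengths(ops, arity):
    """l[i] = length of the a_i-expression starting at position i
    (computed right to left, so argument lengths are already known)."""
    n = len(ops)
    l = [0] * n
    for i in range(n - 1, -1, -1):
        length, p = 1, i + 1              # 1 for the operator a_i itself
        for _ in range(arity[ops[i]]):
            length += l[p]                # add one argument expression
            p += l[p]                     # and jump over it
        l[i] = length
    return l

expr = [':=', '@', '+', 'CONST', 'BASE', '+', 'CONST', '@', '+', 'CONST', 'BASE']
print(expr_lengths(expr, {':=': 2, '@': 1, '+': 2, 'CONST': 0, 'BASE': 0}))
# [11, 4, 3, 1, 1, 6, 1, 4, 3, 1, 1] -- l[0] covers the whole statement

These lengths are exactly the pre-computed jumps used by the tree-grammar parser of Section 2.4.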

For each prefix expression a corresponding expression tree can be built. If the operator a0 is applied to prefix expressions e1,e2,...,ek in the prefix expression z=a0e1e2...ek (Arity(a0)=k), then its expression tree T0 is shown in Figure 2.2.1, where T1,T2,...,Tk are the expression trees of e1,e2,...,ek respectively.

                         (a0)  T0
                       /  | \
                     /    |   \
                   /      |     \
                 /\      / \ ...  /\
                /T1\    /T2 \    /Tk\
                ----    -----    ----
Figure 2.2.1: Expression tree for z=a0e1e2...ek

Figure 2.2.2 shows an example prefix expression and the expression tree for a:=a+5, where a is a local variable stored on the stack at the offset CONST(a) from the stack pointer kept in the register BASE(i). The attribute values of the operators are indicated in parentheses. The @ operator makes its argument a memory address.

                              :=
                            /   \
                         /         \
                      @              +
                      |             / \
                      +      CONST(5) @
                    /   \              |
             CONST(a)   BASE(a)        +
                                     /   \
                               CONST(a) BASE(a)


  := @ + CONST(a) BASE(a) + CONST(5) @ + CONST(a) BASE(a) 

Figure 2.2.2: Expression tree for a := a + 5

A code generator processes a prefix expression and generates machine code for it. An algorithm that reads a prefix expression and produces code for each operator separately generates bad code, since one machine instruction usually performs, in addition to its proper action, an address calculation and a memory fetch.

So the code generator has to choose several operators to produce one machine instruction. To achieve this, we use attributed tree grammars.

The tree grammar G=(N,T,P,S) fulfils the following conditions:

(1) Each operator a of T has a fixed arity Arity(a)>=0.

(2) The right-hand side of each production A->w of P is a prefix expression with operators from T and operands from N.

A machine instruction is described by a production of an attributed grammar. The machine code is emitted by the actions associated with the productions. Grammar nonterminals represent a generalized result of the corresponding instructions and changes of the target machine environment. Nonterminals in the production right-hand side represent results calculated earlier by other instructions. The nonterminal on the production left-hand side represents the result of the instruction. Figure 2.2.3 contains the productions needed to process the expression of Figure 2.2.2.

(1) Operator ::= ':=' '@' Reg(R1) Reg(R2) =>
        emit: MOV R2,@R1 .
(2) Reg(R0) ::= '+' Reg(R1) Reg(R2) =>
        emit: ADD R1,R0      if R2=R0 .
(3) Reg(R0) ::= '+' CONST(X) Reg(R1) =>
        emit: ADD #X,R0      if R1=R0 .
(4) Reg(R0) ::= '+' Reg(R1) '@' '+' CONST(X) Reg(R2) =>
        emit: ADD X(R2),R0   if R1=R0 .
    (Add the value of a local variable stored on the stack at the offset X
    from the stack pointer stored in the register R2 to the contents of
    the register R1, provided the register R0 is the register R1.)
(5) Reg(R0) ::= '@' '+' CONST(X) Reg(R1) =>
        emit: MOV X(R1),R0 .
(6) Reg(R0) ::= '@' Reg(R1) =>
        emit: MOV @R1,R0 .
(7) Reg(R0) ::= CONST(X) =>
        emit: MOV #X,R0 .
(8) Reg(R0) ::= BASE(I) =>
        R0:=I .

Figure 2.2.3: Productions needed to process the expression of Figure 2.2.2

The code generator parses an input prefix expression and builds a derivation tree. Then the derivation-tree attributes are evaluated during a traversal of the tree, and instructions are emitted in accordance with the actions of the productions applied in the derivation tree.

However, the tree grammars that are used to describe the target machine are usually ambiguous: there are several possible derivations for a given input sentence. Cost-attributes associated with nonterminals and cost-attribute rules associated with productions are used to indicate qualities of the code, such as code size, execution time and changes of the target machine environment. The cost-attributes are evaluated according to the cost-attribute rules during the parsing of a prefix expression. Ambiguities during parsing are resolved in accordance with cost criteria. Each nonterminal is supplied with a cost criterion. Using the values of the nonterminal's cost-attributes, the criterion determines the desired production to be applied to the nonterminal. The production for the grammar start symbol, selected in accordance with the start symbol's criterion, determines a single derivation tree. This method allows the desired cheapest derivation tree to be found. Once the derivation tree has been determined, the derivation-tree attributes are evaluated during a walk over it, and the corresponding instructions are emitted.

2.3. General parsing algorithm of Cocke, Younger and Kasami

Let G=(N,T,P,S) be a context-free grammar (CFG). V will denote the set N+T. Elements of N will be denoted by the capitals A,B,C,D,E; elements of T by a,b,c,d; and elements of V by the capitals X,Y,Z. Elements of V* will be denoted by u,v,w,x,y and elements of T* by z. The symbol e stands for the empty string.

Let G be an e-free CFG and z=a1a2...an a sentence to be recognized. We assume that the CFG is e-free, unless otherwise stated. The algorithm of Cocke, Younger and Kasami was originally presented for grammars in Chomsky normal form; it can be modified to accept a CFG in general form [13,14]. The algorithm builds a triangular recognition table h, whose elements are denoted h[i,j], where 1<=i<=n and 0<=j<=n-i+1. The element h[i,j] contains the items [A->u.v]:

h[i,j] = {[A->u.v] | A->uv in P and u =>* ai...ai+j-1}

That is, the j symbols of the input sentence z starting at the i-th symbol are derived from u. The sentence z is in L(G) iff [S->w.] is in h[1,n].

The initial step of the algorithm is the construction of the sets h[i,0]. The main part of the algorithm is the construction of h[i,j] from the previously constructed sets h[i,k] for every 1<=i<=n, 1<=j<=n-i+1 and 0<=k<=j-1. The final step tests whether the item [S->w.] is a member of h[1,n].

Algorithm 1:

VAR a : ARRAY[1..n] OF Tterminal;        (* ai denotes a[i] *)
    h : ARRAY[1..n],[0..n] OF SET OF Titem;
BEGIN (* == denotes a recurrence relation, *)
      (* implemented by means of a WHILE-statement *)
  FOR i:=1 TO n DO h[i,0]:={[A->.w] | (A->w) in P} END;
  FOR every 1<=i<=n, 1<=j<=n-i+1 DO h[i,j]:={} END;
  FOR j:=1 TO n DO                       (* increasing length *)
    FOR i:=1 TO n-j+1 DO
      h[i,j]:=h[i,j]+
        {[A->u a[i+j-1].v] | [A->u.a[i+j-1] v] in h[i,j-1]};     (*1*)
      FOR k:=1 TO j-1 DO
        h[i,j]:=h[i,j]+
          {[A->uB.v] | [A->u.Bv] in h[i,k],                      (*2*)
                       [B->w.]   in h[i+k,j-k]}
      END;
      h[i,j]==h[i,j]+{[A->B.v] | [A->.Bv] in h[i,0],
                                 [B->w.]  in h[i,j]}
    END
  END;
  Test whether [S->w.] is a member of h[1,n]
END

Computing the sets h[i,1],...,h[i,n-i+1] for each position i, the algorithm recognizes all derivations of ai...ak (i<=k<=n). It inserts the item [A->w.] into the set h[i,k-i+1] if some derivation A =>* ai...ak is recognized.

The algorithm constructs n*(n+1)/2 sets h[i,j], so its space complexity is O(n^2), where n is the length of the string to be recognized. The number of items in a set h[i,j] is O(|P|). The algorithm does O(j) steps to process all set pairs (h[i,k],h[i+k,j-k]) (k<j) when computing h[i,j]. This means that the time complexity of the algorithm is O(n^3).
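For concreteness, Algorithm 1 can be transcribed directly into executable form. The following Python sketch is ours (a production is a pair (lhs, rhs), an item is a pair of production index and dot position; the == recurrence is iterated to a fixpoint):

def cyk_recognize(grammar, z, start):
    """Item-based CYK for a general e-free CFG, following Algorithm 1.
    grammar: list of (lhs, rhs) pairs, rhs a tuple of symbols."""
    n = len(z)
    complete = lambda p, d: d == len(grammar[p][1])
    h = {(i, 0): {(p, 0) for p in range(len(grammar))} for i in range(1, n + 1)}
    for j in range(1, n + 1):                  # increasing phrase length
        for i in range(1, n - j + 2):
            cur = set()
            for (p, d) in h[(i, j - 1)]:       # (*1*) shift terminal a[i+j-1]
                rhs = grammar[p][1]
                if d < len(rhs) and rhs[d] == z[i + j - 2]:  # z is 0-indexed
                    cur.add((p, d + 1))
            for k in range(1, j):              # (*2*) complete a nonterminal
                for (p, d) in h[(i, k)]:
                    rhs = grammar[p][1]
                    if d < len(rhs) and any(complete(q, e) and grammar[q][0] == rhs[d]
                                            for (q, e) in h[(i + k, j - k)]):
                        cur.add((p, d + 1))
            changed = True                     # the == recurrence
            while changed:
                changed = False
                done = {grammar[q][0] for (q, e) in cur if complete(q, e)}
                for (p, d) in h[(i, 0)]:
                    rhs = grammar[p][1]
                    if rhs and rhs[0] in done and (p, 1) not in cur:
                        cur.add((p, 1))
                        changed = True
            h[(i, j)] = cur
    return any(complete(p, d) and grammar[p][0] == start for (p, d) in h[(1, n)])

G = [('S', ('S', '+', 'T')), ('S', ('T',)), ('T', ('a',))]
print(cyk_recognize(G, 'a+a+a', 'S'))   # True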

Computing the sets h[i,j] (j<=n-i+1), the algorithm solves two problems. The first problem consists of finding the productions A->u ap w ar ... aq v (u ap =>* ai...) that match the substring ai...ap...ar...aq...ak of the input sentence a1a2...an, where ap,ar,...,aq are the terminals of the right-hand side of the production and u,w,...,v in N*. The first process tests whether there is a derivation

u' ap w' ar ... aq v' =>* ai...ap...ar...aq...ak

for some u',w',...,v' in N* with |u'|=|u|,...,|v'|=|v|. Solving the second problem, the algorithm tests whether there is a derivation

A => z0 B z1 C ... D zm =>* z0 z1' z1 ... zm' zm = ai...ak

such that A -> z0 B z1 C ... D zm is the matched production and

B =>* z1', C =>* z2', ..., D =>* zm'

are subderivations recognized during the computation of the sets h[p,q] (p>i and q<k-i+1), where B,C,...,D in N and the zi, zi' in T*.

The items [A->u.v] of h[i,m] (v<>e) are used during the computation of the sets h[i,j] (m<=j) to solve the first problem. The items [A->w.] of h[i,m] are used to solve the second problem when the sets h[l,j] (l<=i, m<=j) are computed.

In the following modification of the algorithm we collect the items [A->w.] for each position i in a set r[i], in order to separate the data of the first and the second processes and to simplify the search for the subderivations that correspond to a matching production. Instead of the triangular recognition table h[i,j], a linear table h'[j] is used to find the matched productions. The algorithm processes the positions i from n down to 1, computing at the i-th step all phrases that start at the i-th symbol. It recognizes a derivation A =>* al...aj only after all possible derivations B =>* ai...am with i>l have been recognized.

Algorithm 2:

VAR a : ARRAY[1..n] OF Tterminal;        (* ai denotes a[i] *)
    h': ARRAY[0..n] OF SET OF Titem;
    r : ARRAY[1..n] OF SET OF Tpair;
    (* The type Tpair describes sets of pairs such that for each item
       [A->w.] of h[i,j] (Alg.1) the pair (A->w,j) is in r[i] *)
BEGIN
  FOR i:=n DOWNTO 1 DO
    h'[0]:={[A->.w] | (A->w) in P}; r[i]:={};
    FOR j:=1 TO n-i+1 DO
      (* matching productions *)
      h'[j]:={[A->u a[i+j-1].v] | [A->u.a[i+j-1] v] in h'[j-1]};
      FOR k:=1 TO j-1 DO                 (* look for subderivations *)
        h'[j]:=h'[j]+{[A->uB.v] | [A->u.Bv]  in h'[k],
                                  (B->w,j-k) in r[i+k]}
      END;
      h'[j]==h'[j]+{[A->B.v] | [A->.Bv] in h'[0],
                               [B->w.]  in h'[j]}
    END;
    FOR j:=1 TO n-i+1 DO                 (* matched productions *)
      r[i]:=r[i]+{(A->w,j) | [A->w.] in h'[j]}
    END
  END;
  Test whether (S->w,n) is a member of r[1]
END

The sets r[i] have sizes O(n). This means that the space and time complexities are the same as for the previous algorithm.

The performance of the algorithm can be improved if the following transformation can be implemented.

For each pair (i,j), pre-compute all values k1,...,km such that only the pairs (h[i,k1],h[i+k1,j-k1]),...,(h[i,km],h[i+km,j-km]) are used to compute h[i,j]. Replace the loop on k of Algorithm 2 by a call to a procedure COMP(i,j) that computes for each i and j

h[i,j] := F(h[i,k1],h[i+k1,j-k1],...,h[i,km],h[i+km,j-km])

using the corresponding pre-computed values k1,...,km. The algorithm stores the pre-computed values k1,...,km in some structure before the parsing is started; the procedure COMP retrieves them during parsing.

This pays off only if the time complexity of the pre-computation does not exceed O(n^2) and a fixed amount of time is required to store and to retrieve the information.

If the pre-computation can be done in O(n) time, only O(n) sets k1,...,km can be computed. This means that only O(n) sets h[i,j] have to be computed, and the time complexity of the corresponding modification of the algorithm becomes O(n).

2.4. Parsing tree grammars

Tree grammars are used to describe the target machine instructions. Each string that is derived from some nonterminal A in a tree grammar is a prefix expression. So, if S =>* a1...an, then for each position i and subderivations A =>* ai...ak and B =>* ai...al the equality k=l holds, and k-i+1 is the length of the ai-expression. Therefore, instead of using the table h to do production matching, we can find all productions that match the string ai...ak of the input sentence a1...an by a left-to-right walk over ai...ak, using the pre-computed lengths of prefix expressions to jump from a matched terminal to the next terminal to be matched. Computing the length l[i] of the ai-expression for each position i pre-computes the values k1,...,km for each pair (i,l[i]).

Algorithm 3 computes all phrases starting at the i-th symbol on the i-th step. The procedure MATCH(j,h") traverses the string aj aj+1 ... aj+l[j]-1 and compares it with the aj-expressions indicated in the items of h". The parameter h" represents a set {[A->u aj.vw]} such that, if [A->u.v] is in h" on the i-th step of the algorithm, then u =>* ai...aj. Processing a subexpression vk of the aj-expression aj v1...vm (m=Arity(aj)), the procedure computes the sets h1 and h2. The assignments to h1 and h2 correspond to the assignment statements (*1*) and (*2*) of Algorithm 1 for j=(p-i)+1 and for k=p-i, j=k+l[p], respectively.

Chain productions are independent of operators, so they have to be tested separately. The application of a chain production may depend on another chain production; therefore, chain productions have to be tested repeatedly until no new chain production is applicable. We assume that there are no cycles in chain-production applications.

Algorithm 3:

VAR a : ARRAY[1..n] OF Tterminal;        (* ai denotes a[i] *)
    l : ARRAY[1..n] OF INTEGER;          (* l[i] = ai-expression length *)
    r : ARRAY[1..n] OF SET OF Tproduction;
    (* For each item [A->w.] of h[i,j] (of Alg.1) with j=l[i]
       the production (A->w) is in r[i] *)
    h": SET OF Titem;

PROCEDURE MATCH(VAR j:INTEGER; VAR h": SET OF Titem);
VAR p     : INTEGER;
    h1,h2 : SET OF Titem;
BEGIN
  (* h" is equal to {[A->u aj.vw]}, m=Arity(aj) *)
  (* v = v1 v2...vk-1 vk vk+1...vm = v'vk v" *)
  IF h"<>{} THEN
    p:=j+1;                              (* skip aj *)
    FOR k:=1 TO m DO                     (* match v1,v2,...,vm *)
      h2:={[A->u aj v'B.v"w] | [A->u aj v'.Bv"w] in h",
                               vk=B, (B->y) in r[p]};
      h1:={[A->u aj v'ap.xv"w] | [A->u aj v'.ap x v"w] in h",
                               vk=ap x, ap=a[p]};
      MATCH(p,h1);                       (* match vk *)
      h":=h1+h2
    END
  END;
  j:=j+l[j]                              (* skip the aj-expression *)
END MATCH;

BEGIN
  (* pre-computation of the values k1,...,km *)
  Compute the length l[i] of the ai-expression for each position i=1..n;
  FOR i:=n DOWNTO 1 DO                   (* recognize the input sentence *)
    h":={[A->ai.v] | (A->ai v) in P};
    j:=i;
    MATCH(j,h");                         (* match non-chain productions *)
    (* match chain productions *)
    h"==h"+{[A->B.] | (A->B) in P, [B->w.] in h"};
    r[i]:={(A->w) | [A->w.] in h"}
  END;
  Test whether (S->w) is a member of r[1]
END

The sets r[i] have sizes O(|P|). It is evident that the time and space complexity of the algorithm for tree grammars is O(n).
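The control flow of Algorithm 3 may be easier to follow in executable form. The following Python transcription is ours (the production encoding is as in the earlier CYK sketch; the chain-production closure is iterated to a fixpoint):

def parse_prefix(prods, arity, nonterms, a, start):
    """Algorithm 3: linear-time parsing of a prefix expression a[0..n-1]
    against a tree grammar given as a list of (lhs, rhs) productions."""
    n = len(a)
    l = [0] * n                          # ai-expression lengths (Section 2.2)
    for i in range(n - 1, -1, -1):
        ln, p = 1, i + 1
        for _ in range(arity[a[i]]):
            ln += l[p]; p += l[p]
        l[i] = ln
    r = [set() for _ in range(n)]        # matched productions per position

    def match(j, items):
        """Advance the dots of `items` (just past the terminal a[j]) across
        the whole a_j-expression; returns the surviving items."""
        p = j + 1
        for _ in range(arity[a[j]]):     # match the subexpressions v1..vm
            h1, h2 = set(), set()
            for (q, d) in items:
                rhs = prods[q][1]
                if d >= len(rhs):
                    continue
                X = rhs[d]
                if X in nonterms:        # a recognized subderivation?
                    if any(prods[t][0] == X for t in r[p]):
                        h2.add((q, d + 1))
                elif X == a[p]:          # the pattern spells out a[p]
                    h1.add((q, d + 1))
            items = match(p, h1) | h2
            p += l[p]                    # skip the a_p-expression
        return items

    for i in range(n - 1, -1, -1):       # recognize, right to left
        items = match(i, {(q, 1) for q, (_, rhs) in enumerate(prods)
                          if rhs[:1] == (a[i],)})
        while True:                      # chain productions, to a fixpoint
            done = {prods[q][0] for (q, d) in items if d == len(prods[q][1])}
            new = {(q, 1) for q, (_, rhs) in enumerate(prods)
                   if len(rhs) == 1 and rhs[0] in done} - items
            if not new:
                break
            items |= new
        r[i] = {q for (q, d) in items if d == len(prods[q][1])}
    return any(prods[q][0] == start for q in r[0])

prods = [('Reg', ('+', 'Reg', 'Reg')),   # cf. production (2)
         ('Reg', ('CONST',)),            # cf. production (7)
         ('Reg', ('@', 'Reg'))]          # cf. production (6)
print(parse_prefix(prods, {'+': 2, 'CONST': 0, '@': 1}, {'Reg'},
                   ['+', 'CONST', '@', 'CONST'], 'Reg'))   # True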

A compiler that does not apply code-improving transformations can use Algorithm 3 for the parsing of a prefix intermediate representation. We assume that the code generation phase follows an optimizing phase that deals with expression trees (see the second part of this paper). The following algorithms therefore process expression trees; however, they can be rewritten to process a prefix intermediate representation.

Let G be a tree grammar. For each sentence z of L(G) an expression tree can be built. We rewrite Algorithm 3 to process expression trees instead of strings. The following algorithm traverses an expression tree in depth-first right-to-left order and finds the subtrees in it that match the right-hand sides of productions in a bottom-up manner.

The procedure MATCH(n,h") traverses a neighbourhood of the subtree rooted at node n in depth-first order. It tests whether the subtree matches the prefix expressions {aj v} of the matching productions {A->u aj v w} and searches for the subderivations that correspond to a matching production. Therefore, if a production matches the subtree, then the string that corresponds to the subtree is derived from A, and the matched production can be applied in this derivation. We assume that nonterminals are operators of arity 0. The applicable productions are stored directly in the tree nodes. The data structure that represents a tree node has the following form:

PTnode = ^Tnode;
Tnode  = RECORD
  op    : Tterminal;
  son   : ARRAY[1..MaxArity] OF PTnode;
  rules : SET OF Tproduction
END;

The algorithm for recognizing expression trees is the following:

Algorithm 4:

VAR root : PTnode;
    h"   : SET OF Titem;

PROCEDURE PARSE(n:PTnode);

  PROCEDURE MATCH(n:PTnode; VAR h": SET OF Titem);
  VAR h1,h2 : SET OF Titem;
  BEGIN
    (* h" is equal to {[A->u aj.vw]}, m=Arity(aj) *)
    (* v = v1 v2...vk-1 vk vk+1...vm = v'vk v" *)
    IF h"<>{} THEN                       (* match v1,v2,...,vm *)
      FOR k:=1 TO m DO WITH n^.son[k]^ DO
        h2:={[A->u aj v'B.v"w] | [A->u aj v'.Bv"w] in h",
                                 vk=B, (B->y) in rules};
        h1:={[A->u aj v'ap.xv"w] | [A->u aj v'.ap x v"w] in h",
                                 vk=ap x, ap=op};
        MATCH(n^.son[k],h1);             (* match vk *)
        h":=h1+h2
      END END
    END
  END MATCH;

BEGIN
  FOR k:=Arity(n^.op) DOWNTO 1 DO        (* cf. FOR i:=n DOWNTO 1 *)
    PARSE(n^.son[k])
  END;
  h":={[A->ai.v] | (A->ai v) in P, ai=n^.op};
  MATCH(n,h");                           (* match non-chain productions *)
  (* match chain productions *)
  h"==h"+{[A->B.] | (A->B) in P, [B->w.] in h"};
  n^.rules:={(A->w) | [A->w.] in h"}     (* matched productions *)
END PARSE;

BEGIN
  (* pre-computation of k1,...,km *)
  Build the expression tree for the sentence to be recognized;
  root:=pointer_to_expression_tree;
  PARSE(root);                           (* recognize the input sentence *)
  Test whether (S->w) is a member of root^.rules
END;

The output of Algorithm 4 is the expression tree decorated with the sets of applicable productions.

Tree grammars can be ambiguous. All possible derivations can be reconstructed from the decorated expression tree of Algorithm 4, or from the applicable-production table r and the expression-length table l of Algorithm 3.

2.5. Minimal cost derivation tree

The tree grammars that are used to describe the target machine are usually ambiguous: there are several possible derivations for a given input sentence. To generate code, a single derivation tree has to be selected. The tree has to represent the desired qualities of the code, e.g. code size and execution time.

A derivation tree is selected according to specified cost-attributes, cost-attribute rules and cost criteria. Cost-attributes are associated with grammar nonterminals, their rules with grammar productions. The cost-attributes of every nonterminal A are attached to every node n of an expression tree t. Their values are computed in parallel with the parsing of the input sentence. Cost-attribute values reflect properties of the derivations from A.

Assume that for the node n an applicable production p: A->z0 X1 z1...Xk zk is detected, where zi in T* and Xj in N (1<=i,j<=k). The cost-attribute rules corresponding to the production p are evaluated as follows. According to the applicable production there are descendants n1,...,nk of n which correspond to the nonterminals X1,...,Xk respectively. Cost-attribute values are computed in bottom-up order. First all cost-attributes are initialized with the undefined value; it is then assumed that the cost-attribute values of the descendants n1,...,nk have already been computed.

If an attribute rule

A.a:=f(Xi.b,Xj.c,...) for 1<=i,j<=k

is associated with p then the attribute rule instance

n.a:=f(ni.b,nj.c,...)

is evaluated. The computed value is the value of the cost-attribute a of the nonterminal A in the node n for the production p. There may be several applicable productions for A. To choose a single applicable production for A, a cost criterion is used. Each nonterminal is supplied with its cost criterion. The criterion determines the desired production to be applied to the nonterminal, using the previously and the currently computed cost-attribute values of the nonterminal A. If it decides that the currently tested applicable production is better than the previously investigated one, or if the cost-attribute values of A are still undefined, it assigns the currently computed values to the cost-attributes of A and stores the corresponding production in the node n. At the end of parsing, the production stored for the grammar start symbol defines a single derivation tree.
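As an illustration, the computation can be sketched in Python. The rendering below is ours and deliberately simplified: the cost criterion is plain minimisation of a single scalar cost-attribute, the rule costs are invented, and the patterns are flattened to one operator with nonterminal children (the deep patterns (4) and (5) of Figure 2.2.3 are omitted):

from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Node:
    op: str
    kids: Tuple['Node', ...] = ()
    best: Dict[str, tuple] = None    # nonterminal -> (cost, rule)

# (lhs, operator, child nonterminals, rule cost) -- hypothetical costs
RULES = [
    ('Reg',      'CONST', (),             1),   # MOV #X,R0
    ('Reg',      'BASE',  (),             0),   # result is already a register
    ('Reg',      '+',     ('Reg', 'Reg'), 1),   # ADD R1,R0
    ('Reg',      '@',     ('Reg',),       1),   # MOV @R1,R0
    ('Operator', ':=',    ('Reg', 'Reg'), 1),   # MOV R2,@R1
]

def select(n: Node):
    """For every nonterminal keep the cheapest rule deriving this node."""
    for k in n.kids:
        select(k)
    n.best = {}
    for rule in RULES:
        lhs, op, kid_nts, cost = rule
        if op != n.op or len(kid_nts) != len(n.kids):
            continue
        try:   # every child must derive the nonterminal the pattern asks for
            total = cost + sum(k.best[nt][0] for k, nt in zip(n.kids, kid_nts))
        except KeyError:
            continue
        if lhs not in n.best or total < n.best[lhs][0]:
            n.best[lhs] = (total, rule)

# the expression tree of Figure 2.2.2: a := a + 5
addr = lambda: Node('@', (Node('+', (Node('CONST'), Node('BASE'))),))
tree = Node(':=', (addr(), Node('+', (Node('CONST'), addr()))))
select(tree)
print(tree.best['Operator'][0])   # minimal total cost under these toy rules: 9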

To modify Algorithm 4 to determine a single derivation tree, we add cost-attributes, their rules and cost criteria to it. The data structure of a tree node takes the following form:

PTnode = ^Tnode;
Tnode  = RECORD
  op      : Tterminal;
  son     : ARRAY[1..MaxArity] OF PTnode;
  nonterm : ARRAY Tnonterminal OF RECORD
              CostAttr   : ADDRESS;
              Production : Tproduction
            END;
  OperatorAttributes : ...
END;

In general the field CostAttr points to an area allocated for the cost-attributes of the corresponding nonterminal. If the required area fits into one machine word, it is allocated directly in this field. Each operator may have attributes whose values are calculated in an earlier compiler phase; these values are stored in the field OperatorAttributes.

The body of the procedure PARSE becomes

BEGIN
  FOR k:=Arity(n^.op) DOWNTO 1 DO PARSE(n^.son[k]) END;
  FOR each A of N DO WITH n^.nonterm[A] DO
    CostAttr:=UndefinedValue;
    Production:=Undefined
  END END;
  h":={[A->ai.v] | (A->ai v) in P, ai=n^.op};
  MATCH(n,h");
  FOR each (A->w) of P such that [A->w.] in h" DO
    ComputeCostAttribOf(A,n,(A->w));
    EvaluateCriterionOf(A,n^.nonterm[A].CostAttr);
    IF (A->w) is better than n^.nonterm[A].Production THEN
      Update(n^.nonterm[A].CostAttr);
      n^.nonterm[A].Production:=(A->w)
    END
  END;

  h"==h"+{[A->B.] | (A->B) in P, [B->w.] in h"};
  (* implemented, together with the cost evaluation, by the loop: *)
  WHILE there exists a production (A->B) of P
        such that [B->w.] in h" AND no [A->B.] in h" DO
    ComputeCostAttribOf(A,n,(A->B));
    EvaluateCriterionOf(A,n^.nonterm[A].CostAttr);
    IF (A->B) is better than n^.nonterm[A].Production THEN
      Update(n^.nonterm[A].CostAttr);
      n^.nonterm[A].Production:=(A->B)
    END;
    h":=h"+{[A->B.]}
  END
END;

Chain productions have to be processed once no more non-chain productions are applicable.

Once the derivation tree has been determined, the derivation-tree attributes are evaluated through a top-down left-to-right walk of the minimal-cost derivation tree, and the corresponding instructions are emitted. To traverse the derivation tree, the evaluator starts with the root of the expression tree and the grammar start symbol. It uses the production p: A->z0 X1 z1...Xk zk saved in the node n together with the given nonterminal A to determine the corresponding nodes n1,...,nk and nonterminals X1,...,Xk. It then traverses each node ni recursively with the nonterminal Xi.
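Continuing the sketch above, the walk can replay the stored rules, printing a line per selected rule after its operand subtrees (as the emit actions of Figure 2.2.3 would; register allocation is left out):

def emit(n: Node, nt: str):
    """Walk the minimal-cost derivation defined by the rules stored in
    `best`, recursing into the child/nonterminal pairs of the chosen rule."""
    _, (lhs, op, kid_nts, cost) = n.best[nt]
    for kid, kid_nt in zip(n.kids, kid_nts):   # operands first
        emit(kid, kid_nt)
    print(f"reduce {op} -> {lhs}   (rule cost {cost})")

emit(tree, 'Operator')    # one line per rule of the chosen cover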

The compiler generating system SUPER [20] uses the described method and produces efficient code generators. Code selection algorithms are directly incorporated into the generated code generator. For productions of Figure 2.2.3 the following code is generated:

CONST NumProd = 8;                     (* the number of productions *)
TYPE  TprodNum     = [1..NumProd];
      Tnonterminal = (Operator,Reg);
      Tterminal    = (Const,Base,Assign,Indr,Plus);
      TArity       = ARRAY Tterminal OF BYTE;
CONST Arity = TArity(0,0,2,1,2);

(* For each nonterminal X an EvalX(p,n) procedure is generated. The
   procedure computes the cost-attribute values of the nonterminal X
   according to the applicable production p and evaluates the cost
   criterion of X. It modifies the cost-attributes if the production p is
   better than the previously investigated one. *)

TYPE TEvalProc = PROCEDURE(TprodNum,PTnode);

PROCEDURE EvalOperator(pn:TprodNum; n:PTnode);
VAR CostAttrValue : ...
BEGIN WITH n^.nonterm[Operator] DO
  CostAttrValue:=CostAttribRuleForProdN1;   (* := @ Reg Reg *)
  EvaluateCriterionOfOperator;              (* evaluate the criterion *)
  IF pn is better than Production THEN
    CostAttr:=CostAttrValue; Production:=pn
  END
END END EvalOperator;

PROCEDURE EvalReg(pn:TprodNum; n:PTnode);
VAR CostAttrValue : ...  (* the currently computed cost-attribute values *)
BEGIN WITH n^.nonterm[Reg] DO
  CASE pn OF                                (* compute cost-attribute values *)
  |2: CostAttrValue:=CostAttribRuleForProdN2   (* + Reg Reg *)
  | . . .
  |8: CostAttrValue:=CostAttribRuleForProdN8   (* BASE *)
  END;
  EvaluateCriterionOfReg;                   (* evaluate the criterion *)
  IF pn is better than Production THEN      (* update the fields *)
    CostAttr:=CostAttrValue; Production:=pn
  END
END END EvalReg;

(* For each operator T that begins the right-hand side of some production
   a matchT(n) procedure is generated. The procedure tests whether the
   subtree of the node n matches the prefix expressions {Tv} of the
   matching productions {X->Tv}. The EvalX(p,n) procedures are called for
   the applicable productions. *)

PROCEDURE notMatched(pn:TprodNum; n:PTnode);
BEGIN END notMatched;

PROCEDURE matchConst(n:PTnode);
BEGIN EvalReg(7,n) END matchConst;          (* (7) Reg ::= CONST *)
. . .

PROCEDURE matchPlus(n:PTnode);              (* n points to the + node *)
CONST NumAlternatives = 3;                  (* the number of production
                                               right-hand sides starting
                                               with the operator + *)
TYPE  Tmatched = ARRAY[1..NumAlternatives] OF TEvalProc;
      TprodN   = ARRAY[1..NumAlternatives] OF TprodNum;
CONST Cmatched = Tmatched(EvalReg,EvalReg,EvalReg);
      ProdN    = TprodN(2,3,4);
VAR matched : Tmatched;
BEGIN WITH n^ DO
  (* assume first that the n-subtree matches all the alternatives *)
  matched:=Cmatched;
  IF son[1]^.op<>Const THEN matched[2]:=notMatched END;
  WITH son[2]^ DO
    IF op<>Indr THEN matched[3]:=notMatched
    ELSE
      IF son[1]^.op<>Plus THEN matched[3]:=notMatched END
    END
  END;
  (* the n-subtree matches the i-th alternative
     iff matched[i]<>notMatched *)
  FOR i:=1 TO NumAlternatives DO
    matched[i](ProdN[i],n)
  END
END END matchPlus;

TYPE TmatchProc = PROCEDURE(PTnode);
     TmatchArr  = ARRAY Tterminal OF TmatchProc;

CONST matchNode = TmatchArr(matchConst,matchBase,...,matchPlus);

(* The PARSE procedure traverses an expression tree in depth-first
   left-to-right order and finds the subtrees that match the right-hand
   sides of productions. *)

PROCEDURE PARSE(n:PTnode);
BEGIN WITH n^ DO
  FOR k:=1 TO Arity[op] DO PARSE(son[k]) END;
  matchNode[op](n);                         (* match non-chain productions *)
  MatchChainProductions(n)
END END PARSE;

BEGIN PARSE(root); ... END;

An undefined value serves as the cost-attribute value of a nonterminal that has no derivation yet, so the explicit test for subderivations corresponding to a matching production can be dropped. The matching procedures test each subexpression vk of the aj-expression aj v1...vm (m=Arity(aj)) independently of the result of processing v1,...,vk-1.

The code produced for a chain production includes the evaluation of its cost-attribute rules and of the cost criteria. Since the application of a chain production may depend on another chain production, an application sequence is determined for the chain productions; this sequence guarantees that after its execution no further productions are applicable.
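One way to fix such a sequence is to order the chain productions so that each one is tried only after every production for its right-hand-side nonterminal; since the chain graph is acyclic (Section 2.4), a topological sort suffices. A hypothetical Python sketch, using the nonterminal names of Figure 3.3.1:

from graphlib import TopologicalSorter   # Python 3.9+

def chain_sequence(chains):
    """Order chain productions (A, B), read A ::= B, so that every rule is
    applied after all rules that can make its B derivable."""
    ts = TopologicalSorter()
    for a, b in chains:
        ts.add(a, b)                     # A depends on B
    pos = {nt: i for i, nt in enumerate(ts.static_order())}
    return sorted(chains, key=lambda ab: pos[ab[0]])

print(chain_sequence([('Operand', 'Register'), ('Register', 'Stack')]))
# [('Register', 'Stack'), ('Operand', 'Register')]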

3. Common subexpression elimination, register allocation and instruction emitting

Let us now consider code generation for basic blocks when common subexpressions (CSEs) are taken into account. The algorithms suggested below are based on the following results.

1) Common subexpression detection can be carried out during one bottom-up traversal of the expression tree [21].

2) Calculation of the cost variants can also be carried out during one bottom-up traversal of the expression tree [3,6].

3) Generation of the optimal code sequence, without taking CSEs and register allocation into account, can be carried out during a second depth-first traversal [3,6].

4) The problem of optimal register allocation for expressions with CSEs is NP-complete [22,23].

In the rest of the paper we consider an efficient algorithm for CSE detection and its combination with optimal code generation and register allocation.

3.1. CSE detection

CSE detection is carried out over a basic block and is based on the following two observations.

1. Since a variable may be assigned to within the basic block, it is necessary to distinguish the variable before and after an assignment. To do this, a number is attached to every variable. Initially these numbers are equal to 0; a variable's number is increased by 1 whenever the variable gets a new value.

2. CSE detection is carried out during a depth-first traversal of the expression trees of the basic block. The set of all subexpressions headed by an operation is attached to that operation. When a node is reached (let a binary operation op be applied at the node; for unary operations the reasoning is the same), we look for a CSE attached to op. If there is a subexpression attached to op such that its left operand is a CSE with the left operand of the node and its right operand is a CSE with the right operand of the node, then we declare the node's subexpression a CSE with the previous one and save in this new CSE the pointer to the previous one (Figure 3.1.1). Constants and variables form the basis of this construction: two occurrences of the same variable with equal numbers are a CSE. If an expression is not a CSE, it is attached to op.

                |<-----| op |<-------|
               / \                  / \
              /   \                /   \
         --->/\   /\<---    ------/\   /\---
        |   /  \ /  \   |  |     /  \ /  \  |
        |   ---- ----   |  |     ---- ----  |
        |                --|----------------
         ------------------

                        Figure 3.1.1.
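The detection described above is essentially value numbering over the basic block. The following Python sketch is our condensed rendering; all names are ours, and the canonical-representative function rep, which follows the COMM pointers to the first occurrence, replaces the pointer comparisons of the implementation in Section 3.2:

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    op: str                                  # operator, or variable name
    kids: Tuple['Node', ...] = ()
    comm: Optional['Node'] = None            # previous occurrence, if a CSE
    is_cse: bool = False

def rep(n: Node) -> Node:
    """The first occurrence of the (sub)expression n stands for."""
    return rep(n.comm) if n.is_cse else n

optable = {}     # operator -> list of first occurrences    (cf. OPTABLE)
var_last = {}    # (variable, assignment number) -> first use (cf. TABLE)
var_num = {}     # variable -> current assignment number     (cf. NUMBER)

def detect(n: Node):
    """One bottom-up pass marking repeated subexpressions (Section 3.1)."""
    for k in n.kids:
        detect(k)
    if not n.kids:                           # a leaf: a variable use
        key = (n.op, var_num.get(n.op, 0))
        if key in var_last:                  # same variable, no assignment since
            n.is_cse, n.comm = True, var_last[key]
        else:
            var_last[key] = n
        return
    for m in optable.get(n.op, []):          # previous subtrees with this op
        if all(rep(x) is rep(y) for x, y in zip(n.kids, m.kids)):
            n.is_cse, n.comm = True, m
            return
    optable.setdefault(n.op, []).append(n)   # a first occurrence: remember it

def assigned(var):
    """Record an assignment: later uses are no longer common with earlier ones."""
    var_num[var] = var_num.get(var, 0) + 1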
3.2. Implementation

The following global variables are used:

TABLE - a table that keeps information about variables: the variable's number (NUMBER) and a pointer (LAST) to the node in the expression tree at which the variable was last used;

OPTABLE - a table that keeps the CSE lists attached to every operation (ADDR - pointer to the tree node, LIST - list tail);

REGSTATE - a table of register states; for every register it keeps a pointer to the last tree node whose value is loaded into the register;

VARLIST - a list of pointers into TABLE for all variables that were used in the basic block (ADDR - pointer to the TABLE entry, LIST - list tail). This list is used to quickly "release" all variables that were marked as common, after the basic block is finished.

The following record is attached to every expression tree node:

NODETYPE = RECORD
  LEFT   -- left descendant of the node;
  RIGHT  -- right descendant of the node;
  COMM   -- pointer to the previous CSE;
  FLAG   -- flag indicating whether the subtree is a CSE;
  STACK  -- stack displacement, if the register was stored;
  REGS   -- pointer to the REGSTATE entry for the register
            taken for the node;
  VARBL  -- flag indicating whether the node is a variable;
  VARNUM -- variable number;
  VAREF  -- pointer to the TABLE entry containing the
            variable description
END;

The dependencies between variables are shown in Fig. 3.2.1.

Node of the expression tree:

 ---------------------------------------------------
|LEFT|RIGHT|COMM|FLAG|STACK|REGS|VARBL|VARNUM|VAREF|
 ---------------------------------------------------

An OPTABLE list element (ADDR, LIST) points to the node; REGS points to a
REGSTATE entry; VAREF points to the TABLE entry (NUMBER, LAST) of the
variable; and a VARLIST element (ADDR, LIST) points to that TABLE entry in
its turn.

                        Figure 3.2.1.

        |<------------|<------------|<----------- OPTABLE[OP]
       OP            OP            OP
      /  \          /  \          /  \
     /\  /\        /\  /\        /\  /\

                        Figure 3.2.2.

All CSEs are gathered into a list that starts at OPTABLE[OP], as shown in Fig. 3.2.2.

CSE detection and tree construction are conducted using the following rules. In these rules the syntax is written before the delimiter => and the semantics after it. In the semantic part a reference of the form $i.NAME means the attribute NAME of the i-th symbol of the rule ($0 for the left-hand side symbol, $1 for the first symbol of the right-hand side, etc.).

The attribute ENTRY of the nonterminal VARIABLE gives the pointer to the variable in the symbol table. The attribute VAL of the symbol OP gives the code of the operation. The attribute NODE of the symbols INTEXPR and ASSIGNMENT gives the pointer to the record of the type NODETYPE of the corresponding nonterminal.

ASSIGNMENT ::= VARIABLE INTEXPR =>

TABLE[$1.ENTRY].NUMBER:=TABLE[$1.ENTRY].NUMBER+1;
{ Increment the counter of assignments to the variable }

INTEXPR ::= VARIABLE =>

WITH $0.NODE^ DO WITH TABLE[$1.ENTRY] DO
BEGIN
  IF LAST<>NIL                 { the variable has been used before }
     AND LAST^.VARNUM=NUMBER   { and there has been no assignment to it since }
  THEN BEGIN
    FLAG:=TRUE;                { the variable is a common subexpression }
    COMM:=LAST                 { pointer to the previous CSE }
  END
  ELSE FLAG:=FALSE;
  VAREF:=$1.ENTRY;             { entry in TABLE }
  LAST:=$0.NODE;               { pointer to the last usage of the variable }
  VARNUM:=NUMBER;              { the variable usage number }
  VARBL:=TRUE                  { the expression is a variable }
END;

INTEXPR ::= OP INTEXPR INTEXPR =>

VAR P,L : LISTYPE;             { the type of the CSE lists for operations }
    R   : BOOLEAN;

IF $2.NODE^.FLAG AND $3.NODE^.FLAG THEN
{ both the left and the right operands are CSEs }
BEGIN
  L:=OPTABLE[$1.VAL];          { the beginning of the CSE list of the operation }
  R:=L<>NIL;                   { is the list non-empty? }
  WHILE R DO
    IF ($2.NODE^.COMM=L^.ADDR^.LEFT)        { the left subtrees coincide }
       AND ($3.NODE^.COMM=L^.ADDR^.RIGHT)   { the right subtrees coincide }
    THEN R:=FALSE
    ELSE BEGIN
      L:=L^.LIST;              { the next list element }
      R:=L<>NIL                { is the list over? }
    END
  END
END
ELSE L:=NIL;                   { not a CSE }
WITH $0.NODE^ DO
BEGIN
  VARBL:=FALSE;                { not a variable }
  LEFT:=$2.NODE;               { pointer to the left subtree }
  RIGHT:=$3.NODE;              { pointer to the right subtree }
  IF L<>NIL THEN BEGIN
    FLAG:=TRUE;                { a common subexpression }
    COMM:=L^.ADDR              { pointer to the previous CSE }
  END
  ELSE BEGIN
    FLAG:=FALSE; COMM:=NIL;    { this expression has not occurred before: }
    { include it into the CSE list of the operation }
    NEW(P);
    P^.ADDR:=$0.NODE;
    P^.LIST:=OPTABLE[$1.VAL];
    OPTABLE[$1.VAL]:=P
  END
END;

3.3. Cost evaluation and register allocation.

Let us now consider how to combine cost evaluation and optimal code generation with CSE elimination and register allocation. The techniques suggested below allow different variants of the algorithm to be constructed, depending on the following choices.

1) The way the cost is evaluated. Two ways are possible:

1.1) The cost is evaluated without taking CSEs into account. In this case CSE detection and cost evaluation may be done during one depth-first traversal of the tree [21,22].

1.2) CSEs are taken into account when the cost is evaluated. If we suppose that an unlimited number of registers is available, this can be done in two traversals of the expression tree in the following way. Suppose a CSE is evaluated for the first time and its value is loaded into a register. Then one of the cost variants for the second occurrence of the CSE will be 0, which means that the value is taken from the register. Note that if there are two variants of covering the CSE, one of which loads its value into a register while the other does not, the former is not necessarily preferable; this is illustrated by the example below.

Code generation and register allocation may be done during the second depth-first traversal, simultaneously with optimal code selection. If the number of registers is not limited, the program generated for the basic block will be optimal in the sense of the cost.

2) If the number of registers is limited, then we may choose one of the following two strategies.

2.1) During the top-down depth-first traversal that finds the optimal covering, when we come to a node we determine whether it is a CSE with a subtree in the previously traversed part of the tree (for which registers have therefore already been allocated) and whether the result of that subtree is in a register. If so, and if the register still holds the value, we replace the evaluation of the subtree by the register. The cost of the resulting subtree is not greater than the cost of evaluating the whole subtree. If the register contents have changed, the subtree is evaluated over again (i.e. we continue to walk top-down over the tree). Code generation and register allocation are done during the returning bottom-up walk.

2.2) During the second top-down traversal find an optimal covering and, after this, allocate registers during the bottom-up walk. If we detect at some node that its subtree is a CSE with a previous one, but the value of the corresponding register has been lost, then we mark the node of the previous CSE so that its value is saved in memory; at that node it is necessary to emit a MOVE instruction to save the register. If the cost of the store instruction plus the cost of the load instruction is not greater than the cost of the subtree, then storing the subtree's value in memory is cheaper. Since the cost of MOVE R,-(SP) is known during the top-down traversal, when the cost of the subtree is also known, it is possible to compare these costs and to make an optimal decision: either mark the node and save the register, or re-evaluate the whole subtree. Code is generated during a third traversal.

Let us consider an example which shows that it is sometimes advantageous to generate code that loads values into registers and sometimes not, depending on the number of occurrences of the CSE. The patterns are depicted in Figure 3.3.1.

         Expression                   Operand
            /\    MOVE Operand,Memory     |
          / := \  Cost 2                  | Cost 0
        /        \                        |
      Memory    Operand              ( Register )

       ( Register )
           /\    ADD Displacement,Register
         / @  \  Cost 2
       /        \
  Displacement  Register

Here @ stands for address addition.

         Operand                   Register
           /\                         | MOVE SP,Register
         /    \  Cost 1               | Cost 1
       /        \                     |
 Displacement  Register             Stack

                         Figure 3.3.1.

Assume it is necessary to emit code for the following assignment statement sequence:

A:=V
B:=V

where V is a local variable (i.e. its displacement is defined from the top of the stack). The tree for this basic block is shown in Fig. 3.3.2.

                       Basic Block
                            /\
                          /    \
                        /        \
                      /\          /\
                    /   \        /   \
                   A   /@\      B    /@\
                      /   \         /   \
                     V    SP       V    SP

                        Figure 3.3.2.

If we do not use CSEs in the addressing part, then the following code will be generated:

                  MOVE V(SP),A | 3
                  MOVE V(SP),B | 3
                                --
                                 6

that has cost 6. Figure 3.3.3 shows the covering.

                     Basic block
               /                   \
     Expression                          Expression
        /\    MOVE V(SP),A                  /\    MOVE V(SP),B
      / := \  Cost 2                      / := \  Cost 2
    /        \                          /        \
 Memory    Operand                   Memory    Operand
    A         /\                        B         /\
            /    \  Cost 1                      /    \  Cost 1
          /        \                          /        \
    Displacement  Stack                 Displacement  Stack
         V                                   V

                        Figure 3.3.3.

If the CSE is evaluated explicitly, the code will be

                  MOVE SP,R   | 1
                  ADD V,R     | 2
                  MOVE (R),A  | 2
                  MOVE (R),B  | 2
                               --
                                7

that has cost 7. The covering is shown in Figure 3.3.4.

                     Basic block
               /                   \
     Expression                          Expression
        /\    MOVE (R),A                    /\    MOVE (R),B
      / := \  Cost 2                      / := \  Cost 2
    /        \                          /        \
 Memory    Operand                   Memory    Operand
    A         |                         B         |
              | Cost 0                            | Cost 0
          (Register)                          (Register)
             /\    ADD V,R
           /    \  Cost 2
         /        \
   Displacement  Register
        V           | MOVE SP,R
                    | Cost 1
                    |
                  Stack

                        Figure 3.3.4.

So in this case it is not profitable to use the CSE. Note that if the variable V were used three times in the basic block, the total cost would be equal to 9 in both cases. If it were used four times, the cost would be 11, while the sum of the costs without the CSE would be 12, so it would be profitable to use the CSE.
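With the costs of Figure 3.3.1 the break-even point can be checked mechanically; a small Python sanity check (k is the number of uses of V):

def cost_without_cse(k):
    return 3 * k               # each use: MOVE V(SP),X = 2, operand V(SP) = 1

def cost_with_cse(k):
    return 1 + 2 + 2 * k       # MOVE SP,R + ADD V,R, then k * MOVE (R),X

for k in (2, 3, 4):
    print(k, cost_without_cse(k), cost_with_cse(k))
# 2: 6 vs 7   3: 9 vs 9   4: 12 vs 11 -- the CSE pays off from four uses on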

Note that the algorithm that combines cost evaluation and optimal code selection with CSE detection gives the best code in every case.

4. Conclusion

We have investigated the algorithm of Cocke, Younger and Kasami and modified it to accept only tree grammars. An algorithm which determines minimal-cost derivation trees has been given. An approach is suggested that permits combining CSE detection, cost evaluation and register allocation. We believe the main advantage of our approach to code generation is the use of the usual parsing and attribute techniques for the generation of code, and of attributed tree grammars for specifications.

References

1. Aho A.V., Corasick M.J. Efficient string matching: an aid to bibliographic search. Comm. ACM, V.18, N.6, 1975.

2. Aho A.V., Johnson S.C. Optimal code generation for expression trees. J. ACM, V.23, N.3, 1976.

3. Aho A.V., Ganapathi M., Tjiang S.W. Code generation using tree matching and dynamic programming. ACM Transactions on Programming Languages and Systems, V.11, N.4, 1989.

4. Chow F.C., Ganapathi M. Intermediate languages in compiler construction - a bibliography. ACM SIGPLAN Not., V.18, N.11, 1983.

5. Christopher T.W., Hatcher P.J., Kukuk P.C. Using dynamic programming to generate optimal code in Graham-Glanville style code generation. ACM SIGPLAN Not., V.19, N.6, 1984.

6. Emmelmann H., Schroer F.-W., Landwehr R. BEG - a generator for efficient back-ends. ACM SIGPLAN Not., 1989, pp.227-237.

7. Ganapathi M., Fischer C.N., Hennessy J.L. Retargetable compiler code generation. ACM Computing Surveys, V.14, N.4, 1982.

8. Ganapathi M., Fischer C.N. Description-driven code generation using attribute grammars. Proc. of the 9th ACM Symp. on Principles of Programming Languages, ACM, 1982.

9. Geigerich R., Schmal K. Code selection techniques: pattern matching, tree parsing, and inversion of derivors. Proc. of the European Symposium on Programming (Nancy, France), 1988.

10. Glanville R.S. A machine independent algorithm for code generation and its use in retargetable compilers. Ph.D. Thesis, Univ. of California, Berkeley, 1977.

11. Graham S.L., Glanville R.S. A new method for compiler code generation. Proc. of the 5th ACM Symp. on Principles of Programming Languages, ACM, 1978.

12. Graham S.L., Henry R.R., Schulman R.A. An experiment in table driven code generation. ACM SIGPLAN Not., V.17, N.6, 1982.

13. Graham S.L., Harrison M.A., Ruzzo W.L. An improved context-free recognizer. ACM Transactions on Programming Languages and Systems, V.2, N.3, 1980.

14. Harrison M.A. Introduction to Formal Language Theory. Addison-Wesley, Reading, Mass., 1978.

15. Hoffmann C.M., O'Donnell M.J. Pattern matching in trees. J. ACM, V.29, N.1, 1982.

16. Despland A., Mazaud M., Rakotozafy R. Code generators and proving them correct. Science of Computer Programming, V.15, 1990, pp.15-54.

17. Wasilew S.G. A compiler writing system with optimization capabilities for complex order structures. Ph.D. Thesis, Northwestern University, Evanston, 1972.

18. Veldhuijzen van Zanten G.E. On the design of a code generator. Proceedings Bilateral Workshop on Compiler Construction, University of Twente, Enschede, The Netherlands, 1990.

19. Bezdushny A. Code generation by parsing method. Computing Centre of the Academy of Sciences of the USSR, 1990.

20. Bezdushny A., Serebriakov V. The input language of the compiler writing system SUPER. Computing Centre of the Academy of Sciences of the USSR, 1986.

21. Aho A.V., Sethi R., Ullman J.D. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.

22. Aho A.V., Johnson S.C., Ullman J.D. Code generation for expressions with common subexpressions. J. ACM, V.24, N.1, 1977, pp.146-160.

23. Bruno J., Sethi R. Code generation for a one-register machine. J. ACM, V.23, N.3, 1976, pp.502-510.

24. Weisgerber B., Wilhelm R. Two tree pattern matchers for code selection. In: Compiler Compilers and High Speed Compilation, Proceedings of the Workshop, Berlin, Oct. 10-14, 1988, pp.397-411.