The Accent Compiler Compiler
A Short Tutorial
The input of a computer program is often written in a specific language.
This holds for traditional compilers but it is also true for many other
application programs. For example, a program that displays molecules
reads input written in a molecule language.
Accent is a tool that supports the development of language processors. It generates programs that process source text written in the language specified by the user and make the underlying structure of the text explicit. Following this structure, the semantic actions linked to particular constructs are executed.

How to Describe Languages

Grammars

The user describes the language by providing a grammar. Such a grammar is given by rules.

A rule describes how to build a particular construct of the language. This is done by listing one or more alternatives for building the construct from constituents. For example, if we define a programming language, we can express the fact that a program is constructed from a declaration_part and a statement_part by the rule

    program : declaration_part statement_part ;

A rule has a left hand side that names the construct defined by the rule. A colon (":") separates the left hand side from the right hand side. The right hand side specifies how to build the construct. A semicolon (";") terminates the rule.

The name on the left hand side is called a nonterminal. A nonterminal may be used in right hand sides to specify constituents. The possible representations defined by a nonterminal are called the phrases of the nonterminal.

The rule above says that phrases for program are composed of a phrase for declaration_part followed by a phrase for statement_part.

A rule may provide more than one alternative. Here is a specification of the nonterminal statement:

    statement :
         variable '=' expression
       | IF expression THEN statement ELSE statement
       | WHILE expression DO statement
       | BEGIN statement_seq END
       ;

It describes four alternative ways to construct a statement. Alternatives are separated by a bar ("|").

The nonterminal that is defined by the first rule of the grammar is called the start symbol. The phrases of the start symbol constitute the language defined by the grammar.

Grammars of this kind are called context-free grammars. Accent can process all context-free grammars without restriction.

Lexical Elements

Nonterminal symbols are defined by rules. There are also elementary items called terminal symbols or tokens.

They may be given literally, if they are represented by a single character.
For example, the '=' element in the first alternative for statement stands for the character "=".

They may also be referred to by a symbolic name such as IF in the second alternative for statement. In our language an IF symbol could be represented by the two-character string "if". Such a mapping from strings to tokens is not defined by the Accent specification. In Accent, one just introduces symbolic names that are used for terminal symbols. For example,

    %token IF, THEN, ELSE, WHILE, DO, BEGIN, END;

introduces names for the tokens in the rule for statement. The declaration precedes the first rule of the grammar.

The actual representation is given by rules for a further tool: the generator Lex can be used to create a lexical analyzer that partitions the input text into tokens. A specification for Lex is a list of lexical rules.

A lexical rule states a pattern (a regular expression) that matches the token and an action that is carried out when the token is recognized. Here are two example rules:

    "="  { return '='; }
    "if" { return IF; }

If "=" is recognized then the code of the character '=' is returned as an indicator. If "if" is recognized then the value of the constant IF is returned.

A token may have a more complex structure. An example is the token NUMBER, which represents a sequence of digits. This can be specified by a Lex rule like this:

    [0-9]+ { return NUMBER; }

A string that matches the pattern [0-9]+ (i.e. a sequence of digits) is indicated as a NUMBER.

The lexical analyzer (described in the Lex specification) structures the input into a sequence of tokens. The syntactical analyzer (described in the Accent specification) hierarchically structures this sequence into phrases.

An item is specified as a token if it has a fixed representation such as "=" or "if", or if it can be defined by a simple expression such as the pattern for NUMBER. In many cases tokens are those items that can be separated by additional white space.

A Lex rule can specify that white space is skipped, so that it can be ignored in the syntactical grammar. The Lex rule

    " " { /* skip blank */ }

skips blanks by not returning a token indicator.

A Grammar for Expressions

We now present a simple but complete example: a grammar for expressions like

    10+20*30

Here is the Accent specification:

    01  %token NUMBER;
    02
    03  expression :
    04     term
    05  ;
    06
    07  term :
    08     term '+' factor
    09  |  term '-' factor
    10  |  factor
    11  ;
    12
    13  factor :
    14     factor '*' primary
    15  |  factor '/' primary
    16  |  primary
    17  ;
    18
    19  primary :
    20     NUMBER
    21  |  '(' term ')'
    22  |  '-' primary
    23  ;

These rules not only define what constitutes a valid expression but also give it structure. The different nonterminals reflect the binding strength of the operators. The operators of a factor ("*" and "/") bind more strongly than the operators of a term ("+" and "-") because factor is a constituent of a term (a term appearing inside a factor must be enclosed in parentheses).

For example, the input

    10+20*30

is structured as follows:

    expression
    |
    +-term
      |
      +-term
      | |
      | +-factor
      |   |
      |   +-primary
      |     |
      |     +-NUMBER
      |
      +-'+'
      |
      +-factor
        |
        +-factor
        | |
        | +-primary
        |   |
        |   +-NUMBER
        |
        +-'*'
        |
        +-primary
          |
          +-NUMBER

or more precisely (this representation, generated with Accent, specifies which alternative has been chosen and lists the constituents in curly braces):

    expression alternative at line 4, col 6 of grammar {
      term alternative at line 8, col 6 of grammar {
        term alternative at line 10, col 8 of grammar {
          factor alternative at line 16, col 9 of grammar {
            primary alternative at line 20, col 8 of grammar {
              NUMBER
            }
          }
        }
        '+'
        factor alternative at line 14, col 8 of grammar {
          factor alternative at line 16, col 9 of grammar {
            primary alternative at line 20, col 8 of grammar {
              NUMBER
            }
          }
          '*'
          primary alternative at line 20, col 8 of grammar {
            NUMBER
          }
        }
      }
    }

The tree above indicates that

    10+20*30

is structured as

    10+(20*30)

and not as

    (10+20)*30

Here is the Lex specification for the expression grammar:

    %{
    #include "yygrammar.h"
    %}
    %%
    "+"    { return '+'; }
    "-"    { return '-'; }
    "*"    { return '*'; }
    "/"    { return '/'; }
    "("    { return '('; }
    ")"    { return ')'; }
    [0-9]+ { return NUMBER; }
    " "    { /* skip blank */ }
    \n     { /* skip newline */ }
    .      { yyerror("illegal token"); }

(The file yygrammar.h, which is included in the header of the Lex specification, is generated by Accent and contains the definition of the constant NUMBER.)

How to Assign Meaning

Semantic Actions

From the above grammar Accent generates a program that analyzes its input syntactically: it rejects all texts that do not conform to the grammar.

In order to process the input semantically we have to specify semantic actions. These actions may be embedded into the grammar at arbitrary positions. They are executed when the particular alternative is processed. The members of a selected alternative are processed from left to right.

A semantic action is arbitrary C code enclosed in curly braces. The text (without the braces) is copied verbatim into the generated program.

Here is an example:

    N:
       { printf("1\n"); } A { printf("2\n"); } B { printf("3\n"); }
    |  { printf("x\n"); } C { printf("y\n"); }
    ;

    A: 'a' { printf("inside A\n"); };
    B: 'b' { printf("inside B\n"); };
    C: 'c' { printf("inside C\n"); };

For the input

    a b

the generated program produces the output

    1
    inside A
    2
    inside B
    3

For each nonterminal Accent generates a tree walker function. Here is the code generated for N (slightly edited and without the #line directives for the C preprocessor):

    N ()
    {
       switch(yyselect()) {
       case 1:
          {
             printf("1\n");
             A();
             printf("2\n");
             B();
             printf("3\n");
          }
          break;
       case 2:
          {
             printf("x\n");
             C();
             printf("y\n");
          }
          break;
       }
    }

Attributes of Nonterminals

Like functions in C, nonterminals can have parameters. Parameters may be of mode in or out. in parameters are used to pass information from the context to a particular nonterminal (often called inherited attributes). out parameters pass information from a nonterminal to its context (often called synthesized attributes).

On the left hand side of a rule the name of the nonterminal is followed by a signature that specifies mode, type, and name of the parameters. The signature is enclosed in the angle brackets "<" and ">".

For example

    N < %in int context, %out int result > : ... ;

N has an input parameter context and an output parameter result; both are of type int.

If a nonterminal appears on the right hand side, actual parameters follow the nonterminal name, enclosed in "<" and ">". For example

    N<actual_context, actual_result>

Parameters can be accessed inside semantic actions. The values of input parameters must be defined inside semantic actions or be the output of other members.

For example

    demo :
       { actual_context = 10; }
       N<actual_context, actual_result>
       { printf("%d\n", actual_result); }
    ;

An alternative for a nonterminal must define its output parameters, either by using them as output parameters for members or by assigning a value inside a semantic action.

If an output parameter of the left hand side (a formal output parameter) is used inside a semantic action, it must be dereferenced with the "*" operator (output parameters are passed by reference to the generated tree walker function).

For example

    N<%in int context, %out int result> : { *result = context+1; } ;

An elaboration of demo prints 11.

Here are the generated functions:

    demo ()
    {
       int actual_context;
       int actual_result;

       switch(yyselect()) {
       case 1:
          {
             actual_context = 10;
             N(actual_context, &actual_result);
             printf("%d\n", actual_result);
          }
          break;
       }
    }

    N (context, result)
       int context;
       int *result;
    {
       switch(yyselect()) {
       case 2:
          {
             *result = context+1;
          }
          break;
       }
    }

As you see, identifiers that appear as parameters of nonterminals are automatically declared.

Formal parameters, if present, are specified in the form

    < %in parameter_specifications %out parameter_specifications >

where either the %in group or the %out group may be omitted.
The most frequent case, where we have only output parameters, can simply be written without a mode indicator:

    < parameter_specifications >

parameter_specifications is a list of the form

    Type_1 Name_1 , ... , Type_n Name_n

The type may be omitted. In this case the special type YYSTYPE is assumed. This may be defined by the user as a macro. If there is no user-specific definition, YYSTYPE stands for long.

Hence in most cases the left hand side of a rule simply looks like this:

    Block<b> : ...

Attributes of Tokens

All items declared as tokens have an output parameter of type YYSTYPE. If a token is used on the right hand side of a rule, an actual parameter may be specified to access the attribute value of the token, which is computed by the scanner.

For example

    Value :
       NUMBER<n> { printf("%d\n", n); }
    ;

Here n represents the numeric value of NUMBER. It can be used in the semantic action.

The attribute value of a token must be computed in the semantic action of the Lex rule for the token. It must be assigned to the special variable yylval, which is of type YYSTYPE.

For example, if we want to access the value of a NUMBER, the corresponding Lex rule could be

    [0-9]+ { yylval = atoi(yytext); return NUMBER; }

The special variable yytext holds the string that matched the pattern. The C function atoi converts it into an integer.

Global Prelude

If YYSTYPE is defined by the user, it should be declared in an include file, because it is used in the grammar as well as in the Lex specification.

#include statements can be placed in the global prelude part at the beginning of the grammar file. Text that is enclosed by "%prelude {" and "}" is copied verbatim into the generated program.

For example

    %prelude {
    #include "yystype.h"
    }

Rule Prelude

Identifiers that are used as attributes need not be declared.
One may use semantic actions to declare additional variables (the curly braces surrounding a semantic action do not appear in the generated code).

For example

    demo :
       {int i = 0;} alternative_1
    |  alternative_2
    ;

Such variables are local to the alternative. i is visible in the semantic actions of alternative_1 but not in those of alternative_2.

Variables that are visible to all alternatives (but local to the rule) can be declared in a rule prelude, which has the same form as the global prelude but appears before the alternative list of a rule.

For example

    demo :
       %prelude {int i = 0;}
       alternative_1
    |  alternative_2
    ;

i is visible in the semantic actions of both alternatives.

The rule prelude can also be used to provide code that should be executed as initialization for all alternatives.

A Calculator

We are now ready to turn the expression grammar into a calculator.

For this purpose nonterminals get an output parameter that holds the numerical value of the phrase represented by the nonterminal. This value is computed from the numerical values of the constituents.

For example, in the left hand side

    term<n> :

term gets an attribute n. In the right hand side

    term<x> '+' factor<y> { *n = x+y; }

the nonterminal term gets an attribute x and the nonterminal factor gets an attribute y. The attribute of the left hand side is then computed as the sum of x and y.

Here is the complete grammar:

    %token NUMBER;

    expression :
       term<n> { printf("%d\n", n); }
    ;

    term<n> :
       term<x> '+' factor<y> { *n = x+y; }
    |  term<x> '-' factor<y> { *n = x-y; }
    |  factor<n>
    ;

    factor<n> :
       factor<x> '*' primary<y> { *n = x*y; }
    |  factor<x> '/' primary<y> { *n = x/y; }
    |  primary<n>
    ;

    primary<n> :
       NUMBER<n>
    |  '(' term<n> ')'
    |  '-' primary<x> { *n = -x; }
    ;

See "Using the Accent Compiler Compiler" for how to process this specification to obtain a functioning calculator.

How to Abbreviate Specifications

Extended Backus Naur Form

So far we have only considered grammars where members of alternatives are nonterminal and terminal symbols.
A formalism of this kind was used in the Algol 60 report and named after the editors of that document: Backus Naur Form.

Accent also supports a notation that is known as Extended Backus Naur Form. In this formalism one can write structured members to specify local alternatives and optional and repetitive elements.

Local Alternatives

A member of the form

    ( alt_1 | ... | alt_n )

can be used to specify alternative representations of a member without introducing a new nonterminal.

For example, instead of

    signed_number :
       sign NUMBER
    ;

    sign :
       '+'
    |  '-'
    ;

one can write

    signed_number :
       ( '+' | '-' ) NUMBER
    ;

Semantic actions may be inserted. The actions of the selected alternative are executed. For example,

    signed_number<r> :
       { int s; }
       ( '+' { s = +1; } | '-' { s = -1; } ) NUMBER<n>
       { *r = s*n; }
    ;

Optional Elements

A member can also have the form

    ( M_1 ... M_n )?

in which case the enclosed items M_1, ... , M_n may appear in the input or not.

For example,

    integer :
       ( sign )? NUMBER
    ;

specifies that integer is a NUMBER preceded by an optional sign.

So both

    123

and

    + 123

are valid phrases for integer.

More than one alternative may be specified between "(" and ")?":

    ( alt_1 | ... | alt_n )?

For example,

    integer :
       ( '+' | '-' )? NUMBER
    ;

specifies that an integer is a NUMBER that is optionally preceded by either a "+" or a "-".

When semantic actions are used, proper initialization is required, because it may happen that none of the alternatives is processed:

    integer<r> :
       { int s = +1; }
       ( '+' | '-' { s = -1; } )? NUMBER<n>
       { *r = s*n; }
    ;

Repetitive Elements

A further form of a member is

    ( M_1 ... M_n )*

in which case the enclosed items M_1, ... , M_n may be repeated an arbitrary number of times (including zero).

For example,

    number_list :
       NUMBER ( ',' NUMBER )*
    ;

specifies that a number_list consists of at least one NUMBER, followed by an arbitrary number of further NUMBERs, each preceded by a comma.

Semantic actions inside repetitions are executed as often as there are instances.
For example,

    number_list :
       NUMBER<sum> ( ',' NUMBER<next> { sum += next;} )*
       { printf("%d\n", sum); }
    ;

adds all the numbers and prints their sum.

Again, several alternatives may be specified:

    ( alt_1 | ... | alt_n )*

For example,

    statements :
       ( simple_statement | structured_statement )*
    ;

statements matches an arbitrary number of statements, each of which may be a simple_statement or a structured_statement.

How to Resolve Ambiguities

Ambiguities

The phrase structure of a source text determines the actions that are executed. There are grammars where the same source text can have different phrase structures. Such grammars are called ambiguous.

As an example consider this rule from the C language report:

    selection_statement :
       IF '(' expression ')' statement
    |  IF '(' expression ')' statement ELSE statement
    |  SWITCH '(' expression ')' statement
    ;

For the source text

    if (x) if (y) f(); else g();

two interpretations are possible. One treats the text like

    if (x) { if (y) f(); else g(); }

the other one like

    if (x) { if (y) f(); } else g();

In the first case g() is invoked if x is true and y is false. In the second case g() is invoked if x is false.

The intended interpretation could be specified by an unambiguous grammar (as has been done in the case of Java). But such a grammar would be clumsier, because the problem cannot be solved locally.

The C report uses an ambiguous grammar and states: "The else ambiguity is resolved by connecting an else with the last-encountered else-less if at the same block nesting level."

In traditional systems ambiguous grammars lead to violations of the constraints of the underlying grammar classes (all LL(1) and all LALR(1) grammars are unambiguous). The result is a conflict between parser actions (such as a "shift/reduce" conflict in Yacc; such conflicts can also occur when the grammar is unambiguous). Such systems allow the user to resolve the parser conflicts or take a default action.

Since Accent has been designed so that it can be used without knowledge of parser implementation, we have developed an annotation framework that allows the user to resolve ambiguities at the abstract level of the grammar. Moreover, this framework is complete: it allows every ambiguity to be resolved.

For this purpose we have to give a complete classification of ambiguities. We distinguish ambiguities between alternatives and ambiguities inside alternatives.

It is undecidable whether a given grammar is unambiguous. But an ambiguity can be detected at runtime. In this case the Accent runtime prints a detailed analysis and gives an indication of how to resolve the ambiguity via an annotation. If the annotated grammar is processed by Accent, the parser selects the phrase structure that has been specified by the user.

Ambiguities Between Alternatives

Here is a simple example of an ambiguity between alternatives:

    01  Start:
    02    'x' N 'z'
    03  ;
    04
    05  N:
    06    A { printf("A selected\n"); }
    07  | B { printf("B selected\n"); }
    08  ;
    09
    10  A: 'y' ;
    11  B: 'y' ;

For the input

    x y z

there are two possible derivations for N, since both A and B can produce a "y".

Hence there are two possible outputs for the same input:

    A selected

and

    B selected

When confronted with that input the parser emits the following diagnostics:

    GRAMMAR DEBUG INFORMATION

    Grammar ambiguity detected.
    Two different ``N'' derivation trees for the same phrase.

    TREE 1
    ------

    N alternative at line 6, col 3 of grammar {
      A alternative at line 10, col 4 of grammar {
        'y'
      }
    }

    TREE 2
    ------

    N alternative at line 7, col 3 of grammar {
      B alternative at line 11, col 4 of grammar {
        'y'
      }
    }

    Use %prio annotation to select an alternative.

    END OF GRAMMAR DEBUG INFORMATION

A "%prio" annotation can be used to give an alternative a certain priority. It is written in the form

    %prio number

and attached at the end of an alternative. The alternative then gets the priority number.
When two alternatives can produce the same string, both must carry a "%prio" annotation. The alternative with the higher priority is selected.

To select the second alternative for N we give it the priority 2 and assign 1 to the first alternative.

    N:
      A { printf("A selected\n"); } %prio 1
    | B { printf("B selected\n"); } %prio 2
    ;

Now the ambiguity is resolved and the output of the generated program is:

    B selected

In the same style we can resolve the else ambiguity:

    selection_statement :
       IF '(' expression ')' statement                 %prio 1
    |  IF '(' expression ')' statement ELSE statement  %prio 2
    |  SWITCH '(' expression ')' statement
    ;

It is not necessary to provide a priority for the third alternative because it is not involved in the ambiguity.

Ambiguities Inside Alternatives

We have seen that a derivation is not unique if one can replace a subtree by a different one that produces the same string. But there is also another source of ambiguity that is less obvious.

In this kind of ambiguity, two neighboring derivations inside the same alternative are replaced by different derivations that produce different strings. But together the new derivations produce the same string as the old ones.

It is clear that the string produced by a new derivation must have a different length than the string produced by the corresponding old derivation. The ambiguity can be resolved by specifying whether the short or the long string should be preferred.

Here is an example:

    01  N :
    02    'x' L R 'z'
    03  ;
    04
    05  L :
    06    'a'     { printf("( a )"); }
    07  | 'a' 'b' { printf("( a b )"); }
    08  ;
    09
    10  R :
    11    'c'     { printf("( c )\n"); }
    12  | 'b' 'c' { printf("( b c )\n"); }
    13  ;

To parse the input

    x a b c z

one can either use the first alternative of L and the second alternative of R, or the second alternative of L and the first alternative of R.

The diagnostic message produced by the parser is:

    GRAMMAR DEBUG INFORMATION

    Grammar ambiguity detected.
    There are two different parses
    for the beginning of ``N'', alternative at line 2, col 3 of grammar,
    up to and containing ``R'' at line 2, col 9 of grammar.

    PARSE 1
    -------

    'x'
    L alternative at line 6, col 3 of grammar {
      'a'
    }
    R alternative at line 12, col 3 of grammar {
      'b'
      'c'
    }

    PARSE 2
    -------

    'x'
    L alternative at line 7, col 3 of grammar {
      'a'
      'b'
    }
    R alternative at line 11, col 3 of grammar {
      'c'
    }

    For ``R'' at line 2, col 9 of grammar,
    use %long annotation to select first parse,
    use %short annotation to select second parse.

    END OF GRAMMAR DEBUG INFORMATION

The "%long" annotation in front of a nonterminal indicates that we prefer that the nonterminal produces a longer string. If we prefix the nonterminal symbol R with %long as in

    N :
      'x' L %long R 'z'
    ;

the ambiguity is resolved and we get the output

    ( a )( b c )

The "%short" annotation indicates that we prefer the shorter string. If we prefix R with %short as in

    N :
      'x' L %short R 'z'
    ;

we get the output

    ( a b )( c )