[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

7. Handling Context Dependencies

The Bison paradigm is to parse tokens first, then group them into larger syntactic units. In many languages, the meaning of a token is affected by its context. Although this violates the Bison paradigm, certain techniques (known as kludges) may enable you to write Bison parsers for such languages.

7.1 Semantic Info in Token Types  Token parsing can depend on the semantic context.
7.2 Lexical Tie-ins  Token parsing can depend on the syntactic context.
7.3 Lexical Tie-ins and Error Recovery  Lexical tie-ins have implications for how error recovery rules must be written.

(Actually, "kludge" means any technique that gets its job done but is neither clean nor robust.)


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

7.1 Semantic Info in Token Types

The C language has a context dependency: the way an identifier is used depends on what its current meaning is. For example, consider this:

 
foo (x);

This looks like a function call statement, but if foo is a typedef name, then this is actually a declaration of x. How can a Bison parser for C decide how to parse this input?

The method used in GNU C is to have two different token types, IDENTIFIER and TYPENAME. When yylex finds an identifier, it looks up the current declaration of the identifier in order to decide which token type to return: TYPENAME if the identifier is declared as a typedef, IDENTIFIER otherwise.

The grammar rules can then express the context dependency by the choice of token type to recognize. IDENTIFIER is accepted as an expression, but TYPENAME is not. TYPENAME can start a declaration, but IDENTIFIER cannot. In contexts where the meaning of the identifier is not significant, such as in declarations that can shadow a typedef name, either TYPENAME or IDENTIFIER is accepted--there is one rule for each of the two token types.

This technique is simple to use if the decision of which kinds of identifiers to allow is made at a place close to where the identifier is parsed. But in C this is not always so: C allows a declaration to redeclare a typedef name provided an explicit type has been specified earlier:

 
typedef int foo, bar, lose;
static foo (bar);        /* redeclare bar as static variable */
static int foo (lose);   /* redeclare foo as function */

Unfortunately, the name being declared is separated from the declaration construct itself by a complicated syntactic structure--the "declarator".

As a result, the part of Bison parser for C needs to be duplicated, with all the nonterminal names changed: once for parsing a declaration in which a typedef name can be redefined, and once for parsing a declaration in which that can't be done. Here is a part of the duplication, with actions omitted for brevity:

 
initdcl:
          declarator maybeasm '='
          init
        | declarator maybeasm
        ;

notype_initdcl:
          notype_declarator maybeasm '='
          init
        | notype_declarator maybeasm
        ;

Here initdcl can redeclare a typedef name, but notype_initdcl cannot. The distinction between declarator and notype_declarator is the same sort of thing.

There is some similarity between this technique and a lexical tie-in (described next), in that information which alters the lexical analysis is changed during parsing by other parts of the program. The difference is here the information is global, and is used for other purposes in the program. A true lexical tie-in has a special-purpose flag controlled by the syntactic context.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

7.2 Lexical Tie-ins

One way to handle context-dependency is the lexical tie-in: a flag which is set by Bison actions, whose purpose is to alter the way tokens are parsed.

For example, suppose we have a language vaguely like C, but with a special construct `hex (hex-expr)'. After the keyword hex comes an expression in parentheses in which all integers are hexadecimal. In particular, the token `a1b' must be treated as an integer rather than as an identifier if it appears in that context. Here is how you can do it:

 
%{
int hexflag;
%}
%%
...
expr:   IDENTIFIER
        | constant
        | HEX '('
                { hexflag = 1; }
          expr ')'
                { hexflag = 0;
                   $$ = $4; }
        | expr '+' expr
                { $$ = make_sum ($1, $3); }
        ...
        ;

constant:
          INTEGER
        | STRING
        ;

Here we assume that yylex looks at the value of hexflag; when it is nonzero, all integers are parsed in hexadecimal, and tokens starting with letters are parsed as integers if possible.

The declaration of hexflag shown in the C declarations section of the parser file is needed to make it accessible to the actions (see section The C Declarations Section). You must also write the code in yylex to obey the flag.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

7.3 Lexical Tie-ins and Error Recovery

Lexical tie-ins make strict demands on any error recovery rules you have. See section 6. Error Recovery.

The reason for this is that the purpose of an error recovery rule is to abort the parsing of one construct and resume in some larger construct. For example, in C-like languages, a typical error recovery rule is to skip tokens until the next semicolon, and then start a new statement, like this:

 
stmt:   expr ';'
        | IF '(' expr ')' stmt { ... }
        ...
        error ';'
                { hexflag = 0; }
        ;

If there is a syntax error in the middle of a `hex (expr)' construct, this error rule will apply, and then the action for the completed `hex (expr)' will never run. So hexflag would remain set for the entire rest of the input, or until the next hex keyword, causing identifiers to be misinterpreted as integers.

To avoid this problem the error recovery rule itself clears hexflag.

There may also be an error recovery rule that works within expressions. For example, there could be a rule which applies within parentheses and skips to the close-parenthesis:

 
expr:   ...
        | '(' expr ')'
                { $$ = $2; }
        | '(' error ')'
        ...

If this rule acts within the hex construct, it is not going to abort that construct (since it applies to an inner level of parentheses within the construct). Therefore, it should not clear the flag: the rest of the hex construct should be parsed with the flag still in effect.

What if there is an error recovery rule which might abort out of the hex construct or might not, depending on circumstances? There is no way you can write the action to determine whether a hex construct is being aborted or not. So if you are using a lexical tie-in, you had better make sure your error recovery rules are not of this kind. Each rule must be such that you can be sure that it always will, or always won't, have to clear the flag.


[ << ] [ >> ]           [Top] [Contents] [Index] [ ? ]

This document was generated by Frank B. Brokken on January, 28 2005 using texi2html