parse()
to return PARSE_ABORT
on error and
have the caller ignore the rest of the input line when that happens (and then
call parse()
again). But this is inadequate for a compiler, because it
forgets all the syntactic context leading up to the error. A syntactic error
deep within a function in the compiler input should not cause the compiler to
treat the following line like the beginning of a source file.
It is possible to specify how to recover from a syntactic error by
writing rules recognizing the special token error
. This is a terminal
symbol that is always defined (it must not be declared) and is reserved
for error handling. The bisonc++ parser generates an error
token whenever a
syntactic error is detected; if a rule was provided recognizing this token
in the current context, the parse can continue. For example:
statements: // empty | statements '\n' | statements expression '\n' | statements error '\n'The fourth rule in this example says that an error followed by a newline makes a valid addition to any
statements
.
What happens if a syntactic error occurs in the middle of an
expression
? The error recovery rule, interpreted strictly, applies to the
precise sequence of a statements
, an error and a newline. If an error
occurs in the middle of an expression
, there will probably be some
additional tokens and subexpressions on the parser's stack after the last
statements
, and there will be tokens waiting to be read before the next
newline. So the rule is not applicable in the ordinary way.
bisonc++, however, can force the situation to fit the rule, by discarding
part of the semantic context and part of the input. When a (syntactic) error
occurs the parsing algorithm tries to recover from the error in the
following way: First it discards states from the stack until it encounters a
state in which the error
token is acceptable (meaning that the
subexpressions already parsed are discarded, back to the last complete
statements
). At this point the error token is shifted. Then, if the
available look-ahead token is not acceptable to be shifted next, the parser
continues to read tokens and to discard them until it finds a token which
is acceptable. I.e., a token which can follow an error
token in
the current state. In this example, bisonc++ reads and discards input until the
next newline was read so that the fourth rule can apply.
The choice of error rules in the grammar is a choice of strategies for error recovery. A simple and useful strategy is simply to skip the rest of the current input line or current statement if an error is detected:
statement: error ';' // on error, skip until ';' is readAnother useful recovery strategy is to recover to the matching close-delimiter of an opening-delimiter that has already been parsed. Otherwise the close-delimiter probably appears to be unmatched, generating another, spurious error message:
primary: '(' expression ')' | '(' error ')' | ... ;Error recovery strategies are necessarily guesses. When they guess wrong, one syntactic error often leads to another. In the above example, the error recovery rule guesses that an error is caused by bad input within one statement. Suppose that instead a spurious semicolon is inserted in the middle of a valid statement. After the error recovery rule recovers from the first error, another syntactic error will immediately be found, since the text following the spurious semicolon is also an invalid statement.
To prevent an outpouring of error messages, the parser may be configured in such a way that no error message are generated for another syntactic error that happens shortly after the first. E.g., only after three consecutive input tokens have been successfully shifted error messages may again be generated. This configuration is currently not available in bisonc++'s parsers.
Note that rules using the error
token may have actions, just as any
other rules can.
The token causing an error is re-analyzed immediately when an error
occurs. If this is unacceptable, then the member function clearin()
may be
called to skip this token. The function can be called by any member function
of the Parser class. For example, suppose that on a parse error, an error
handling routine is called that advances the input stream to some point where
parsing should once again commence. The next symbol returned by the lexical
scanner is probably correct. The previous token ought to be discarded using
clearin()
.
lookup()
function cannot find an action for the current token in the current state it
throws an UNEXPECTED_TOKEN_
exception.
This exception is caught by the parsing function, calling the
errorRecovery()
member function. By default, this member function
terminates the parsing process. The non-default recovery procedure is
available once an error
token is used in a production rule. When the
parsing process throws UNEXPECTED_TOKEN_ the recovery procedure is
started (i.e., it is started whenever a syntactical error is encountered or
ERROR()
is called).
The recovery procedure consists of
If the error recovery procedure fails (i.e., if no acceptable token is ever encountered) error recovery falls back to the default recovery mode (i.e., the parsing process is terminated).
Not all syntactic errors are always reported: the option --required-tokens can be used to specify the minimum number of tokens that must have been successfully processed before another syntactic error is reported (and counted).
The option --error-verbose may be specified to obtain the contents of the state stack when a syntactic error is reported.
The example grammar may be provided with an error
production rule:
%token NR %left '+' %% start: start expr | // empty ; expr: error | NR | expr '+' expr ;The resulting grammar has one additional state (handling the error production) and one state in which the
ERR_ITEM
flag has been set. When
and error is encountered, this state obtains tokens until a token having a
valid continuation was received, after which normal processing continues.
In the parser's verbose output (using the -V
option) the various grammar
rules and state transition tables are shown. The debug output below refers to
this information.
The rules are:
1: start -> start expr 2: start -> <empty> 3: expr (errTok_) -> errTok_ 4: expr (NR) -> NR 5: expr ('+') -> expr '+' expr 6: start_$ -> start
The state-transitions are:
State 0: 6: start_$ -> . start 0: On start to state 1 Reduce by 2: start -> . State 1: 6: start_$ -> start . 1: start -> start . expr 0: On expr to state 2 1: On errTok_ to state 3 2: On NR to state 4 State 2: 1: start -> start expr . 5: expr -> expr . '+' expr 0: On '+' to state 5 Reduce by 1: start -> start expr . State 3: 3: expr -> errTok_ . Reduce by 3: expr -> errTok_ . State 4: 4: expr -> NR . Reduce by 4: expr -> NR . State 5: 5: expr -> expr '+' . expr 0: On expr to state 6 1: On errTok_ to state 3 2: On NR to state 4 State 6: 5: expr -> expr '+' expr . 5: expr -> expr . '+' expr 0 (removed by precedence): On '+' to state 5 Reduce by 5: expr -> expr '+' expr .
The following output from the parse()
function, generated by bisonc++ using the
--debug
option illustrates error recovery for the above grammar, entering
the input
a 3+aresults in:
parse(): Parsing starts PUSH 0 (initializing the state stack) LOOKUP: [0, 'Reserved_::UNDETERMINED_'] -> default reduce using rule 2 REDUCE: rule 2 execute action 2 ... ... completed rule 2: pop 0 elements. Next will be: [0, 'start'] pop 0 elements from the stack having size 1 next: [0, 'start'] PUSH: [0, 'start'] -> 1 scanner token `a' (97) ERROR: [1, `a' (97)] -> ??. Errors: 1 Syntax error Reached ERROR state 1 PUSH: [1, 'errTok_'] -> 3 LOOKUP: [3, `a' (97)] -> default reduce using rule 3 REDUCE: rule 3 rule 3: pop 1 elements. Next will be: [1, 'expr'] pop 1 elements from the stack having size 3 next: [1, 'expr'] available token 'expr' PUSH: [1, 'expr'] -> 2 available token `a' (97) LOOKUP: [2, `a' (97)] -> default reduce using rule 1 REDUCE: rule 1 rule 1: pop 2 elements. Next will be: [0, 'start'] pop 2 elements from the stack having size 3 next: [0, 'start'] PUSH: [0, 'start'] -> 1 available token `a' (97) scanner token 'NR' PUSH: [1, 'NR'] -> 4 ERROR RECOVERED: next state 4 LOOKUP: [4, 'Reserved_::UNDETERMINED_'] -> default reduce using rule 4 REDUCE: rule 4 execute action 4 ... ... completed rule 4: pop 1 elements. Next will be: [1, 'expr'] pop 1 elements from the stack having size 3 next: [1, 'expr'] available token 'expr' PUSH: [1, 'expr'] -> 2 scanner token 'Reserved_::EOF_' LOOKUP: [2, 'Reserved_::EOF_'] -> default reduce using rule 1 REDUCE: rule 1 execute action 1 ... ... completed rule 1: pop 2 elements. Next will be: [0, 'start'] pop 2 elements from the stack having size 3 next: [0, 'start'] PUSH: [0, 'start'] -> 1 available token 'Reserved_::EOF_' ACCEPT(): Parsing successful parse(): returns 0 or 1
The final debug message should be interpreted as a C++ expression: the
0 indicates that Parse::ACCEPT
rather than Parser::ABORT
was called,
the 1 shows the value of d_nErrors_
. Consequently, parse()
returns 1
(i.e., `0 or 1'
).
expr '=' exprThe assignment operator's left-hand side must be a so-called lvalue. An lvalue is simply an addressable location, like a variable's identifier, a dereferenced pointer expression or some other address-expression. The right-hand side is a so-called rvalue: this may be any value: any expression will do.
A rule like the above leaves room for many different semantic errors:
expr
, any expression is accepted
by the parser. E.g.,
3 = 12So, the action associated with this rule should check whether the expression's left-hand side is actually an lvalue. If not, a semantic error should be reported;
size_t
d_nSemanticErrors
. It may be possible to test this counter's value once the
input has been parsed, calling ABORT()
(see section 5.3) if the
counter isn't zero anymore. When the grammar's start symbol itself has
multiple alternatives, it is probably easiest to augment the grammar with an
additional rule, becoming the augmented grammar's start symbol which simply
calls the former start symbol. For example, if input
was the name of the
original start-symbol, augment the grammar as follows to ensure a
PARSE_ABORT return value of the parse()
member when either syntactic
or semantic errors were detected:
semantic_input: // new start-symbol input { if (d_nSemanticErrors) // return PARSE_ABORT ABORT(); // on semantic errors too. }Returning from the parser's
parse()
member the number of syntactic
and semantic errors could then be printed, whereupon the program might
terminate.