parse, that recognizes correct instances of the grammar.
In this chapter the organization and specification of such a grammar file is discussed in detail.
Having read this chapter you should be able to define a grammar for which Bisonc++ can generate a class, containing a member that recognizes correctly formulated (according to the grammar) input. Such a grammar must be in the class of LALR(1) grammars (see, e.g., Aho, Sethi & Ullman, 2003 (Addison-Wesley)).
    Bisonc++ directives
    %%
    Grammar rules

Readers familiar with Bison may note that there is no C declaration section and no section to define additional C code. With bisonc++ these sections are superfluous: because a bisonc++ parser is defined as a C++ class, all additional code required for the parser's implementation can be incorporated into the parser class itself. Also, C++ classes normally only require declarations that can be placed in the classes' header files, so the `additional C code' section was omitted from bisonc++'s grammar file as well.
The `%%' is a separator that appears in every bisonc++ grammar file separating the two sections.
The directives section is used to declare the names of the terminal and nonterminal symbols, and may also describe operator precedence and the data types of semantic values of various symbols. Additional directives are used to define, e.g., the name of the generated parser class and a namespace in which the parser class is defined. All bisonc++ directives are covered in section 4.5.
Grammar rules define how to construct nonterminal symbols. The grammar rules section defines one or more bisonc++ grammar rules. See section 4.3, covering the syntax of grammar rules.
The `%%' separator must always be specified, even if the directives section is empty.
Bisonc++'s grammar file may be split into several files. Each file may be given a suggestive name. This allows quick identification of where a particular section or rule is found, and improves readability of the designed grammar. The %include-directive (see section 4.5.8) can be used to include a partial grammar specification file into another specification file.
A terminal symbol (also known as a token) represents a class of
syntactically equivalent symbols. Tokens are represented in bisonc++'s parser class
by a number, defined in an enum. The parser's lex
member function returns
a token value indicating what kind of token has been read. You don't need to
know what the code value is; instead, its symbol should always be used. By
convention, it contains uppercase characters.
Nonterminal symbols define concepts of grammars. The symbol name is used in writing grammar rules. By convention, it contains lowercase characters.
Symbol names consist of letters, digits (not at the beginning), and underscores. Bisonc++ does not support periods in symbol names (users familiar with Bison may observe that Bison does support periods in symbol names, but the Bison's user guide states that `periods make sense only in nonterminals'. Even so, it appears that periods in symbols are hardly ever used).
There are two ways terminal symbols can be referred to:
%token
. See section 4.5.29.
A character token (or literal character token) is written in the grammar using C++'s character constant syntax; for example, '+' is a character token. A character token doesn't need to be declared unless you need to specify its semantic value data type (cf. section 4.6), associativity, or precedence (cf. section 4.5.9).
By convention, a character token is only used to represent that particular character. Thus, the token '+' represents the character `+' as a token.
All common escape sequences that can be used in C++'s character constants can be used in bisonc++ as well. Be careful not to use the null character ('\0') as a character literal because its ASCII code, zero, is the code lex returns to indicate end-of-input (see section 5.3.1). If your program must be able to return 0-byte characters, define a special token (e.g., ZERO_BYTE) and return that token instead.
Note that literal string tokens, formally supported in Bison, are
not supported by bisonc++. Again, such tokens are hardly ever encountered, and
lexical scanner generators (like flex(1) and flexc++(1)) do not
support them. Common practice is to define a symbolic name for a literal
string token. So, a token like EQ
may be defined in the grammar file, with
the lexical scanner returning EQ
when it matches ==
.
The value returned by the parser's lex
member is always a terminal
token. The numeric code for a character token is simply the ASCII code of that
character, and lex
can simply return that character constant as a token
value. Each named token becomes a C++ enumeration value in the parser base
class header file, and lex
can return such enumeration identifiers as
well. When using an externally defined lexical scanner, that lexical scanner
should include the parser's base class header file, and it should return
either character constants or the token identifiers defined in that header
file. So, if (%token NUM) was defined in the parser class Parser
, then the
lexical scanner may return Parser::NUM
.
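The token conventions described above can be sketched with a small hand-written scanner. This is only an illustration: the `Tokens` enumeration, its starting value 257, and the `MiniScanner` class are hypothetical names and values, not the code bisonc++ generates. In a real project the token names come from the generated parser header.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical token enumeration. The values are assumptions, chosen above
// the range of plain character codes so they cannot clash with them.
enum Tokens
{
    NUM = 257,
    EQ,
    ZERO_BYTE
};

// Hypothetical scanner over an in-memory string.
class MiniScanner
{
    std::string d_input;
    std::size_t d_pos = 0;

  public:
    explicit MiniScanner(std::string input) : d_input(std::move(input)) {}

    // Returns the next token, or 0 at end-of-input.
    int lex()
    {
        while (d_pos < d_input.size() && d_input[d_pos] == ' ')
            ++d_pos;                            // skip blanks

        if (d_pos == d_input.size())
            return 0;                           // end-of-input

        char ch = d_input[d_pos];

        if (ch == '\0')                         // a 0-byte inside the input
        {
            ++d_pos;
            return ZERO_BYTE;                   // never return plain 0 here
        }

        if (std::isdigit(static_cast<unsigned char>(ch)))
        {
            while (d_pos < d_input.size() &&
                   std::isdigit(static_cast<unsigned char>(d_input[d_pos])))
                ++d_pos;
            return NUM;                         // named token
        }

        if (d_input.compare(d_pos, 2, "==") == 0)
        {
            d_pos += 2;
            return EQ;                          // symbolic name for "=="
        }

        ++d_pos;
        return static_cast<unsigned char>(ch);  // character token, e.g. '+'
    }
};
```

Calling `lex` repeatedly on the input `12 + 3 == 15` yields `NUM`, `'+'`, `NUM`, `EQ`, `NUM`, and finally 0 for end-of-input.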
The symbol `error' is a terminal symbol reserved by bisonc++ for error recovery purposes (see chapter 8). The error symbol should not be used for other purposes. In particular, the parser's member function lex should never return error. Several more identifiers should not be used as terminal symbols either. See section 4.5.29.1 for an overview.
    exp: exp '+' exp ;

is a recursive rule definition, stating that when two exp groupings, with a `+' token in between, have been recognized, these three elements themselves represent an exp grouping.
    { C++ statements }

Usually there is only one action block, defined at the end of a production rule. See section 4.6.2.
    result: first-production-rule | second-production-rule | ... ;

result: could also have been defined as:

    result: first-production-rule ;
    result: second-production-rule ;

However, this is a potentially dangerous practice, since one of the two result rules could also have used a misspelled rule name (maybe the second result should have been results). Therefore, bisonc++ generates a warning if the same nonterminal is used repeatedly when defining production rules.
exp nonterminals:

    expseq:
        expseq1
    |
        // empty
    ;

    expseq1:
        expseq1 ',' exp
    |
        exp
    ;

By convention, to improve the visibility of an empty production rule, it contains the comment `// empty'.
Empty production rules may contain action blocks. Statements in
such action blocks are executed when the parser has recognized the empty
production rule. In such cases the // empty
comment is omitted.
    expseq1: expseq1 ',' exp | exp ;

Since the recursive use of expseq1 is the leftmost symbol in the right-hand side, we call this left recursion. By contrast, here the same construct is defined using right recursion:
    expseq1: exp ',' expseq1 | exp ;

Any kind of sequence can be defined using either left recursion or right recursion, but whenever possible use left recursion, because it can parse a sequence of any number of elements given a limited stack space. Right recursion consumes stack space proportionally to the number of elements in the sequence, because all elements must be pushed on the stack before the rule can be applied even once. See chapter 7 for further explanation of this phenomenon.
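The difference in stack usage can be made concrete with a small simulation. The sketch below is not bisonc++'s parsing algorithm: it only tracks grammar symbols on a hypothetical stack (a real LALR(1) parser also stores states), and the names `maxDepthLeft` and `maxDepthRight` are made up for this illustration. Each exp is treated as an already recognized symbol.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Symbols on the simulated parser stack.
enum Symbol { EXP, COMMA, EXPSEQ1 };

// Left recursion:  expseq1: expseq1 ',' exp | exp ;
// Returns the largest stack size seen while parsing `count` elements.
std::size_t maxDepthLeft(std::size_t count)
{
    assert(count >= 1);
    std::vector<Symbol> stack;
    std::size_t maxDepth = 0;

    for (std::size_t idx = 0; idx != count; ++idx)
    {
        if (idx != 0)
            stack.push_back(COMMA);                 // shift ','
        stack.push_back(EXP);                       // shift exp
        maxDepth = std::max(maxDepth, stack.size());

        // Reduce at once: pop the rule's right-hand side (1 or 3 symbols)
        // and push the left-hand side nonterminal.
        stack.resize(stack.size() - (idx == 0 ? 1 : 3));
        stack.push_back(EXPSEQ1);
    }
    return maxDepth;
}

// Right recursion:  expseq1: exp ',' expseq1 | exp ;
std::size_t maxDepthRight(std::size_t count)
{
    assert(count >= 1);
    std::vector<Symbol> stack;
    std::size_t maxDepth = 0;

    // Nothing can be reduced until the final exp has been shifted.
    for (std::size_t idx = 0; idx != count; ++idx)
    {
        if (idx != 0)
            stack.push_back(COMMA);
        stack.push_back(EXP);
        maxDepth = std::max(maxDepth, stack.size());
    }

    stack.back() = EXPSEQ1;                         // reduce the final exp
    while (stack.size() > 1)                        // reduce exp ',' expseq1
    {
        stack.resize(stack.size() - 3);
        stack.push_back(EXPSEQ1);
    }
    return maxDepth;
}
```

In this model the left-recursive rule never needs more than three symbols on the stack, whereas the right-recursive rule needs 2n - 1 symbols for a sequence of n elements.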
Indirect or mutual recursion occurs when the result of the rule does not appear directly on its right hand side, but does appear in rules for other nonterminals which do appear on its right hand side. For example:
    expr: primary '+' primary | primary ;
    primary: constant | '(' expr ')' ;

defines two mutually-recursive nonterminals, since each refers to the other.
All token names (except for single-character literal tokens such as '+' and '*') must be declared. If data types must be associated with semantic values of symbols (see section 4.6) then those symbols must also be declared.
By default the first rule in the rule section specifies the grammar's start
symbol. The %start
directive can be used to specify another rule (see
section 3.1).
This section covers all of bisonc++'s declarations. Some have already been
mentioned, but several more are available. Some define how the grammar parses
its input (like %left, %right
); other directives define, e.g., the name of
the parsing function, or the name(s) of the files bisonc++ generates.
In particular readers familiar with Bison (or Bison++) should read this section thoroughly, since bisonc++'s directives are more extensive and different from the `declarations' offered by Bison, and the macros defined by Bison++.
Several directives expect file- or pathname arguments. File- or pathnames must be provided on the same line as the directive itself, starting at the first non-blank character after the directive. File- or pathnames may contain escape sequences (e.g., if you must: use `\ ' to include a blank in a filename), and continue until encountering the next blank character. Alternatively, file- or pathnames may be surrounded by double quotes ("...") or pointed brackets (<...>). Pointed brackets surrounding file- or pathnames merely function to delimit filenames. They do not refer to, e.g., C++'s include path. No escape sequences are required for blanks within delimited file- or pathnames.
Directives accepting a `filename' do not accept path names, i.e., they cannot
contain directory separators (/
); directives accepting a 'pathname' may
contain directory separators.
Sometimes directives have analogous command-line options. In those cases command-line options take priority over directives.
Some directives may generate errors. This happens when a directive
conflicts with the contents of a file bisonc++ cannot modify (e.g., a
parser class header file exists, but doesn't define a namespace, but a
%namespace
directive was provided).
To solve such errors the offending directive could be omitted, the existing file could be removed, or the existing file could be hand-edited according to the directive's specification.
pathname
Pathname
defines the path to the file preincluded in the parser's
base-class header. See the description of the
--baseclass-preinclude option for details about this
directive.
By default `pathname' is surrounded by double quotes; but the double
quotes can also explicitly be specified. When the argument is surrounded by
pointed brackets #include <header>
is used.
parser-class-name
By default, bisonc++ generates a parser-class using the class name
Parser
. The default can be changed using this directive which defines the
name of the C++ class that will be generated. It may be defined only once
and parser-class-name
must be a C++ identifier.
It is an error if this directive is used and an already existing parser-class
header file does not define class `className'
and/or if an already
existing implementation header file does not define members of the class
`className'
.
Provide parse
and its support functions with debugging code,
showing the actual parsing process on the standard output
stream.
When specified, debug output is shown by default, but its activity may be
controlled using the setDebug(bool on-off)
or setDebug(DebugMode_)
members. Note that no #ifdef DEBUG
macros are used anymore. Rerun bisonc++
without the --debug option to generate an equivalent parser not containing
the debugging code.
off|quiet|warn|std
By default, production rules without final action blocks are augmented by the
bison(1) parser generator with $$ = $1
action blocks: the semantic
value of the first component is returned as the rule's semantic value. Its
manual also states that for empty rules there is no meaningful default
action. However, it could be argued that empty production rules could return
default semantic values, resulting in every matched rule having a defined
semantic value.
When multiple semantic value types are used, the semantic value type returned
by a $$ = $1
action is not uniquely defined. For one rule $1
might be
an int
field in a union, for another rule it might be a std::string
*
. With polymorphic semantic values comparable situations are encountered.
By default, bisonc++ mimics bison's behavior in that it adds a $$ = $1
action
block to rules not having final action blocks, but not to empty production
rules. This default behavior can also explicitly be configured using the
default-actions std
option or directive.
Bisonc++ also supports alternate ways of handling rules not having final action
blocks. When off
is specified, bisonc++ does not add $$ = $1
action
blocks; when polymorphic semantic values are used, then specifying
warn
adds specialized action blocks, using the semantic types of
the first elements of the production rules, while issuing a warning;
quiet
adds these action blocks without issuing warnings.
When warn or quiet are specified the types of $$ and $1 must match. When bisonc++ detects type mismatches it issues errors.
When polymorphic semantic values are used then the default $$ = $1 action probably is less useful than when, e.g., plain %stype semantic values are used. After all, no semantic values are associated with $$. Furthermore, once the production rule has been recognized, the production rule is reduced to the rule's left-hand side non-terminal. Thus, $1 ceases to exist immediately following the $$ = $1 statement. Therefore, if default actions are used in combination with polymorphic semantic value types they are implemented using move-operations: $$ = std::move($1). However, in this situation default actions can frequently be suppressed, slightly improving the efficiency of the generated parser.
When a default action block can be added to a production rule and either warn or quiet was specified, bisonc++ compares the types associated with the rule's nonterminal and with the production rule's first element. The warn and quiet specifications make identical decisions about the action blocks to add, but warn also shows a warning message that the action block is added to the production rule.
STYPE_
then, if no type is associated
with the production rule's first element it is initialized to
STYPE_
; otherwise it is initialized with the first element's
semantic value.
The parser's state stack is dumped to the standard error stream when an
error is detected by the parse
member function. After calling
error
the stack is dumped from the top of the stack (highest
offset) down to its bottom (offset 0). Each stack element is prefixed by the
stack element's index.
number
Bisonc++ normally warns if there are any conflicts in the grammar (see section
7.1), but many real grammars have harmless shift/reduce
conflicts which are resolved in a predictable way and which would be
difficult to eliminate. It is desirable to suppress the warning about these
conflicts unless the number of conflicts changes. You can do this with the
%expect
declaration.
The argument number
is a decimal integer. The declaration says there
should be no warning if there are number
shift/reduce conflicts and no
reduce/reduce conflicts. The usual warning is given if there are either
more or fewer conflicts, or if there are any reduce/reduce
conflicts.
In general, using %expect
involves these steps:
%expect
. Use the `-V
' option to
get a verbose list of where the conflicts occur. Bisonc++ will also print the
number of conflicts.
%expect
declaration, copying the number of (shift-reduce)
conflicts printed by bisonc++.
When provided, the scanner matched text function is called as
d_scanner.YYText()
, and the scanner token function is called as
d_scanner.yylex()
. This directive is only interpreted if the %scanner
directive is also provided.
pathname
This directive is used to switch to pathname
while processing a grammar
specification. Unless pathname
defines an absolute file-path, pathname
is searched relative to the location of bisonc++'s main grammar specification
file (i.e., the grammar file that was specified as bisonc++'s command-line
option). This directive can be used to split long grammar specification files
in shorter, meaningful units. After processing pathname
processing
continues beyond the %include pathname
directive.
Bisonc++'s main grammar specification file could simply be:
    %include spec/declarations
    %%
    %include spec/rules

where spec/declarations contains declarations and spec/rules contains the rules. Each of the files included using %include may itself use %include directives (which are then processed relative to their locations). The default nesting limit for %include directives is 10, but the option --max-inclusion-depth can be used to change this default.
%include
directives should be specified on their own lines, not containing
any other information.
%left [ <type> ] terminal(s)
%nonassoc [ <type> ] terminal(s)
%right [ <type> ] terminal(s)
These directives are called precedence directives (see also section 4.5.9 for general information on operator precedence).
The %left
, %right
, and %nonassoc
directives are used to
declare tokens and to specify their precedence and associativity, all at once.
op
determines how repeated
uses of the operator nest: whether `x op y op z
' is parsed by combining
x
with y
first or by combining y
with z
first. %left
specifies left-associativity (combining x
with y
first) and
%right
specifies right-associativity (combining y
with z
first). %nonassoc
specifies no associativity, which means that `x
op y op z
' is not a defined operation, and could be considered an error.
<type>
specification is optional, and specifies the type of the
semantic value when a token specified to the right of a <type>
specification is received. The pointed arrows are part of the type
specification; the type itself must be a field of a %union
(see section 4.5.33) or it must be a polymorphic tag (see section
4.6.1).
When multiple tokens are listed they must be separated by whitespace or by
commas. Note that the precedence directives also serve to define token names:
symbolic tokens mentioned with these directives should not be defined using
%token
directives.
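The effect of associativity on `x op y op z' can be illustrated outside the parser by combining the same operands in both orders. The sketch below uses subtraction as the operator; the function names `subtractLeft` and `subtractRight` are hypothetical and merely mimic the grouping that %left and %right would impose.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Combines operands the way a %left operator groups: ((a - b) - c) ...
int subtractLeft(std::vector<int> const &operands)
{
    assert(!operands.empty());
    int result = operands.front();
    for (std::size_t idx = 1; idx != operands.size(); ++idx)
        result -= operands[idx];
    return result;
}

// Combines operands the way a %right operator groups: a - (b - (c ...))
int subtractRight(std::vector<int> const &operands)
{
    assert(!operands.empty());
    int result = operands.back();
    // Walk from the one-but-last operand down to the first.
    for (std::size_t idx = operands.size() - 1; idx-- != 0; )
        result = operands[idx] - result;
    return result;
}
```

For the operands 8, 3 and 2 the left-associative grouping computes (8 - 3) - 2 = 3, while the right-associative grouping computes 8 - (3 - 2) = 7.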
struct-definition
Defines the organization of the location-struct data type LTYPE_. This struct should be specified analogously to the way the parser's stacktype is defined using %union (see below). The location struct type is named LTYPE_. If neither locationstruct nor LTYPE_ is specified, the default LTYPE_ struct is used.
When this directive is specified the standard location stack is added to the generated parser class. The standard location type (defined in the parser's base class) is equal to the following struct:
    struct LTYPE_
    {
        int timestamp;
        int first_line;
        int first_column;
        int last_line;
        int last_column;
        char *text;
    };

Note that defining this struct type does not imply that its fields are also assigned. Some form of communication with the lexical scanner is probably required to initialize the fields of this struct properly.
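One way such communication might look is a helper that the scanner (or the parser's lex member) calls for every matched text. The following is a sketch under assumptions: `Location` mirrors the integer fields of the struct shown above (the text field is left out), `updateLocation` is a hypothetical helper, and using timestamp as a simple counter is a choice made for this example only.

```cpp
#include <string>

// Mirrors the integer fields of the default location struct shown above.
struct Location
{
    int timestamp;
    int first_line;
    int first_column;
    int last_line;
    int last_column;
};

// Hypothetical helper: moves the location past the text that was just
// matched. The new range starts where the previous one ended.
void updateLocation(Location &loc, std::string const &matched)
{
    loc.first_line = loc.last_line;
    loc.first_column = loc.last_column;

    for (char ch : matched)
    {
        if (ch == '\n')
        {
            ++loc.last_line;            // next line, back to column 0
            loc.last_column = 0;
        }
        else
            ++loc.last_column;
    }
    ++loc.timestamp;                    // assumption: counts updates
}
```

How the updated value reaches the parser's location stack depends on the way the scanner and the parser are connected, which this sketch leaves open.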
Specifies a user-defined token location type. If %ltype is used,
typename
should be the name of an alternate default-constructible type
(e.g., size_t). It should not be used together with a
%locationstruct specification. From inside the parser class,
this type may be referred to as LTYPE_.
Any text following %ltype
up to the end of the line, up to the first
of a series of trailing blanks or tabs or up to a comment-token (//
or
/*
) becomes part of the type definition. Be sure not to end an
%ltype
definition in a semicolon.
namespace
Defines all of the code generated by bisonc++ in the namespace
namespace
. By default no namespace is defined.
If this option is used the implementation header will contain a commented out using namespace directive for the requested namespace.
In addition, the parser and parser base class header files also use the specified namespace to define their include guard directives.
It is an error to use this directive while an already existing parser-class
header file and/or implementation header file does not specify namespace
identifier
.
Accept (do not generate warnings) zero- or negative dollar-indices in the
grammar's action blocks. Zero or negative dollar-indices are commonly used to
implement inherited attributes and should normally be avoided. When used they
can be specified like $-1
, or like $<type>-1
, where type
is an
STYPE_
tag; or a %union
field-name. See also sections 4.6.2,
5.6, and 4.6.1.
Do not put #line preprocessor directives in the file containing the
parser's parse
function. By default #line
preprocessor directives
are inserted just before action blocks in the generated parse.cc
file.
The #line
directives allow compilers and debuggers to associate errors
with lines in your grammar specification file, rather than with the source
file containing the parse
function itself.
polymorphic-specification(s)
The %polymorphic
directive is used to define a polymorphic semantic value
class, offering a (preferred) alternative to (traditional) union
types.
Refer to section 4.6.1 for a detailed description of the specification, characteristics, and use of polymorphic semantic values.
As a quick reference: to define multiple semantic values using a polymorphic
semantic value class offering either an int
, a std::string
or a
std::vector<double>
specify:
    %polymorphic INT: int; STRING: std::string; VECT: std::vector<double>

and use %type specifications (cf. section 4.5.32) to associate (non-)terminals with specific semantic value types.
%stype, %union
and %polymorphic
are mutually exclusive: only one
of these directives can be used.
token
The construction %prec token
may be used in production rules to
overrule the actual precedence of an operator in a particular production
rule. Well known is the construction
    expression: '-' expression %prec UMINUS { ... }

Here, the default precedence and associativity of the `-' token as the subtraction operator are overruled by the precedence and associativity of the UMINUS token, which is frequently defined as:

    %right UMINUS

E.g., a list of arithmetic operators could consist of:

    %left '+' '-'
    %left '*' '/' '%'
    %right UMINUS

giving * / and % a higher precedence than + and -, ensuring at the same time that UMINUS is given both the highest precedence and right-associativity.

In the above production rule the operator order would result in the construction

    '-' expression

being evaluated from right to left, having a higher precedence than either the multiplication or the addition operators.
The %prec
directive associates priorities to rules. These priorities are
interpreted whenever there are (shift-reduce) conflicts. If there are no
conflicts, priorities are not required, and are ignored.
When the parser analyzes the above grammar, a conflict is encountered. Consider the following simple grammar, in which only the minus ('-') operator is used, albeit in binary and unary form:

    %token NR
    %left '-'
    %right UNARY

    expr: NR | expr '-' expr | '-' expr %prec UNARY ;

When analyzing this grammar, bisonc++ defines states (cf. chapter 7) defining what to do when encountering certain input tokens. Each possibility is defined by an item, in which a dot (.) is commonly used to show to which point a production rule has been recognized. For the above grammar one such state looks like this:

    0: expr -> '-' expr .    { '-' <EOF> }
    1: expr -> expr . '-' expr

The elements between braces define the look-ahead tokens: the tokens that may appear next for reducible rules. Item 0 is such a reducible rule, and it is used to reduce '-' expr to an expression (expr).
The second item shows the production rule defining the binary minus operator. Its left-hand side expression has been recognized, and the parser expects to see a minus operator next.
The conflict is caused by the expected minus operator in item 1, and a minus operator that may follow item 0. As there is only one look-ahead symbol (since bisonc++ can only handle LALR(1) grammars) the grammar contains a shift-reduce conflict: shift, and continue with item 1; or reduce, and continue with item 0. In this case, %prec solves the issue, by giving item 0 a higher precedence than item 1 (whose precedence is equal to the precedence of its first terminal token, which is the binary minus operator's precedence).
Although never encountered in real life, it's also possible to give the unary minus operator a lower priority than the binary minus operator. The grammar, in this case, looks like this:
    %token NR
    %right UNARY
    %left '-'

    expr: NR | expr '-' expr | '-' expr %prec UNARY ;
With this grammar we encounter a state with these two items:
    0: expr -> '-' expr .    { <EOF> }
    1: expr -> expr . '-' expr

Here, the conflict no longer manifests itself, as the minus operator no longer appears in item 0's look-ahead set. The resulting parser will, when encountering a minus, shift the minus, and proceed according to item 1, and when anything else is encountered reduce the '-' expr production to expr. In real life this means that an expression like -4 - 3 evaluates to -1.
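The two outcomes for -4 - 3 can be reproduced with a tiny hand-written evaluator. Note that this sketch uses precedence climbing, a different technique from the LALR(1) tables bisonc++ generates; it only serves to show how the relative precedence of the unary minus changes the result. The class `MinusEvaluator`, its members, and the numeric precedence levels are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Precedence of the binary minus; the unary precedence is configurable.
constexpr int binaryPrec = 1;

// Evaluates expressions consisting of non-negative integers, blanks, and
// unary or binary minus operators.
class MinusEvaluator
{
    std::string d_text;
    std::size_t d_pos = 0;
    int d_unaryPrec;

  public:
    MinusEvaluator(std::string text, int unaryPrec)
    :
        d_text(std::move(text)),
        d_unaryPrec(unaryPrec)
    {}

    int evaluate()
    {
        d_pos = 0;
        return expr(0);
    }

  private:
    void skipBlanks()
    {
        while (d_pos < d_text.size() && d_text[d_pos] == ' ')
            ++d_pos;
    }

    bool atMinus()
    {
        skipBlanks();
        return d_pos < d_text.size() && d_text[d_pos] == '-';
    }

    bool atDigit() const
    {
        return d_pos < d_text.size() &&
               std::isdigit(static_cast<unsigned char>(d_text[d_pos]));
    }

    int number()
    {
        skipBlanks();
        assert(atDigit());              // well-formed input is assumed
        int value = 0;
        while (atDigit())
            value = value * 10 + (d_text[d_pos++] - '0');
        return value;
    }

    // Parses an expression whose binary operators bind at least as
    // tightly as minPrec.
    int expr(int minPrec)
    {
        int lhs;
        if (atMinus())                  // unary minus
        {
            ++d_pos;
            lhs = -expr(d_unaryPrec);
        }
        else
            lhs = number();

        while (atMinus() && binaryPrec >= minPrec)
        {
            ++d_pos;                    // binary minus, left-associative
            lhs -= expr(binaryPrec + 1);
        }
        return lhs;
    }
};
```

With a unary precedence above the binary one, `MinusEvaluator("-4 - 3", 2).evaluate()` computes (-4) - 3 = -7. With a unary precedence below the binary one, `MinusEvaluator("-4 - 3", 0).evaluate()` computes -(4 - 3) = -1, matching the result described above.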
To illustrate a situation where %prec
won't work consider this
grammar:
    %token NR
    %left '-'
    %right POSTFIX

    expr: NR | expr '-' expr | expr '-' %prec POSTFIX ;

When this grammar is analyzed the following state is encountered:

    0: expr -> expr '-' . expr
    1: expr -> expr '-' .    { '-' <EOF> }
    2: expr -> . NR
    3: expr -> . expr '-' expr
    4: expr -> . expr '-'

To appreciate why %prec doesn't work here, consider the various look-ahead tokens. For items 0, 3, and 4 the look-ahead token is the non-terminal expr; for item 2 the look-ahead token is the terminal NR, and for item 1, handling the postfix minus operator, it is a minus character. Thus, there isn't any conflict between the shiftable items and the reducible item 1, and consequently the %prec specification isn't used. Any attempt to define a grammar handling a postfix minus operator will fail. A common solution consists of defining a separate operator, explicitly giving it its appropriate priority and associativity. E.g.,

    %token NR
    %left '-'
    %right '_'      // underscore

    expr:
        NR
    |
        expr '-' expr
    |
        expr '_'    // underscore as postfix minus
    ;
The %print-tokens
directive provides an implementation of the Parser
class's print_
function displaying the current token value and the text
matched by the lexical scanner as received by the generated parse
function.
The print_
function is also implemented if the --print
command-line option is provided.
When adding debugging code (using the debug
option or directive) debug
information is displayed continuously while the parser processes its
input. When using the prompt
directive the generated parser displays a
prompt (a question mark) at each step of the parsing process.
Caveat: when using this option the parser's input cannot be provided at the parser's standard input stream.
Instead of using the %prompt
directive the --prompt
option can also be
used.
ntokens
Whenever a syntactic error is detected during the parsing process subsequent tokens received by the parsing function may easily cause yet another (spurious) syntactic error. In this situation error recovery in fact produces an avalanche of additional errors. If this happens the recovery process may benefit from a slight modification. Rather than reporting every syntactic error encountered by the parsing function, the parsing function may wait for a series of successfully processed tokens before reporting the next error.
The directive %required-tokens can be used to specify this number. E.g., the specification %required-tokens 10 requires the parsing function to successfully process a series of 10 tokens before another syntactic error is reported (and counted). If a syntactic error is encountered before processing 10 tokens then the counter counting the number of successfully processed tokens is reset to zero, no error is reported, but the error recovery procedure continues as usual. The number of required tokens can also be set using the option --required-tokens. By default the number of required tokens is initialized to 0.
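The counting rule described above can be modeled in a few lines. This is a sketch, not the code bisonc++ generates: the class `ErrorThrottle` and its members are hypothetical, and two details are assumptions that the text does not spell out, namely that the very first error is always reported and that the counter also restarts after a reported error.

```cpp
#include <cstddef>

// Models when a syntactic error is reported, given a required number of
// successfully processed tokens between reported errors.
class ErrorThrottle
{
    std::size_t d_required;
    std::size_t d_accepted;
    std::size_t d_reported = 0;

  public:
    explicit ErrorThrottle(std::size_t required)
    :
        d_required(required),
        d_accepted(required)        // assumption: the first error is reported
    {}

    // Called for each successfully processed token.
    void tokenAccepted()
    {
        ++d_accepted;
    }

    // Called for each syntactic error. Returns true if the error is
    // reported (and counted), false if it is silently skipped.
    bool syntaxError()
    {
        bool report = d_accepted >= d_required;
        d_accepted = 0;             // assumption: restart in either case
        if (report)
            ++d_reported;
        return report;
    }

    std::size_t reported() const
    {
        return d_reported;
    }
};
```

With `ErrorThrottle throttle(3)`, an error directly following another error (or following only one or two accepted tokens) is skipped, whereas an error following three or more accepted tokens is reported. With the default of 0 required tokens every error is reported.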
header
Use header
as the pathname of a file to include in the parser's class
header file. See the description of the --scanner option for
details about this directive. When this directive is used a Scanner
d_scanner
data member is automatically included in the generated parser,
while the predefined int lex() member is simply returning
d_scanner.lex()
's return value. When, in addition to the %scanner
directive the %flex
directive was also specified then the function
d_scanner.YYText()
is called.
Unless double quotes or angular brackets were explicitly used, the specified header file will be surrounded by double quotes.
It is an error to use this directive in combination with an already existing
parser-class header not including `header
'.
function-call
The %scanner-matched-text-function
directive defines the scanner function
that must be called to obtain the text matching the most recently returned
token. By default this is d_scanner.matched()
.
A complete function call expression should be provided (including a scanner
object, if applicable). This option overrules the d_scanner.matched()
call
used by default when the %scanner
directive is specified. Example:
    %scanner-matched-text-function myScanner.matchedText()

If the function call expression contains white space then the function-call specification must be surrounded by double quotes (").
Note that an expression is expected, not an expression statement: do not include a final semicolon in the specification.
function-call
This directive is used to specify how to call the scanner function returning
the next token from the parser's lex
function. A complete function call
expression should be provided (including a scanner object, if
applicable). Example:
    %scanner-token-function d_scanner.lex()

If the function call contains white space then the function call specification must be surrounded by double quotes.
It is an error to use this directive in combination with an already existing internal header file (.ih file) in which the specified function is not called.
Note that an expression is expected, not an expression statement: do not include a final semicolon in the specification.
size
Defines the number of elements to be added to the generated parser's semantic
value stack when it must be enlarged. By default 10 elements are added to the
stack. This option/directive is interpreted only once, and only if size
at
least equals the default stack expansion size of 10.
nonterminal symbol
By default bisonc++ uses the nonterminal that is defined by the first rule in a grammar specification file as the start symbol. I.e., the parser tries to recognize that nonterminal when parsing input.
This default behavior may be modified using the %start
directive.
The nonterminal symbol specifies a nonterminal that may be defined
anywhere in the rules section of the grammar specification file. This
nonterminal then
becomes the grammar's start symbol.
This directive defines the type of the semantic value of tokens. The specified type must be a default constructible type, like size_t or std::string. By default, bisonc++ uses int for the semantic value type of its parser's tokens. To use another single semantic value type, this directive must be used.
In programs using a simple grammar it may be sufficient to use the same data type for the semantic values of all language constructs (see, e.g., sections 6.1 and 6.2).
Any text following %stype
up to the end of the line, up to the first
of a series of trailing blanks or tabs or up to a comment-token (//
or
/*
) becomes part of the type definition. Be sure not to end a
%stype
definition in a semicolon.
%stype, %union
and %polymorphic
are mutually exclusive: only one
of these directives can be used.
Sources including the generated parser class header file should refer to the semantic value typename as STYPE_.
on|off
This directive is only interpreted when polymorphic semantic values are
used. When on
is specified (which is used by default) the parse
member
of the generated parser dynamically checks that the tag used when calling a
semantic value's get
member matches the actual tag of the semantic value.
If a mismatch is observed, then the parsing function aborts after displaying a
fatal error message. If this happens, and if the option/directive debug
was specified when bisonc++ created the parser's parsing function, then the
program can be rerun, specifying parser.setDebug(Parser::ACTIONCASES)
before calling the parsing function. As a result the case-entry numbers of the
switch
, defined in the parser's executeAction
member, are inserted
into the standard output stream. The action case number reported just before
the program displays the fatal error message tells you in which of the
grammar's action blocks the error was encountered.
Only used with polymorphic semantic values, and then only required when the
same parser is used in multiple threads: it ensures that each thread's
polymorphic code only accesses the errors counter (i.e., d_nErrors_
)
of its own parser.
Instead of using the %thread-safe
directive the --thread-safe
option
can also be used.
%token terminal token(s)
%token [ <type> ] terminal token(s)
The %token directive is used to define one or more symbolic terminal tokens. When multiple tokens are listed they must be separated by whitespace or by commas.
The <type>
specification is optional, and specifies the type of the
semantic value when one of the subsequently named tokens is
received. The angle brackets are part of the type specification; the type
itself must be a field of a %union
specification (see section 4.5.33)
or a tag defined at the %polymorphic
directive (see section
4.6.1).
bisonc++ traditionally converted symbolic tokens (including those defined
by the precedence directives (cf. section 4.5.9)) into
Parser::Tokens_
enumeration values (see section 4.5.2),
allowing the lexical scanner to return named tokens as
Parser::name
. Although this approach is still available, it is deprecated
as of bisonc++'s version 6.04.00. Starting with that version the token-path
option or directive should be used, by default defining the symbolic tokens in
the class Tokens
, made available in a separate file which can be included
by any class needing access to the grammar's symbolic tokens (cf. section
4.5.35.7).
In particular,
_
).
ABORT, ACCEPT, ERROR, clearin, debug, error, setDebug
Except for error, which is a predefined terminal token, these
identifiers are traditionally used names of functions in the parser class
defined by bisonc++.
classname
This directive is only processed when the token-path
directive or option
has also been specified.
Classname
defines the name of the Tokens
class that is defined when
the token-path
option (see below) is specified. By default the class name
Tokens
is used. Assuming the default, classes that need access to the tokens defined by bisonc++ may derive from Tokens (in which case a token like IDENTIFIER can be used directly), or tokens must be provided with their class context (e.g., Tokens::IDENTIFIER). The enumeration Tokens_ in the Tokens class has public access rights.
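The two ways of reaching the tokens can be sketched as follows. This is a hand-written stand-in, not the file bisonc++ writes: the token names, their numeric values, and the `Scanner` and `isNumber` helpers are assumptions made for the illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for the generated Tokens class.
// Token names and numeric values are assumptions of this sketch.
struct Tokens
{
    enum Tokens_
    {
        IDENTIFIER = 257,
        NUMBER,
    };
};

// Option 1: derive from Tokens, so token names can be used unqualified.
class Scanner: public Tokens
{
public:
    int lex() const
    {
        return IDENTIFIER;          // no class context needed
    }
};

// Option 2: unrelated code provides the class context explicitly.
inline bool isNumber(int token)
{
    return token == Tokens::NUMBER;
}
```

Deriving keeps scanner code short, while the qualified form avoids adding a base class to types that only occasionally inspect a token.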
namespace
This directive is only processed when the token-path
directive or option
has also been specified.
Namespace
defines the namespace of the class containing the Tokens_
enumeration. By default no namespace is used. The namespace defined here can
independently be specified from the Parser's namespace (specified by the
namespace
directive or option).
<type> symbol-list
To associate (non-)terminals with specific semantic value types the %type directive is used.
When %polymorphic
is used to specify multiple semantic value types,
(non-)terminals can be associated with one of the semantic value types
specified with the %polymorphic
directive.
When %union
is used to specify multiple semantic value types,
(non-)terminals can be associated with one of the union
fields specified
with the %union
directive.
With this directive, symbol-list defines one or more blank or comma delimited grammatical symbols (i.e., terminal and/or nonterminal symbols); type is either a polymorphic type-identifier or a field name defined in the %union specification. The specified nonterminal(s) are automatically associated with the indicated semantic type. The angle brackets are part of the type specification.
When the semantic value type of a terminal symbol is defined the
lexical scanner rather than the parser's actions must assign the
appropriate semantic value to d_val_ just prior to returning the
token. To associate terminal symbols with semantic values, terminal symbols
can also be specified in a %type
directive.
union-definition body
In the grammars of many programs different types of data are used for different
terminal and nonterminal tokens. For example, a numeric constant may
need type int
or double
, while a string needs type std::string
,
and an identifier might need a pointer to an entry in a symbol table.
Traditionally, the %union
directive has always been used to accomplish
this. The directive defines a C union-type whose fields specify one or
more data types for semantic values. The directive %union
is followed by a
pair of braces containing one or more field definitions. For example:
    %union
    {
        double u_val;
        symrec *u_tptr;
    };

In this example the two fields represent a
double
and a symrec
*
. The associated field names are u_val
and u_tptr
, which are used in
the %token
and %type
directives to specify types that are associated
with terminal or nonterminal symbols (see section 4.5.32).
Notes:
union
definitions; they can also be used when defining bisonc++'s %union
directives. When a class type variant is required, all required
constructors, the destructor and other members (like overloaded
assignment operators) must be able to handle the actual class type
data fields properly. A discussion of how to use unrestricted unions
is beyond this manual's scope, but can be found, e.g., in the C++
Annotations. See also section
4.6.
%union
directive is also a bit of an anachronism. In many
situations using %polymorphic
is more attractive than using
%union
(cf. section 4.6.1).
Although the %union
directive is still supported by bisonc++, its use is
largely superseded by the newer %polymorphic
directive, allowing bisonc++ and
the C++ compiler to verify that the correct types are used when semantic
values are assigned or retrieved, which, in turn, helps prevent run-time
errors.
%stype, %union
and %polymorphic
are mutually exclusive: only one
of these directives can be used.
By default, the %polymorphic
directive declares a strongly typed enum:
enum class Tag_
, and code generated by bisonc++ always uses the Tag_
scope when referring to tag identifiers. It is often possible (by
pre-associating tokens with tags, using %type
directives) to avoid
the use of tags in user-code.
If tags are explicitly used, then they must be prefixed with the Tag_
scope. Before the arrival of the C++-11 standard strongly typed enumerations
didn't exist, and explicit enum-type scope prefixes were usually omitted.
The %weak-tags
directive can be specified when the Tag_
enum should
not be declared as a strongly typed enum. This directive should not be
used, unless you know what you're doing.
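The practical difference between the two kinds of enumeration is plain C++ and can be shown in isolation. In the sketch below the namespaces `strong` and `weak` and the helper functions are hypothetical; they only serve to put both declarations of `Tag_` side by side.

```cpp
#include <cassert>

namespace strong        // models the default: enum class Tag_
{
    enum class Tag_ { INT, TEXT };

    // Tags must be prefixed with the Tag_ scope.
    inline bool isText(Tag_ tag) { return tag == Tag_::TEXT; }

    // Conversion to int requires an explicit cast.
    inline int asInt(Tag_ tag) { return static_cast<int>(tag); }
}

namespace weak          // models the declaration used with %weak-tags
{
    enum Tag_ { INT, TEXT };

    // Tags are visible in the enclosing scope: no prefix needed.
    inline bool isText(Tag_ tag) { return tag == TEXT; }

    // An unscoped enumerator converts to int implicitly.
    inline int asInt(Tag_ tag) { return tag; }
}
```

The implicit conversions and the unscoped names of the weak form are what make accidental mix-ups possible, which is presumably why the strongly typed form is the default.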
<CLASS>
should be
interpreted as the name of the parser's class, Parser
by default, but
configurable using %class-name
(see section 4.5.2).
<Class>base.h
, configurable using
%baseclass-header
(see section 4.5.35.1) or %filenames
(see section 4.5.35.3);
<Class>.h
, configurable using
%class-header
(see section 4.5.35.2) or %filenames
(see section 4.5.35.3);
<Class>.ih
, configurable
using %implementation-header
(see section 4.5.35.4) or
%filenames
(see section 4.5.35.3);
filename
Filename
defines the name of the file to contain the parser's base
class. This class defines, e.g., the parser's symbolic tokens. Defaults to the
name of the parser class plus the suffix base.h
. It is always generated,
unless (re)writing is suppressed by the --no-baseclass-header
and
--dont-rewrite-baseclass-header
options. This directive is overruled by
the --baseclass-header (-b) command-line option.
It is an error to use this directive while an already existing parser class
header file does not contain #include "filename"
.
filename
Filename
defines the name of the file to contain the parser
class. Defaults to the name of the parser class plus the suffix .h. This
directive is overruled by the --class-header (-c) command-line option.
It is an error to use this directive while an already existing parser-class
header file does not define class `className'
and/or if an already
existing implementation header file does not define members of the class
`className'
.
filename
Filename
is a generic filename, used for all header files generated
by bisonc++.
filename
Filename
defines the name of the file to contain the implementation
header. It defaults to the name of the generated parser class plus the suffix
.ih
.
The implementation header should contain all directives and declarations
only used by the implementations of the parser's member functions. It is
the only header file that is included by the source file containing
parse
's implementation. User-defined implementations of other class members
may use the same convention, thus concentrating all directives and
declarations that are required for the compilation of other source files
belonging to the parser class in one header file.
filename
Filename
defines the name of the source file to contain the parser member
function parse
. Defaults to parse.cc
.
pathname
Pathname
defines the directory where generated files should be written.
By default this is the directory where bisonc++ is called.
pathname
By default the symbolic tokens defined in the grammar specification are
collected in the enumeration Tokens_
which is defined in the generated
parser's base class. This is a suboptimal procedure in cases where these
tokens are also used by other classes. In those cases the tokens should be
collected in an escalated class which can be accessed by classes that are
independent of the generated Parser
class.
The directive %token-path
is used for this: it expects a pathname
path
specification of the file to contain the struct Tokens
defining the
enumeration Tokens_
containing the symbolic tokens of the generated
grammar. If this option is specified the ParserBase
class is derived from
it, thus making the tokens available to the generated parser class.
The name of the struct Tokens
can be altered using the token-class
option. By default (if token-path
is not specified) the tokens are defined
as the enum Tokens_
in the ParserBase
class. The filenames
and
target-directory
directive specifications are ignored by token-path
.
For example, the calculator performs real-life calculations because the value associated with each expression is its computed numeric value; it correctly performs addition because the action for an expression like `x + y' is to add those numbers and to return their sum.
Two ways of defining semantics have already been discussed:
A third way of defining semantic values is discussed next (cf. section 4.6.1), followed by the shorthand notations that can be used in action blocks (cf. section ACTIONS). Finally, although action blocks usually appear at the end of production rules, they can in fact be defined anywhere in production rules. Refer to this section's final subsection (section 4.6.2.4) for the characteristics of such mid-rule action blocks.
In this section a simple example program is developed illustrating the use of polymorphic semantic values. The sources of the example can be retrieved from the distribution's poly directory.
One may wonder why a union
is still used by bisonc++ as C++ offers
inherently superior approaches to combine multiple types into one union
type. The C++ way to do so is by defining a polymorphic base class and a
series of derived classes implementing the various exclusive data types. The
union
approach is still supported by bisonc++, mainly for historic reasons as
it is supported by bison(1) and bison++; dropping the union
would
needlessly impede backward compatibility.
The preferred alternative to a union
, however, is a polymorphic base
class. The example program (cf. poly) uses a polymorphic semantic value
type, supporting either int
or std::string
semantic values. These
types are associated with tags (resp. INT
and TEXT
) using the
%polymorphic
directive, discussed next.
The %polymorphic
directive results in a parser using
polymorphic semantic values. Polymorphic semantic value specifications
consist of a tag, which is a C++ identifier, and a C++ type
definition.
Tags and type definitions are separated by colons, and multiple semantic value specifications are separated by semicolons. The semicolon trailing the final semantic value specification is optional.
The %polymorphic
directive may be specified only once, and the
%polymorphic, %stype
and %union
directives are mutually exclusive.
Here is an example, defining three semantic value types: int
,
std::string
and std::vector<double>
:
    %polymorphic INT: int; STRING: std::string; VECT: std::vector<double>

The identifier to the left of the colon is called the type-tag (or simply `tag'), and the type definition to the right of the colon is called the type-definition. Types specified at the
%polymorphic
type-definitions must be built-in types or class-type declarations. Since
bisonc++ version 4.12.00 the types no longer have to be default-constructible.
As the parser's generic semantic value type is called STYPE_
, and as
functions called by the parser may return STYPE_
values and may expect
STYPE_
arguments, grammar symbols can also be associated with the generic
STYPE_
semantic type using %type <STYPE_>
directives.
To prevent ambiguities the generic STYPE_
type cannot be specified as a
polymorphic type. E.g., a specification like GENERIC: STYPE_
cannot be
used when defining the tag/type pairs at the %polymorphic
directive.
When polymorphic type-names refer to types that have not yet been declared
by the parser's base class header, then these types must be declared in a
separate header file, included into the parser's base class header file
through the %baseclass-preinclude
directive.
parserbase.h
; an occasional implementation
is added to the parse.cc
source file.
To minimize namespace pollution most of the extra code is contained in a
namespace of its own: Meta_
. If the %namespace
directive was used then
Meta_
is nested under the namespace declared by that directive. The name
Meta_
is used because much of the code implementing polymorphic semantic
values uses template meta programming.
The enumeration 'enum class Tag_'
One notable exception to the above is the enumeration Tag_
. To simplify
its use it is declared outside of Meta_
(but inside the %namespace
namespace, if specified). Its tags are declared at the %polymorphic
directive. Tag_
is a strongly typed enumeration. The %weak-tags
directive can be used to declare a pre C++-11 standard `enum Tag_
'.
The namespace Meta_
Below, DataType
refers to the semantic value's data type that is
associated with a Tag_
identifier.
Several classes are defined in the namespace Meta_
. The most important
class is SType
, providing the interface to the semantic value
types. The class SType
becomes the parser's STYPE_
type. Each
SType
object is either a default SType
object, not containing a
semantic value, or it contains a semantic value of a single DataType
. All
operations related to semantic values are handled by this class.
The class SType
provides the following public interface:
SType
objects that were constructed by SType
's default
constructors, but they can accept values of defined polymorphic types,
which may then be retrieved from those objects.
In addition the members
SType &operator=(Type const &value) and SType &operator=(Type &&tmp) are defined for each of the polymorphic semantic value types. Up to version 6.03.00 these members were defined as member templates, but sometimes awkward compilation errors were encountered, as with member templates
Type
must exactly match one of the defined polymorphic
semantic types since Type
is used to determine the appropriate
Meta_::Tag_
value. As a consequence, if, e.g., a polymorphic type
%polymorphic INT: int
is defined then an assignment like $$ =
true
fails, since the inferred type is bool
and no matching
polymorphic type is available. Now that the assignment operators are
defined as plain member functions this problem isn't encountered
anymore because standard type conversions may then be applied by the
compiler. Note that ambiguities may still be encountered. If, e.g.,
polymorphic types are defined for int
and char
and an
expression like $$ = 30U
is used the compiler cannot tell whether
$$
refers to the int
or to the char
semantic value. A
standard (static) cast, or explicitly calling the assign
member
(see the next item) solves this kind of ambiguity.
When operator=(Type const &value)
is used, the left-hand side
SType
object receives a copy of value
; when operator=(Type
&&tmp)
is used, tmp
is move-assigned to the left-hand side
SType
object;
void assign<tag>(Args &&...args)
The tag
template argument must be a Tag_
value. This member
function constructs a semantic value of the type matching tag
from
the arguments that are passed to this member (zero arguments are OK if
the type associated with tag
supports default construction). The
constructed value (not a copy of this value) is then stored in the
STYPE_
object for which assign
has been called.
As a Meta_::Tag_
value must be specified when using assign
the compiler can use the explicit tag to convert assign's
arguments
to an SType
object of the type matching the specified tag.
The member assign
can be used to store a specific polymorphic
semantic value in an STYPE_
object. It differs from the set of
operator=(Type)
members in that assign
accepts multiple
arguments to construct the requested SType
value from, whereas the
operator=
members only accept single arguments of defined
polymorphic types.
To initialize an STYPE_
object with a default STYPE_
value,
direct assignment can be used (e.g., d_lval_ = STYPE_{}
). To
assign a semantic value to a production rule using assign
the
_$$
notation must be used, as $$
is interpreted as the
polymorphic value type that is associated with the production
rule:
_$$.assign<Tag_::CHAR>(30U);
DataType &get<tag>()
, and DataType const &get<tag>() const
These members return references to the object's semantic values. The
tag
must be a Tag_
value: its specification tells the
compiler which semantic value type it must use.
When the option/directive tag-mismatches on
was specified then
get
, when called from the generated parse
function, performs a
run-time check to confirm that the specified tag corresponds to the
object's actual Tag_
value. If a mismatch is observed, then the
parsing function aborts with a fatal error message. When
shorthand notations (like $$
and $1
) are used in production
rules' action blocks, then bisonc++ can determine the correct tag
,
preventing the run-time check from failing.
But once a fatal error is encountered, it can be difficult to
determine which action block generated the error. If this happens,
then consider regenerating the parser specifying the --debug
option, calling
parser.setDebug(Parser::ACTIONCASES) before calling the parser's
parse
function.
Following this the case-entry numbers of the switch
which is
defined in the parser's executeAction
member are inserted into the
standard output stream just before the matching statements are
executed. The action case number that's reported just before the
program reports the fatal error tells you in which of the grammar's
action blocks the error was encountered.
Tag_ tag() const
The tag matching the semantic value's polymorphic type is returned. The
returned value is a valid Tag_
value when the SType
object's
valid
member returns true
;
By default, or after assigning a plain (default) STYPE_
object to
an STYPE_
object (e.g., using a statement like $$ =
STYPE_{}
), valid
returns false
, and the tag
member
returns Meta_::sizeofTag_
.
bool valid() const
The value true
is returned if the object contains a semantic
value. Otherwise false
is returned. Note that default STYPE_
values can be assigned to STYPE_
objects, but they do not
represent valid semantic values. See also the previous description of
the tag
member.
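The interface described above can be mimicked in a few lines of standard C++. The following is a simplified model and not the code bisonc++ generates: it swaps in `std::variant` as storage, the names `SemVal` and `TypeOf` are hypothetical, the tags correspond to an assumed `%polymorphic INT: int; TEXT: std::string` specification, and a tag mismatch throws an exception where the generated parser would end with a fatal error message.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Assumed tags for: %polymorphic INT: int; TEXT: std::string
// sizeofTag_ models the value reported for objects without a value.
enum class Tag_ { INT, TEXT, sizeofTag_ };

// Hypothetical helper mapping each tag to its data type.
template <Tag_ tagId> struct TypeOf;
template <> struct TypeOf<Tag_::INT>  { using type = int; };
template <> struct TypeOf<Tag_::TEXT> { using type = std::string; };

class SemVal        // simplified model, not bisonc++'s Meta_::SType
{
    // index 0: no value; index 1: int; index 2: std::string
    std::variant<std::monostate, int, std::string> d_value;

public:
    SemVal() = default;

    // Plain (non-template) assignment operators, one set per type.
    SemVal &operator=(int value)
    {
        d_value = value;
        return *this;
    }
    SemVal &operator=(std::string const &value)
    {
        d_value = value;
        return *this;
    }
    SemVal &operator=(std::string &&tmp)
    {
        d_value = std::move(tmp);
        return *this;
    }

    // Constructs the value in place from any number of arguments.
    template <Tag_ tagId, typename... Args>
    void assign(Args &&...args)
    {
        d_value.emplace<typename TypeOf<tagId>::type>(
                                        std::forward<Args>(args)...);
    }

    // Returns the stored value after checking the requested tag.
    template <Tag_ tagId>
    typename TypeOf<tagId>::type &get()
    {
        if (tagId != tag())
            throw std::logic_error("tag mismatch");

        return std::get<typename TypeOf<tagId>::type>(d_value);
    }

    bool valid() const
    {
        return d_value.index() != 0;
    }

    Tag_ tag() const
    {
        return valid() ? static_cast<Tag_>(d_value.index() - 1)
                       : Tag_::sizeofTag_;
    }
};
```

In this model `value.assign<Tag_::TEXT>(3, 'x')` builds the string from two constructor arguments, something the single-argument assignment operators cannot express; that is the distinction between `assign` and `operator=` drawn in the text.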
%polymorphic
directive looks like this:
    %polymorphic INT: int; TEXT: std::string;

Furthermore, the grammar declares tokens INT and IDENTIFIER, pre-associates the TEXT tag with the identifier nonterminal, and associates the INT tag with the int nonterminal. The combi nonterminal is associated with the generic STYPE_ semantic value type:
    %type <TEXT>   identifier
    %type <INT>    int
    %type <STYPE_> combi
The parser's grammar is simple, expecting input lines, formatted
according to the following (rule
) production rule:
    rule:
        identifier '(' identifier ')' '\n'
    |
        identifier '=' int '\n'
    |
        combi '\n'
    ;
The rules for identifier
and int
return, respectively, text and an
int
value:
    identifier:
        IDENTIFIER  { $$ = d_scanner.matched(); }
    ;

    int:
        INT         { $$ = d_scanner.intValue(); }
    ;
These simple assignments can be used as int
is pre-associated with the
INT
tag and identifier
is associated with the TEXT
tag.
The combi
rule, which is used in one of the production rules of
`rule
', accepts a single int
value, as well as an identifier. So it
cannot be associated with a single polymorphic type. But as it is associated
with the generic STYPE_
type, it can pass on any polymorphic value. In
rule's
production rule the generic semantic value is then simply passed on
to process
, expecting a plain STYPE_ const &
. The function
process
has to inspect the semantic value's tag to learn what
kind of value is stored inside the received semantic value. Here are the
definitions of the combi
nonterminal and action blocks for the rule
nonterminal:
    combi:
        int
    |
        identifier
    ;

    rule:
        identifier '(' identifier ')' '\n'
        {
            cout << $1 << " " << $3 << '\n';
        }
    |
        identifier '=' int '\n'
        {
            cout << $1 << " " << $3 << '\n';
        }
    |
        combi '\n'
        {
            process($1);
        }
    ;
Note that combi's
production rules do not define action blocks. The
standard way to handle these situations is to add $$ = $1
action blocks to
non-empty production rules not defining final action blocks. This works well
in the current example, but a default-actions quiet
(or warn
) option
or directive can also be used.
The function process
, called from the action block of rule's combi alternative, inspects the
semantic value's tag to select the proper way of handling the received
semantic value. Here is its implementation:
    void Parser::process(STYPE_ const &semVal) const
    {
        if (semVal.tag() == Tag_::INT)
            cout << "Saw an int-value: " << semVal.get<Tag_::INT>() << '\n';
        else
            cout << "Saw text: " << semVal.get<Tag_::TEXT>() << '\n';
    }
It is easily created by flexc++(1) processing the following simple specification file.
    %interactive
    %filenames scanner

    %%

    [ \t]+                      // skip white space
    [0-9]+                      return Parser::INT;
    [a-zA-Z_][a-zA-Z0-9_]*      return Parser::IDENTIFIER;
    .|\n                        return matched()[0];
The reader may refer to flexc++(1) documentation for details about flexc++(1) specification files.
An action consists of C++ statements surrounded by braces, much like a compound statement in C++. Most rules have just one action at the end of the rule, following all the components. Actions in the middle of a rule are tricky and should be used only for special purposes (see section 4.6.2.4).
The components of production rules are numbered, the first component having number 1. E.g., in a production rule
nonterm: first second third
first
is component #1, second
is component #2, third
is
component #3. C++ code in action blocks may refer to semantic values of
these components using dollar-notations, where $i
refers to
the semantic value of the i-th component.
Likewise, the semantic value of the rule's nonterminal is represented by
$$
. Here is a typical example:
    exp:
        ...
    |
        exp '+' exp
        {
            $$ = $1 + $3;
        }
    |
        ...

This rule constructs
exp
from two exp nonterminals connected by a
plus-sign token. In the action, $1
and $3
represent the semantic
values of, respectively, the first (left-hand side) exp
component, and the
second (right-hand side) exp
component.
The sum is assigned to $$
, which becomes the semantic value of the exp
nonterminal represented by the production rule.
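What such an action amounts to can be sketched in plain C++ under a common simplifying assumption: the semantic values of a rule's components sit on top of a value stack, and reducing the rule replaces them by the value of the rule's nonterminal. The function `reduceAddition` and the use of `std::vector<int>` are hypothetical; the generated parser manages its own stacks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of reducing  exp: exp '+' exp { $$ = $1 + $3; }
// The three topmost values belong to the rule's three components; the
// middle one (for the '+' token) carries no useful value.
inline void reduceAddition(std::vector<int> &values)
{
    std::size_t const top = values.size();

    int const first = values[top - 3];      // $1: left-hand exp
    int const third = values[top - 1];      // $3: right-hand exp

    values.resize(top - 3);                 // remove the components
    values.push_back(first + third);        // $$: value of the new exp
}
```

Values below the three components are left untouched, which is why nested expressions can be reduced one after another.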
Depending on the specification of the default-actions
option/directive
(cf. section 4.5.4) bisonc++ may supply non-empty production rules
with default action blocks containing the statement $$ = $1
: the semantic
value of the first component of the production rule is returned as the
nonterminal's semantic value. Of course, the default action is only valid if
the two data types match. Empty production rules are not provided with
default action blocks.
Negative dollar indices (e.g., $-1) are allowed, and refer to semantic values of elements in rules before the component naming the current rule's nonterminal. This is a risky practice, and you should use it only when you know what you're doing. Here is a situation where you can use this reliably:
    vardef:
        type varlist ';'
    ;

    varlist:
        varlist ',' variable
        {
            defineVar($-1, $3);
        }
    |
        variable
        {
            defineVar($-1, $1);
        }
    ;
As long as varlist
is only used in the vardef
rule, varlist
can
be sure that the type of the variable is available as the semantic value of
the component immediately before (hence: $-1) the varlist
component in the
vardef
rule. See also section 5.6.
In addition to the dollar-notations shown here, bisonc++ supports several more
dollar-notations. The next three subsections describe the dollar-notations
that are available after specifying, respectively, the %stype, %union
or
%polymorphic
directives.
The %stype
directive is used to specify a single semantic value type. By
default the semantic value type is int
. For single type semantic values
several dollar-notations are available. Some of the dollar-notations shown
below may appear redundant: that is because their meanings are identical when
using single type semantic values. Their meanings differ, however, when using
union or polymorphic semantic values. Furthermore, below the notation $1 is
used as a generic reference to semantic values of production rule
components. Instead of $1 other available numbered dollar references can also
be used.
Here is the overview:
$$ =
A value is assigned to the rule's nonterminal's semantic value. The
right-hand side (rhs) of the assignment expression must be an expression of a
type that can be assigned to the STYPE_
type.
$$(expr)
Same as the previous dollar-notation: expr's
value is assigned to
the rule's nonterminal's semantic value.
_$$
This refers to the semantic value of the rule's nonterminal.
$$
Same as the previous item: this refers to the semantic value of the rule's nonterminal.
$$.
If STYPE_
is a class-type then this dollar-notation is shorthand
for the member selector operator, applied to the rule's nonterminal's semantic
value.
$$->
If STYPE_
is a class-type then this dollar-notation is shorthand
for the pointer to member operator, applied to the rule's nonterminal's
semantic value.
_$1
This refers to the current production rule's first component's semantic value.
$1
Same as the previous dollar-notation: this refers to the current production rule's first component's semantic value.
$1.
If STYPE_
is a class-type then this dollar-notation is shorthand
for the member selector operator, applied to the current production rule's
first component's semantic value.
$1->
If STYPE_
is a class-type then this dollar-notation is shorthand
for the pointer to member operator, applied to the current production rule's
first component's semantic value.
_$-1
This refers to the semantic value of a component in a production rule, listed immediately before the current rule's nonterminal ($-2 refers to a component used two elements before the current nonterminal, etc.).
$-1
Same as the previous item: this refers to the semantic value of a component in a production rule, listed immediately before the current rule's nonterminal.
$-1.
If STYPE_
is a class-type then this dollar-notation is shorthand
for the member selector operator, applied to the semantic value
of some production rule element, 1 element before the current rule's
nonterminal.
$-1->
If STYPE_
is a class-type then this dollar-notation is shorthand
for the pointer to member operator, applied to the semantic value
of some production rule element, 1 element before the current rule's
nonterminal.
The %union
directive is used to specify a union of semantic value types.
Although each semantic value type can be used for each STYPE_
variable,
(non)terminals are in practice associated with a single type. These
associations are automatically applied through bisonc++'s dollar-notations.
Note that the %union
directive is now largely superseded by the facilities
offered by the %polymorphic
directive. Using %union
is considered
somewhat `old school', as %polymorphic
implements cleaner type
definitions, allowing bisonc++ and the C++ compiler to verify that types are
correctly being used.
In the next overview the notation $1 is used as a generic reference to semantic values of production rule components. Instead of $1 other available numbered dollar references can also be used.
The %polymorphic
directive is used to specify a series of semantic value
types. (Non)terminals can be associated with exactly one of these types, or
with the generic STYPE_
semantic value type. When the latter type is used
either another STYPE_
value can be assigned to it, or a value of one of
the defined polymorphic value types can be assigned to it. At any time an
STYPE_
can only hold one single value type. Polymorphic semantic types
are type safe: types cannot be confused. Furthermore, as STYPE_
objects
are responsible for their own memory management, memory leaks cannot occur,
assuming that the different semantic value types do not leak.
In the next overview the dollar-notations that are available with polymorphic semantic values are described. In this overview $1 is used as a generic reference to semantic values of production rule components. Instead of $1 other available numbered dollar references can also be used.
$$ =
A semantic value is assigned to the rule's nonterminal's semantic
value. The right-hand side (rhs) of the assignment expression must be an
expression of the type that is associated with $$. This assignment operation
assumes that the type of the rhs-expression equals $$'s semantic value
type. If the types don't match the compiler issues a compilation error when
compiling parse.cc
. Casting the rhs to the correct value type is possible,
but in that case the function call operator (see the next item) is preferred,
as it does not require casting. If no semantic value type was associated with
$$ then the assignment $$ = STYPE_{}
can be used.
$$(expr)
A value is assigned to the rule's nonterminal's semantic value. Expr
must be of a type that can be statically cast to $$'s semantic value type. The
required static_cast
is generated by bisonc++ and doesn't have to be
specified for expr
.
_$$
This refers to the rule's nonterminal's semantic value, disregarding any polymorphic type that might have been associated with the rule's nonterminal.
$$
If no polymorphic type was associated with the rule's nonterminal then this is shorthand for a reference to the rule's plain STYPE_ value. If a polymorphic value type was associated with the rule's nonterminal then this shorthand represents a reference to a value of that particular type.
$$.
If no polymorphic type was associated with the rule's nonterminal then this is shorthand for the member selector operator, applied to a reference to the rule's nonterminal's STYPE_ value. If a polymorphic value type was associated with the rule's nonterminal then this shorthand represents the member selector operator, applied to a reference of that particular type.
$$->
If no polymorphic type was associated with the rule's nonterminal then this is shorthand for the pointer to member operator, applied to a reference to the rule's nonterminal's STYPE_ value. If a polymorphic value type was associated with the rule's nonterminal then this shorthand represents the pointer to member operator, applied to a reference of that particular type.
_$1
This refers to the current production rule's first component's generic STYPE_ value.
$1
This shorthand refers to the semantic value of the production rule's first element. If it was associated with a polymorphic type, then $1 refers to a value of that particular type. If no association was defined then $1 represents a generic STYPE_ value.
$1.
If the production rule's first component's semantic value was associated with a polymorphic type, then $1. is shorthand for the member selector operator, applied to the value of the associated polymorphic type. If no association was defined then $1. is shorthand for the member selector operator, applied to the first component's generic STYPE_ value.
$1->
If the production rule's first component's semantic value was associated with a polymorphic type, then $1-> is shorthand for the pointer to member operator, applied to the value of the associated polymorphic type. If no association was defined then $1-> is shorthand for the pointer to member operator, applied to the first component's generic STYPE_ value.
_$-1
This refers to the generic (STYPE_) value of a component in a production rule, listed immediately before the current rule's nonterminal ($-2 refers to a component used two elements before the current nonterminal, etc.).
$-1
Same: this refers to the generic (STYPE_) value of a component in a production rule, listed immediately before the current rule's nonterminal ($-2 refers to a component used two elements before the current nonterminal, etc.).
$-1.
This is shorthand for the member selector operator applied to the generic STYPE_ value of some production rule element, 1 element before the current rule's nonterminal.
$-1->
This is shorthand for the pointer to member operator applied to the generic STYPE_ value of some production rule element, 1 element before the current rule's nonterminal.
$<tag>-1
This shorthand represents a reference to the semantic value of the polymorphic type associated with tag of some production rule element, 1 element before the current rule's nonterminal. If, when the generated parser class's parse function is running, the polymorphic type of that element turns out not to match the type that is associated with tag then a run-time fatal error results. If that happens, and the debug option/directive had been specified when bisonc++ was run, then rerun the program after specifying parser.setDebug(Parser::ACTIONCASES) to locate the parse function's action block where the fatal error was encountered.
$<tag>-1.
This shorthand represents the member selector operator, applied to the semantic value of the polymorphic type associated with tag of some production rule element, 1 element before the current rule's nonterminal. If, when the generated parser class's parse function is running, the polymorphic type of that element turns out not to match the type that is associated with tag then a run-time fatal error results. The procedure suggested at the previous ($<tag>-1) item for solving such errors can be applied here as well.
$<tag>-1->
This shorthand represents the pointer to member operator, applied to the semantic value of the polymorphic type associated with tag of some production rule element, 1 element before the current rule's nonterminal. If, when the generated parser class's parse function is running, the polymorphic type of that element turns out not to match the type that is associated with tag then a run-time fatal error results. The procedure suggested at the $<tag>-1 item for solving such errors can be applied here as well.
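The run-time check described for the $<tag>-1 notations can be pictured with a small stand-alone C++ sketch. It is not the code generated by bisonc++: the names Tag, SemanticValue, asInt and asText are hypothetical, std::variant merely stands in for the polymorphic STYPE_ type, and an exception takes the place of the fatal error. What it illustrates is that a tagged access can only be verified once the program is running, because only then is it known which type the value actually holds.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-in for a polymorphic semantic value: it holds either
// an int (tag INT) or a std::string (tag TEXT).
enum class Tag { INT, TEXT };

class SemanticValue
{
    std::variant<int, std::string> d_value;

public:
    SemanticValue(int value) : d_value(value) {}
    SemanticValue(std::string value) : d_value(std::move(value)) {}

    // The tag of the value that is currently stored.
    Tag tag() const
    {
        return d_value.index() == 0 ? Tag::INT : Tag::TEXT;
    }

    // Tagged access, comparable to $<INT>-1: only valid if the stored
    // value really is an int. A mismatch is detected at run time.
    int asInt() const
    {
        if (tag() != Tag::INT)
            throw std::runtime_error("semantic value does not hold an INT");
        return std::get<int>(d_value);
    }

    // Tagged access, comparable to $<TEXT>-1.
    std::string const &asText() const
    {
        if (tag() != Tag::TEXT)
            throw std::runtime_error("semantic value does not hold a TEXT");
        return std::get<std::string>(d_value);
    }
};
```

Calling asInt on a value that was constructed from a std::string compiles without complaint but throws when executed, which mirrors why a mismatching tag cannot be reported when parse.cc is compiled.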
A mid-rule action block can refer to the components preceding it using $i, but it cannot refer to subsequent components because it is executed before they have been observed.
The mid-rule action block itself counts as one of the components of the production rule. In the example shown below, stmt can be referred to as $6 in the final action block.
Mid-rule action blocks can also have semantic values. When using %union or %polymorphic and a rule's nonterminal is associated with a union field or polymorphic token, then mid-rule action blocks lose those associations. When the $$, $$. or $$-> shorthand notations appear in mid-rule action blocks of production rules whose nonterminal is associated with a polymorphic type or union field then a warning is issued that automatic type associations do not apply. Using the _$$ shorthand notation prevents the warning from being issued.
Here is an example from a hypothetical grammar, defining a let statement that looks like this:

    let (variable) statement

Here, variable is the name of a variable that must only exist for the statement's lifetime. To parse this construct, we must insert variable into a symbol table before parsing statement, and must remove it afterward. Here is how it is done:

    stmt:
        LET '(' variable ')'
        {
            _$$.assign<SYMTAB>(symtab());
            addVariable($3);
        }
        stmt
        {
            $$ = $6;
            restoreSymtab($5);
        }

As soon as `let (variable)' has been recognized, the first action is executed. It saves the current symbol table as the mid-rule's semantic value, using the polymorphic tag SYMTAB (which could be associated with, e.g., an std::unordered_map). Then addVariable receives the new variable's name, adding it to the current symbol table. Once the first action is finished, the embedded statement (stmt) is parsed. Note that the mid-rule action is component number 5, so `stmt' is component number 6.
Once statement has been parsed, its semantic value is returned as the semantic value of the production rule's nonterminal. Then the semantic value from the mid-rule action block is used to restore the symbol table to its original state. This removes the temporary let-variable from the list so that it won't appear to exist while the rest of the program is parsed.
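The let rule only names the helpers symtab, addVariable and restoreSymtab; their implementations are not shown. The following is a minimal sketch of what they might look like, assuming that the symbol table is an std::unordered_map from names to int values and that the helpers are members of a class here called Scope. The type SymTab, the class Scope, the member defined and the initial value 0 are all assumptions made for this illustration.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Assumed symbol table type; the text only suggests std::unordered_map.
using SymTab = std::unordered_map<std::string, int>;

class Scope
{
    SymTab d_symtab;

public:
    // Returns a copy of the current table. In the let rule this copy
    // becomes the semantic value of the mid-rule action block.
    SymTab symtab() const
    {
        return d_symtab;
    }

    // Adds a variable to the current table (the initial value 0 is arbitrary).
    void addVariable(std::string const &name)
    {
        d_symtab.emplace(name, 0);
    }

    // Puts back the table that was saved before the let-variable was added.
    void restoreSymtab(SymTab saved)
    {
        d_symtab = std::move(saved);
    }

    // True if name is currently known.
    bool defined(std::string const &name) const
    {
        return d_symtab.count(name) != 0;
    }
};
```

Saving a full copy is the simplest way to make restoration trivial. A real parser handling large tables would more likely use a stack of scopes, but that goes beyond what the rule itself requires.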
Defining mid-rule action blocks before a rule has completely been recognized often leads to conflicts since the parser, because of the single look-ahead token, must make a decision about the parsing sequence to use. For example, the following two rules, without mid-rule actions, can coexist because the parser can always process the open-brace token and only then look at the next token before deciding whether there is a declaration or not:

    compound:
        '{' declarations statements '}'
    |
        '{' statements '}'
    ;

But when an action block is inserted before the first open brace a conflict is introduced:

    compound:
        {
            prepareForLocalVariables();
        }
        '{' declarations statements '}'
    |
        '{' statements '}'
    ;

Now the parser, encountering the open brace, must decide whether the open brace belongs to the first or second production rule. It must make this decision because it must execute the statement in the action block if it selects the first production rule, while no action is required when selecting the second production rule (cf. section 7.0.3).
This problem cannot be solved by putting identical actions into the two rules, like so:

    compound:
        {
            prepareForLocalVariables();
        }
        '{' declarations statements '}'
    |
        {
            prepareForLocalVariables();
        }
        '{' statements '}'
    ;

The conflict remains, because bisonc++ considers all action blocks as unique elements of production rules, and does not inspect the action blocks' contents.
A solution to the above problem consists of hiding the action block in a support nonterminal symbol which recognizes the first block-open brace and then performs the required preparations:

    openblock:
        '{'
        {
            prepareForLocalVariables();
        }
    ;

    compound:
        openblock declarations statements '}'
    |
        openblock statements '}'
    ;

Now bisonc++ can execute the action in the rule for openblock without deciding which rule for compound it eventually uses. Note that the action is now at the end of its rule. Any mid-rule action can be converted to an end-of-rule action in this way, and this is what bisonc++ actually does to implement mid-rule actions. Bisonc++ converts mid-rule action blocks to hash-tag numbered elements in production rules. In a rule like

    string:
        {
            ...
        }
        TOKEN ...
    ;

warning messages referring to the mid-rule action block could look like this:

    [grammar: warning] Line 10: `rule 3: string -> #0001': ...

Here, `#0001' is the hidden nonterminal merely containing the mid-rule action block. It is as though we had written this grammar:

    string:
        #0001 TOKEN ...
    ;

    #0001:
        {
            ...
        }
    ;
|
). Each
alternative should begin with a unique identifying terminal token. The terminal token may actually be hidden in a nonterminal rule, in which case that nonterminal can be used as an alias for the terminal token. In fact identical terminal tokens may be used as long as the alternatives differ in their terminal tokens at some later point. Here are some examples:

    // Example 1: plain terminal distinguishing tokens
    expr:
        ID
    |
        NUMBER
    ;

    // Example 2: nested terminal distinguishing tokens
    expr:
        id
    |
        number
    ;

    id:
        ID
    ;

    number:
        NUMBER
    ;

    // Example 3: eventually diverging routes
    expr:
        ID id
    |
        ID number
    ;

    id:
        ID
    ;

    number:
        NUMBER
    ;
ASCII characters define a string, and multiple white-space delimited strings are handled as one single string:

    "hello"         // multiple ws-delimited strings
    " "
    "world"

    "hello world"   // same thing

Usually a parser is responsible for concatenating the individual string-parts, receiving one or more STRING tokens from the lexical scanner. A string rule handles one or more incoming STRING tokens:

    string:
        STRING
    |
        string STRING

The above rule can be used as a prototype for recognizing a series of elements. The token STRING may of course be embedded in another rule. The generic form of this rule could be formulated as follows:

    series:
        unit
    |
        series unit

Note that the single element is first recognized, whereafter the left-recursive alternative may be recognized repeatedly.
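The order in which a left-recursive series is recognized can be mimicked in plain C++. The sketch below is an illustration only: the function concatenate is hypothetical and receives the STRING tokens' texts as a vector, whereas a real parser would receive them one by one from the lexical scanner. It starts from the single element and then extends the result once for every following element, just as the second alternative of the string rule would do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Mirrors the order of recognition of
//     string: STRING | string STRING
// for the texts of a sequence of STRING tokens.
std::string concatenate(std::vector<std::string> const &tokens)
{
    if (tokens.empty())
        return {};      // not covered by the rule, which needs at least one STRING

    // First alternative: a single STRING becomes the initial `string'.
    std::string result = tokens.front();

    // Second alternative, applied once per remaining token:
    // the string recognized so far is extended by the next STRING.
    for (std::size_t idx = 1; idx != tokens.size(); ++idx)
        result += tokens[idx];

    return result;
}
```

Passing the three parts "hello", " " and "world" yields the single string "hello world".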

    opt_statements:
        // empty
    |
        opt_statements statement

The above rule can be used as a prototype for recognizing an optional series of elements. The generic form of this rule could be formulated as follows:

    opt_series:
        // empty
    |
        opt_series unit

Note that the empty element is recognized first, even though it is empty, whereafter the left-recursive alternative may be recognized repeatedly. In practice this means that an action block may be defined at the empty alternative, which is then executed prior to the left-recursive alternative. Such an action block could be used to perform initializations necessary for the proper handling of the left-recursive alternative.

    variables:
        IDENTIFIER
    |
        variables ',' IDENTIFIER

The above rule can be used as a prototype for recognizing a series of delimited elements. The generic form of this rule could be formulated as follows:

    series:
        unit
    |
        series delimiter unit

Note that the single element is first recognized, whereafter the left-recursive alternative may be recognized repeatedly. In fact, this rule is not really different from the standard rule for a series. The same cannot be said of the rule recognizing an optional series of delimited elements, covered in the next section.

    opt_parlist:
        // empty
    |
        opt_parlist ',' parameter

To define an optional series of delimited elements two rules are required: one rule handling the optional part, the other the delimited series of elements. So, the correct definition is as follows:

    opt_parlist:
        // empty
    |
        parlist
    ;

    parlist:
        parameter
    |
        parlist ',' parameter
    ;

Again, the above rule pair can be used as a prototype for recognizing an optional series of delimited elements. The generic form of these rules could be formulated as follows:

    opt_series:
        // empty
    |
        series
    ;

    series:
        element
    |
        series delimiter element

Note that the opt_series rule neatly distinguishes the no-element case from the case where elements are present. Usually these two cases need to be handled quite differently, and the opt_series rule's empty alternative easily allows us to recognize the no-elements case.
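The division of labor between the two rules can be sketched in C++ as two functions, one per rule. This is a hand-written illustration operating on a string of characters, not LALR(1) parsing as performed by bisonc++, and the function names series and optSeries are chosen here merely to match the rule names. The optSeries function handles the no-element case itself and delegates everything else to series, which may then assume that at least one element is present.

```cpp
#include <cassert>
#include <string>
#include <vector>

// series: element | series delimiter element
// Assumes at least one element: every delimiter separates two elements.
std::vector<std::string> series(std::string const &input, char delimiter)
{
    std::vector<std::string> elements;
    std::string::size_type begin = 0;

    while (true)
    {
        std::string::size_type end = input.find(delimiter, begin);

        if (end == std::string::npos)
        {
            elements.push_back(input.substr(begin));    // last (or only) element
            return elements;
        }

        elements.push_back(input.substr(begin, end - begin));
        begin = end + 1;                                // skip the delimiter
    }
}

// opt_series: // empty | series
std::vector<std::string> optSeries(std::string const &input, char delimiter)
{
    if (input.empty())
        return {};              // the empty alternative: no elements at all

    return series(input, delimiter);
}
```

Because the empty case is dealt with in one place, series never has to consider whether a delimiter may appear before the first element, which is exactly the problem of the single-rule attempt.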

    expr:
        NUMBER
    |
        ID
    |
        expr '+' expr
    |
        ...
    |
        '(' expr ')'
    ;

This definition is characterized simply by the fact that the nonterminal expr itself appears within a set of parentheses, which is not too complex.
The definition of opt_statements, however, is a bit more complex. But acknowledging the fact that a statement contains, among other elements, a compound statement, and that a compound statement, in turn, contains opt_statements, an opt_statements construction can be formulated accordingly:

    opt_statements:         // define an optional series
        // empty
    |
        opt_statements statement
    ;

    statement:              // define alternatives for `statement'
        expr_statement
    |
        if_statement
    |
        ...
    |
        compound_statement
    ;

    compound_statement:     // define the compound statement itself
        '{' opt_statements '}'
    ;
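The mutual recursion between these three rules can be made visible with a small hand-written recognizer in C++. It is a sketch under simplifying assumptions and not the way bisonc++ parses: the input is a string of characters rather than tokens, a single ';' stands in for all non-compound statements, and the function names are chosen merely to match the rule names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Each function returns true if its construct was recognized starting at
// position pos, and advances pos past the recognized text.
bool optStatements(std::string const &text, std::size_t &pos);

// compound_statement: '{' opt_statements '}'
bool compoundStatement(std::string const &text, std::size_t &pos)
{
    if (pos >= text.size() || text[pos] != '{')
        return false;
    ++pos;

    if (!optStatements(text, pos))
        return false;

    if (pos >= text.size() || text[pos] != '}')
        return false;
    ++pos;

    return true;
}

// statement: ';' | compound_statement
// (';' is an assumption, standing in for all other kinds of statements)
bool statement(std::string const &text, std::size_t &pos)
{
    if (pos < text.size() && text[pos] == ';')
    {
        ++pos;
        return true;
    }
    return compoundStatement(text, pos);
}

// opt_statements: // empty | opt_statements statement
bool optStatements(std::string const &text, std::size_t &pos)
{
    // Keep recognizing statements for as long as one can start here.
    while (pos < text.size() && (text[pos] == ';' || text[pos] == '{'))
    {
        if (!statement(text, pos))
            return false;
    }
    return true;                // also covers the empty alternative
}

// True if the complete text is an optional series of statements.
bool recognize(std::string const &text)
{
    std::size_t pos = 0;
    return optStatements(text, pos) && pos == text.size();
}
```

A compound statement calls optStatements, which calls statement, which may call compoundStatement again, so arbitrarily deep nesting such as "{;{;}}" is handled without any rule mentioning a nesting depth.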
%class-name directive, and construct parser objects of each of the defined classes.
In multi-threaded programs each thread can create its own parser object, and then call that object's parse function: bisonc++ does not use static data that are modified by the generated parser. In multi-threaded programs, when the parser uses polymorphic semantic values, the option --thread-safe or the directive %thread-safe should be specified to ensure that the code handling the polymorphic semantic data types accesses the error-counter data member of its own parser object.