For example, C++ defines an `expression
' nonterminal. One rule
defining an expression might be,
"An expression consists of a minus sign and another expression".Another rule could be
"An expression can be an integer".Rules often are recursive, but there must be at least one rule ending the recursion.
A commonly used formal system for describing such rules is the Backus-Naur Form or `BNF', which was developed for specifying the language Algol 60. Any grammar expressed in BNF is a context-free grammar. The input to bisonc++ is essentially machine-readable BNF.
Not all context-free languages can be handled by bisonc++, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse input while `looking ahead' to just a single token. Strictly speaking, that is a description of an LR(1) grammar, and LALR(1) involves some additional restrictions which are not further elaborated in the current context; but it in real life almost all LR(1) grammars are also LALR(1) grammars. See section 7.5 for more information on this.
In the formal rules defining a language's grammar, the grammar's constructs are called nonterminals. Some symbols are not further defined, but are directly made available when matching the input to a series of regular expressions. Matching the input to regular expressions is not the task of a parser, but of a lexical scanner (e.g., flexc++(1)). Lexical scanners provide the parser with symbols, called tokens or terminal symbols, which are taken for granted by the parser.
Terminal and nonterminal symbols are used to define the rules of grammars.
We can use examples from C++ to illustrate terminal and nonterminal
symbols. C++'s terminal symbols are identifiers, constants (numeric and
string), various keywords, arithmetic operators, and other simple character
tokens. So the terminal symbols of a grammar for C++ can thus be
`IDENTIFIER', `NUMBER', `STRING', as wel as tokens for keywords
(e.g., FOR, SWITCH
, operators ('+', BIT_OR
),
and character tokens like ';', ':', '(', ')'
).
Here is a simple C++ function subdivided into tokens:
int square(int x) // INT IDENTIFIER '(' INT IDENTIFIER ')' { // '{' return x * x; // RETURN IDENTIFIER '*' IDENTIFIER ';' } // '}'
C++ rules include expression rules, statement rules, rules defining
declarations, and rules defining functions. These are defined in C++'s
grammar as nonterminal symbols `expression', `statement', `declaration' and
`function_definition'. The above example shows a function definition as well
as the sequence of tokens (starting from INT
, ending at '}'
), received
by the parser when that definition shows up in its input.
Each nonterminal symbol must be defined by at least one so-called production rule, defining the sequence of symbols that must have been observed before it can be decided that the nonterminal symbol has been encountered. For example, one kind of C++ statement is the return statement; it could be described by the following informal rule:
A `statement' can be made of a `return' keyword, an `expression' and a `semicolon'.
Formally, such a rule is written like this:
statement: RETURN expression ';' ;and in general grammar rules have the following form:
nonterminal: production(s) ;where nonterminal is the nonterminal symbol defined by the rule and productions(s) are one or more sequences of terminal and nonterminal symbols that define how
nonterminal
can be recognized. Concentrating on a
single production rule, the rule's nonterminal is called the rule's left-hand
side (LHS), the production rule itself is called the rule's right-hand side
(RHS). In the statement
example the LHS equals statement
, the RHS
equals RETURN expression ';'
. The RHS consists of three elements,
numbered, respectively, 1 through 3.
One nonterminal symbol is special in that it defines a syntactically correct specification of the defined language. It is called the start symbol. In a compiler this means a complete input program. In C++, a nonterminal symbol `optional-sequence-of-definitions-and-declarations' could be used for this.
The parser generated by bisonc++ reads a sequence of tokens, and combines the tokens using the rules of the specified grammar. If the input is valid, the end result is that the entire token sequence reduces to the start symbol nonterminal. If we use C++'s grammar, then the entire input must reducible to `optional-sequence-of-definitions-and-declarations'. If not, the parser reports a syntax error.
A nonterminal symbol in the formal grammar is represented in bisonc++ input as
an identifier, like an identifier in C++. By convention, it should be in
lower case, such as expr
, stmt
or declaration
.
The bisonc++ representation for a terminal symbol is also called a token
type. Token types as well can be represented as C++-like identifiers. By
convention, these identifiers should be upper case to distinguish them from
nonterminals: for example, INTEGER
, IDENTIFIER
, IF
or
RETURN
. A terminal symbol that stands for a particular keyword in the
language should be named after that keyword converted to upper case. The
terminal symbol error
is reserved for error recovery. See section
4.2, which also describes the current restrictions on the names of
terminal symbols.
A terminal symbol can also be represented as a character literal, just like a C++ character constant. You should do this whenever a token is just a single character (parenthesis, plus-sign, etc.): use that same character in a literal as the terminal symbol for that token.
The grammar rules also have an expression in bisonc++ syntax. For example, here is the bisonc++ rule for a C++ return statement. The semicolon in quotes is a literal character token, representing part of the C++ syntax for the statement; the naked semicolon, and the colon, are bisonc++ punctuation used in every rule.
stmt: RETURN expr ';' ;See section 4.3.
x + 4
' is
grammatically correct then `x + 1
' or `x + 3989
' is also
grammatically correct.
But actual values are very important for understanding the input's meaning. A compiler is useless if it fails to distinguish between 4, 1 and 3989 as constants that are being used in a program. Therefore, each token in a bisonc++ grammar has both a token type and a semantic value. See section 4.6 for details.
A token is a terminal symbol defined in the grammar, such as INTEGER
,
IDENTIFIER
or ',
'. Tokens are used by the parser to make decisions
about the continuation of the parsing process, but for that only the mere
tokens are required, and nothing else.
Semantic values represent any additionally available information about tokens,
such as the values of integers, or the names of identifiers. (A token such as
',
' which is just punctuation normally doesn't have a particular semantic
value.)
For example, an token might be an INTEGER
, having the semantic value
4. Another INTEGER
token could have semantic value 3989. When a grammar
rule indicates that INTEGER
is allowed, either of these token/value
combinations is acceptable because each token is INTEGER
. When the parser
accepts the token, it may store the token's semantic value for future
reference on a semantic value stack.
Nonterminal symbols often also have semantic values. For example, in a calculator, an expression typically has a semantic value that is a number. In a compiler for a programming language, an expression typically has a semantic value that is a tree structure describing the meaning of the expression.
The purpose of an action usually is to compute the semantic value of the whole construct from the semantic values of its parts. For example, suppose we have a rule stating that an expression can be the sum of two expressions. Once the parser has recognized the rule's parts, each of its subexpressions has a semantic value of its own. The action for this rule could consist of a C++ expression computing the rule's semantic value from its subexpressions.
Actions usually compute semantic values and refer to semantic values of elements of production rules. This happens so frequently that several shorthand notations can be used to simplify referring to semantic values. These shorthand notations are covered extensively in section 4.6.2 and its sub-sections.
For example, here is a rule stating that an expression can be the sum of two subexpressions:
expr: expr '+' expr { $$ = $1 + $3; } ;The action defines how the semantic value of the rule (using shorthand
$$
) is computed from the values of the two sub-expressions (shorthands
$1
and $3
).
parse
) that parses the language
described by the grammar. The class and its implementation is called a bisonc++
parser class. Keep in mind that the bisonc++ utility and the produced parser
classes are distinct pieces of software: the bisonc++ utility is a program whose
output is the bisonc++ parser class that is used by your program.
More specifically, bisonc++ generates the following files from a bisonc++ grammar file:
The task of bisonc++'s parse
member is to group tokens according to rules
defined in the grammar -- for example, to combine identifiers and operators
into expressions. As it does this, it executes the actions of the grammar
rules it has recognized.
The tokens processed by the parser produced by bisonc++ are made available by an
object called the lexical analyzer or lexical scanner. The scanner is
not produced by bisonc++, but must somehow be supplied (e.g., by writing it
yourself). The parse
member requests the next token from the lexical
analyzer whenever it needs another token. The parser itself doesn't know the
semantic values of the received tokens. Typically the lexical scanner produces
tokens by parsing the characters of the program's input text. This, however,
is not something that bisonc++ concerns itself with. See also section 5.3.1.
Bisonc++'s parsing function is a member function named parse
. This parsing
function nor the parser object for which it is called defines a complete
C++ program: you must supply some additional details. One `detail' to be
supplied is is the lexical analyzer. The parser class itself declares several
additional members which can easily be redefined to fit your needs. One of
these additional members is the error-reporting function called by parse
to report errors. By default bisonc++ provides simple, but sensible,
implementations for such members.
Having constructed a parser class and a lexical scanner class, objects of
those classes must be defined in a complete C++ program. Usually such
objects are defined in main
; you have to provide this function, and
arrange for it to call the parser's parse
function, lest the parser never
run. See chapter 5.
Note that, different from conventions used by Bison and Bison++, bisonc++ does not impose a particular name convention. In particular, there is no need to begin all variable and function names used in the bisonc++ parser with `yy' or `YY'. However, some name restrictions on symbolic tokens exist. See section 4.5.29.1 for details.
Currently, bisonc++'s parsing member function parse
may or may not be
reentrant, depending on whether or not the option --thread-safe
is specified.
The source file generated by bisonc++ containing the parsing member function not
only contains this function, but also various tables (e.g., state transition
tables) defined in the anonymous namespace. When the option --thread-safe
is provided, these tables are const
tables: their elements are not changed
by the parsing function and so the parsing function, as it only manipulates
its own local data, becomes reentrant.
For each grammatical rule in the language, an action can be defined that is performed when a production rule has completely been recognized. The action is described using C++ statements.
parse
').
parse()
and its
support functions) 1must be implemented by the programmer. Commonly additional
member functions are also be declared in the parser class' header. At the
very least the member int lex()
, calling the lexecal scanner to produce
the next available token, must be implemented (although a standardized
implementation can optionallu be generated by bisonc++). The member lex
is
called by parse's
support functions to obtain the next available
token. The member function void error(char const *msg)
may also be
re-implemented by the programmer (a simple in-line implementation is
provided by default). The member function error
is called when
parse
encounters (syntactic) errors.
int main() { Parser parser; return parser.parse(); }
Bisonc++ directives %% Grammar rulesReaders familiar with Bison may note that there is no C declaration section and no section to define Additional C code. With bisonc++ these sections are superfluous since, due to the fact that a bisonc++ parser is a class, all additionally required code can be incorporated into the parser class itself. Also, C++ classes normally only require declarations that can be provided in the classes' header files, so also the `additional C code' section was omitted from the grammar file.
The `%%' is a punctuation that appears in every bisonc++ grammar file to separate the two sections.
The bisonc++ directives section contains all necessary directive specifications, like directives declaring the names of the terminal and nonterminal symbols, and directives describing operator precedence, and the data types of semantic values of various symbols.
The file's second section (grammar rules) is used to define how to construct each nonterminal symbol from its parts.
One special directive is availble that may be used in both the directives
section and in the grammar rules section. This directive is %include
,
allowing you to split long grammar specifications into smaller, more
comprehensible and accessible chunks. The %include
directive is discussed
in more detail in section 4.5.8.