These examples are simple, but bisonc++ grammars for real programming languages are written the same way. You can copy the examples from this document into source files to create your own versions of these programs. In addition, the bisonc++ source archive contains the various source files ready for use.
All sources for this calculator are found in the demos/rpn/ directory.
%baseclass-preinclude cmath %token NUM %stype double
The grammar file's first (directive) configures bisonc++ by specifying the
tokens (see section The bisonc++ Declarations Section). Each terminal symbol that
is not a single-character literal must be declared here (Single-character
literals are normally not declared, but are represented by literal character
constants). In the current example, all arithmetic operators are designated by
single-character literals, so the only terminal symbol that needs to be
declared is NUM
, the token type for numeric constants. As bisonc++ by default
uses the type int
as the semantic value type, but a calculator usually
uses floating point values the directive %stype double
is used to indicate
that the calculator uses `double
' as the type for its semantic values.
input: // empty | input line ; line: '\n' | exp '\n' { std::cout << "\t" << $1 << std::endl; } ; exp: NUM | exp exp '+' { $$ = $1 + $2; } | exp exp '-' { $$ = $1 - $2; } | exp exp '*' { $$ = $1 * $2; } | exp exp '/' { $$ = $1 / $2; } | // Exponentiation: exp exp '^' { $$ = pow($1, $2); } | // Unary minus: exp 'n' { $$ = -$1; } ;
The rules of the rpn
`language' defined here are the expression
(using the name exp
), the line of input (line
), and the complete input
transcript (input
). Each of these nonterminal symbols has several
alternate rules, joined by the `|
' separator, separating the various
production rules. The various rules are explained next.
The semantics of the language are determined by the actions taken once nonterminals have been recognized. The actions consist of C++ code that appears inside braces. See section 4.6.2.
Actions are specified using C++, but bisonc++ provides the means for passing
semantic values between rules. Refer to section 4.6.2 below for an
extensive coverage of the various possibilities. As a short introduction, the
pseudo-variable $$
can be used in rules to represent the semantic value
for the nonterminal that the production rule is defining. Assigning a value to
$$
is the main task of many actions. The semantic values of the elements
of production rules are referred to as $1
(the first element of a
production rule), $2
(the second element), and so on.
input
:
input: // empty | input line ;This definition should be read as follows: A complete input is either an empty string, or a complete input followed by an input line. Notice that `complete input' is defined in terms of itself. This definition is said to be left recursive since the rule's nonterminal (
input
) appears as the
leftmost symbol in the production rule. See section 4.4.
The first alternative is empty: there are no symbols between the colon and the
first `|
'; this means that input can match an empty string of input (no
tokens). We write the rules this way because it is legitimate to type
Ctrl-d
(end of input) immediately after starting the calculator. By
convention empty alternatives are provided with the comment `// empty
'.
The second production rule (`input line
') handles all nontrivial input. It
means after reading any number of lines, read one more line. The left
recursion turns this rule into a loop: it processes any number of lines.
The parser's parsing function continues to process input until a grammatical error is encountered or untilthe lexical analyzer returns the end-of-file token.
line
nonterminal:
line: '\n' | exp '\n' { cout << "\t" << $1 << endl; } ;The first alternative is a newline character token; this means that
rpn
accepts a blank line (and ignores it, since it has no associated
action). The second alternative is an expression (expr
), followed by a
newline. This alternative handles all expressions that are entered by the
user. The semantic value of the exp
nonterminal is the value of $1
because the exp
nonterminal is the first symbol of the production
rule. The action simply prints the expression's value.
This action is unusual because it does not assign a value to $$
. As a
consequence, the semantic value associated with line
is not defined. This
becomes a bug if that value is ever used, but we don't use it: once rpn
has printed the value of the user's input line, that value is no longer
needed.
exp
nonterminal has several rules, one for each kind of
expression. The first rule handles the simplest kind of expression: just a
single number. The second handles additions, which are two expressions
followed by a plus-sign. The third handles subtractions, and so on.
exp: NUM | exp exp '+' { $$ = $1 + $2; } | exp exp '-' { $$ = $1 - $2; } ...Normally the production rule separator `
|
' is used to separate the
various production rules, but the rules could equally well have separately
been defined:
exp: NUM ; exp: exp exp '+' { $$ = $1 + $2; } ; exp: exp exp '-' { $$ = $1 - $2; } ; ...Most rules have actions assoicated with them computing the value of the expression using the values of the elements of production rules. For example, in the rule for addition,
$1
refers to the first component exp
and
$2
refers to the second one. The third component, '+
', has no semantic
value associated with it, but if it had you could refer to it as $3
. When
the parser's parsing function recognizes a sum expression using this rule, the
sum of the two subexpressions' values is produced as the value of the entire
expression. See section 4.6.2.
You don't have to define actions for every rule. When a rule has no action,
bisonc++ by default copies the value of $1
into $$
. This is what happens
in the first rule (the one that uses NUM
).
The formatting shown here shows the recommended layout, but bisonc++ does not require it. You can alter whitespace as much as you like.
lex
), which is a predefined member
of the parser class. See section 5.3.1.
Our calculator only needs a simple lexical analyzer. It skips blanks and tabs,
then reads numbers, returning them as NUM
tokens. Any other character that
isn't part of a number is a separate token. Note that the token-value for such
a single-character token is the character itself.
The lexical analyzer function's return value is a numeric value representing a
token. If the token is a literal character, then its numeric code is the
character's ASCII
code; if the token is the name of a token defined by bisonc++
in a grammar specification file, then such named tokens are defined by bisonc++ in
a C++ enumeration. In the current example, therefore, NUM
becomes an
enumeration value, returned by the lex
member as the `value' NUM
.
Semantic values of nonterminals are stored in the parser's data member
d_val_
(comparable to the variable yylval
used by, e.g., Bison). This
data member has int
as its default type, but by specifying %stype
in
the grammar file's directive section this default type is modified (to, e.g.,
double
).
A token value of zero is returned once end-of-file is encountered (the parsing function produced by bisonc++ interprets any nonpositive token value as end-of-input).
Here is the lexical scanner's implementation:
#include "Parser.ih" // Lexical scanner returns a double floating point // number on the stack and the token NUM, or the ASCII // character read if not a number. Skips all blanks // and tabs, returns 0 for EOF. int Parser::lex() { char c; // get the next non-ws character while (std::cin.get(c) && (c == ' ' || c == '\t')) ; if (!std::cin) // no characters were obtained return 0; // indicate End Of Input if (c == '.' || isdigit(c)) // if a digit char was found { std::cin.putback(c); // return the character std::cin >> d_val_; // extract a number return NUM; // return the NUM token } return c; // otherwise return the extracted char. }
main
function is a very small one. It
constructs a parser object and then calls its parsing function to
start the calculator:
#include "main.ih" int main() { Parser parser; parser.parse(); }
parse
encounters a syntax error, it calls the error reporting
member function (error
) to print an error message. A very basic in-line
implementation is provided by bisonc++ in the parser class internal header file
(see chapter 5), which can easily be redefined by the
programmer. The error
member's default implementation is OK for
rpn
.
Once error
returns, bisonc++'s parser may recover from the error and continue
parsing if the grammar contains a suitable error rule (see chapter
8). Otherwise, the parsing function parse
returns a non-zero
value. No error production rules were used for rpn
, so the calculator
terminates at the first syntax error.. Not very nice in a real-life
calculator, but it is acceptable for this first example.
rpn
this means that a source file (main.cc
) is constructed
holding main
, and a file parser/lex.cc
containing the lexical
scanner's implementation. Note that I've put all the parser's files in a
separate directory as well (also see section 3.7).
In rpn's parser directory the file grammar
contains the grammar specification. Bisonc++ constructs a parser class and a
parsing member function from this file after issuing the command:
b() grammarFrom this, bisonc++ produced the following files:
Parser.h
, the parser class definition;
Parserbase.h
, the parser's base class definition, defining, among
other, the grammatical tokens to be used by externally defined lexical
scanners;
Parser.ih
, the internal header file, to be included by all
implementations of the parser class' members;
parse.cc
, the parsing member function.
Parserbase.h
and parse.cc
will be re-created each
time bisonc++ is re-run. Parser.h
and Parser.ih
may safely be modified
by the programmer, e.g., to add new members to to the parser class. These two
files will not be overwritten by bisonc++, unless explicitly instructed to do
so.
# List files (recursively) in the (current) examples/rpn directory. % ls -R .: build* main.cc parser/ rpn.ih ./parser: grammar lex.cc Parser.ih # Create `rpn' using the `build' script: % ./build program # List files again, ./rpn is the constructed program % ls -R .: build* main.cc parser/ program* rpn.ih ./parser: grammar grammar.output lex.cc parse.cc Parserbase.h Parser.h Parser.ih
Here is an example session using the calculator:
% program 4 9 + 13 3 7 + 3 4 5 *+- -13 3 7 + 3 4 5 * + - n Note the unary minus, `n' 13 5 6 / 4 n + -3.16667 3 4 ^ Exponentiation 81 ^D End-of-file indicator %
rpn
to handle infix operators instead of postfix. Infix
notation requires us to define operator precedence and to extend the grammar
with nested expressions. Here is bisonc++'s grammar specification for calc
, an
infix calculator:
%baseclass-preinclude cmath %stype double %token NUM %left '-' '+' %left '*' '/' %left NEG // negation--unary minus %right '^' // exponentiation %% input: // empty | input line ; line: '\n' | exp '\n' { std::cout << "\t" << $1 << '\n'; } ; exp: NUM | exp '+' exp { $$ = $1 + $3; } | exp '-' exp { $$ = $1 - $3; } | exp '*' exp { $$ = $1 * $3; } | exp '/' exp { $$ = $1 / $3; } | '-' exp %prec NEG { $$ = -$2; } | // Exponentiation: exp '^' exp { $$ = pow($1, $3); } | '(' exp ')' { $$ = $2; } ;The functions
lex
, error
and main
are identical to the ones
used with rpn
.
This example illustrates several important new features:
%left
declares token types
stating that the mentioned operators are left-associative. The directives
%left
and %right
(right associativity) are used instead of %token
,
which is merely used to declare a token identifier. The tokens mentioned with
%left
and %right
are single-character tokens, which ordinarily don't
need to be declared. We declare them here to specify their priority and
associativity.
%left
and %right
are specified, the lower the priority of
the operators that are following these keywords. So addition/subtraction have
the lowest precedence, multiplication and division are next, then unary minus
(NEG
), and exponentiation has the highest precedence. See section
4.5.9.
%prec
directive encountered
in the grammar section at the definition of the unary minus operator
production rule. The %prec
directive indicates that the minus operator in
the production rule `'-' exp
' has NEG's
, rather than the minus
('-'
) operator's precedence. See section 7.3.
Here is a sample run of calc
:
% calc 4 + 4.5 - (34/(8*3+-3)) 6.88095 -56 + 2 -54 3 ^ 2 9
parse
returns
after calling error
. This means that an erroneous input line ends the
calculator. Now we show how to recover from erroneous input.
The bisonc++ grammar-specification language includes the reserved symbol error
,
which can be used in production rules. In the example below it has been added
as a new alternative for recognizing the line
nonterminal:
line: '\n' | exp '\n' { std::cout << "\t" << $1 << '\n'; } | error '\n' ;
This addition to the grammar allows the calculator to recover from syntax
errors. If a syntax error is encountered, the third production rule is
activated, skipping all tokens until a newline token has been encounters. At
that point line
has been recognized, and parsing continues afresh at the
next input line (the error
member function is also called, printing its
message). Different from bison's error
implementation, bisonc++ proceeds on the
assumption that whenever error
is used in a production rule it is the
grammar constructor's intention to have the parser continue parsing
(therefore, a statement like `yyerrok
', encountered in bison grammars is
superfluous in bisonc++ grammars). The reserved symbol error
itself causes the
parsing function to skip all subsequent input until a token that can follow
error
has been encountered. In the above implementation that token is the
newline character `\n
' (see chapter 8).
This form of error recovery deals with syntax errors. There are other
kinds of errors; for example, divisions by zero, which raises an exception
signal that is normally fatal. A real calculator program must handle this
signal and use whatever it takes to discard the rest of the current line of
input and resume parsing thereafter. Handling such signals and other forms of
semantic errors is not discussed here, as it is not specific to bisonc++
programs. But once a semantic error has been encountered, handling functions
may call ERROR()
, resulting in the same procedure as the one that's used
for syntax errors: all subsequenct tokens are skipped until a token has been
encountered that can follow the reserved symbol error
.
+, - * /
and ^
. It would be nice to have a calculator that
allows us to use some other mathematical functions as well, such as sin
,
cos
, etc..
It is easy to add new operators to the infix calculator as long as they are
only single-character literals. The parser's member lex
returns all
non-number characters as tokens, so only some new grammar production rules
need to be added to the grammar when such tokens must be recognized. But we
want something more flexible: built-in functions that can be called like this:
function_name (argument)At the same time, we add memory to the calculator, allowing us to use variables. Here is a sample session with the multi-function calculator:
pi = 3.141592653589 3.14159 sin(pi) 7.93266e-13 alpha = beta1 = 2.3 2.3 alpha 2.3 ln(alpha) 0.832909 exp(ln(beta1)) 2.3Note that multiple assignment and nested function calls are supported.
mfcalc
calculator shows several new
features. Here is the bisonc++ directive section for the mfcalc
multi-function
calculator (line numbers were added for referential purposes, they are not
part of the declaraction section as used in the actual grammar file):
1 %union 2 { 3 double u_val; 4 double *u_symbol; 5 double (*u_fun)(double); 6 } 7 8 %token <u_val> NUM // Simple double precision number 9 %token <u_symbol> VAR // Variable 10 %token <u_fun> FNCT // Function 11 %type <u_val> exp 12 13 %right '=' 14 %left '-' '+' 15 %left '*' '/' 16 %left NEG // negation--unary minus 17 %right '^' // exponentiationThe above specifications show two new features of bisonc++'s grammar specification language, allowing semantic values to have different data types.
%union
directive (given in lines 1 through 6) allows semantic
values to have various data types (see section 4.5.33). It is used
instead of %stype
, and defines Parser::STYPE_
as a the union type:
all semantic values now have this type. Now semantic values can have any of
the types that are defined as the union's fields.
%type
directive is used to associate (non)terminal tokens with
the types of the fields of the union:
double
(for exp
and NUM
);
double
, being a pointer to entries in
mfcalc
's symbol table, used with VAR
tokens.
double
argument and
returning a double
value, used with FNCT
tokens.
NUM,
VAR, FNCT
, and exp
. The %type
directive first specifies (between
angle brackets) one of the union's field types, followed by a list of (blank
or comma separated) nonterminal or terminal symbols which are associated with
the specified union field type.
=
', defined in line
13: by making the assignment operator right-associative we can allow
sequential assignments of the form a = b = c = expression
.
calc
. Three rules, mentioning VAR
or
FNCT
, are new:
input: // empty | input line ; line: '\n' | exp '\n' { cout << "\t" << $1 << endl; } | error '\n' ; exp: NUM | VAR { $$ = *$1; } | VAR '=' exp { $$ = *$1 = $3; } | FNCT '(' exp ')' { $$ = (*$1)($3); } | exp '+' exp { $$ = $1 + $3; } | exp '-' exp { $$ = $1 - $3; } | exp '*' exp { $$ = $1 * $3; } | exp '/' exp { $$ = $1 / $3; } | '-' exp %prec NEG { $$ = -$2; } | // Exponentiation: exp '^' exp { $$ = pow($1, $3); } | '(' exp ')' { $$ = $2; } ;
The symbol table itself varies in size and contents once mfcalc
is
used. It is defined as the data member
d_symbols
in the Parser's header file. In contrast, the function table
has a fixed size and contents. Because of this the function table is
defined as a static data member. Both tables are defined as
std::unordered_map
containers: their keys are
std::string
objects, their values, respecively, double
s and double
(*)(double)
s. Here is the declaration of d_symbols
and s_functions
as
used in mfcalc
's parser:
std::unordered_map<std::string, double> d_symbols; static std::unordered_map<std::string, double (*)(double)> s_functions;As
s_functions
is a static member, it can be initialized compile
time from an array of pairs. To ease the definition of such an array a
private typedef
typedef std::pair<char const *, double (*)(double)> FunctionPair;is added to the parser class, as well as a private array
static FunctionPair s_funTab[];These definitions allow us to initialize
s_functions
in a separate
source file (data.cc
):
#include "Parser.ih" Parser::FunctionPair Parser::s_funTab[] = { FunctionPair("sin", sin), FunctionPair("cos", cos), FunctionPair("atan", atan), FunctionPair("ln", log), FunctionPair("exp", exp), FunctionPair("sqrt", sqrt), }; unordered_map<string, double (*)(double)> Parser::s_functions ( Parser::s_funTab, Parser::s_funTab + sizeof(Parser::s_funTab) / sizeof(Parser::FunctionPair) );By simply editing the definition of
s_funTab
, additional
functions can be added to the calculator.
mfcalc
, the parser's member function lex()
must now recognize
variables, function names, numeric values, and the single-character arithmetic
operators. Strings of alphanumeric characters not starting with a digit are
recognized as either variables or functions depending on the table in which
they are found. By arranging lex
's logic such that the function table is
searched first it is simple to ensure that no variable can ever have the name
of a predefined function. The implementation used here, in which two
different tables are used for the arithmetic functions and the variable
symbols is appealing because it's simple to implement. However, it also has
the drawback of being difficult to scale to more generic calculators, using,
e.g., different data types and different types of functions. In such
situations a single symbol table is more preferable, where the keys are the
identifiers (variables, function names, predefined constants, etc.) while the
values are objects describing their characteristics. A re-implementation of
mfcalc
using an integrated symbol table is suggested in one of the
exercises of the next section 6.5.
The parser's lex
member has these characteristics:
mfcalc
's parser (d_val
) is itself also a union
,
the numerical value can be extracted into d_val_.u_val
, and a
NUM
token can be returned.
s_functions
map. If found, d_val_.u_fun
is given the function's
address, found as the value of the s_functions
map element
corresponding to the read identifier, and token FNCT
is returned.
If the symbol is not found in s_functions
the address of the
value ofn d_symbols
associated with the received identifier is
assigned to d_val_.u_symbol
and token VAR
is returned. Note
that this automatically defines newly used variables, since
d_symbols[name]
automatically inserts a new element in a map if
d_symbol[name]
wasn't already there.
lex
member function:
#include "Parser.ih" /* Lexical scanner returns a double floating point number on the stack and the token NUM, or the ASCII character read if not a number. Skips all blanks and tabs, returns 0 for EOF. */ int Parser::lex() { char c; // get the next non-ws character while (cin.get(c) && (c == ' ' || c == '\t')) ; if (!cin) // no characters were obtained return 0; // indicate End Of Input if (c == '.' || isdigit (c)) // if a digit char was found { cin.putback(c); // return the character cin >> d_val_.u_val; // extract a number return NUM; // return the NUM token } if (!isalpha(c)) // c doesn't start an identifier: return c; // return a single character token. // in all other cases, an ident is entered. Recognize a var or function string word; // store the name in a string object while (true) // process all alphanumerics: { word += c; // add 'm to `word' if (!cin.get(c)) // no more chars? then done here break; if (!isalnum(c)) // not an alphanumeric: put it back and done. { cin.putback(c); break; } } // Now lookup the name as a function's name unordered_map<string, double (*)(double)>::iterator function = s_functions.find(word); // Got it, so return FPTR if (function != s_functions.end()) { d_val_.u_fun = function->second; return FNCT; } // no function, so return a VAR. Set // u_symbol to the symbol's address in the // d_symbol map. The map will add the // symbol if not yet there. d_val_.u_symbol = &d_symbols[word]; return VAR; }
mfcalc
, the following steps are suggested:
mfcalc.cc
. Actually, it is already available,
since all implementations of main()
used so far are identical to
each other.
parser
:
grammar
;
bisonc++ grammar
to produce the files Parser.h
,
Parserbase.h
, Parser.ih
and parse.cc
;
Parser.h
so as to include FunctionPair,
s_functions, s_funTab
and d_symbols
;
Parser.ih
so as to include cmath
and optionally
`using namespace std
', which is commented out by default;
data.cc
and lex.cc
to initialize the static
data and to contain the lexical scanner, respectively.
mfcalc
in mfcalc.cc
's directory using the
following command:
g++ -o mfcalc *.cc parser/*.cc
mfcalc
's
implementation and operating mode:
cmath
' to the
Parser::s_functions
;
Symbol
in which the symbol type, and an
appropriate value for the symbol is stored. Define only one map d_symbols
in the Parser, and provide the Symbol
class with means to obtain the
appropriate values for the various token types.
%union
directive, and change it into %stype
Symbol
. Hint: use the %preinclude-header
directive to make
Symbol
known to the parser's base class.
CONST
for numerical constants (like PI
, (E)),
and pre-define some numerical constants;
get()
and set()
member pair in Symbol
, and use the appropriate
member in the appropriate expr
rule; use ERROR()
to initiate error
recovery.