Tokens are further analysed into lists of words. A word is an atom that can be given a pronunciation by the lexicon (or letter to sound rules). A token may give rise to a number of words or none at all.
For example, the basic tokens
    This pocket-watch was made in 1983.
would give a word relation of
    this pocket watch was made in nineteen eighty three
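One way to inspect this mapping is to synthesize the sentence and list the items in the Token and Word relations. The following is a minimal sketch, assuming a running Festival with a default English voice loaded; the variable name utt1 is only illustrative.

```scheme
;; Build an utterance from raw text (utt1 is an illustrative name).
(set! utt1 (Utterance Text "This pocket-watch was made in 1983."))

;; Run the full synthesis chain, which includes token-to-word analysis.
(utt.synth utt1)

;; Names of the items in the Token relation.
(mapcar (lambda (tok) (item.name tok))
        (utt.relation.items utt1 'Token))

;; Names of the items in the Word relation built from those tokens.
(mapcar (lambda (w) (item.name w))
        (utt.relation.items utt1 'Word))
```

Comparing the two lists shows which tokens expanded into several words.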
Because the relationship between tokens and words is in some cases complex, a user function may be specified for translating tokens into words. This is designed to deal with things like numbers, email addresses, and other non-obvious pronunciations of tokens as zero or more words. Currently a builtin function, builtin_english_token_to_words, offers much of the necessary functionality for English, but a user may customize this further.
If the user defines a function token_to_words which takes two arguments, a token item and a token name, it will be called by the Token_English and Token_Any modules. A substantial example is given as english_token_to_words in festival/lib/token.scm. It is quite elaborate and covers most of the common multi-word tokens in English, including numbers, money symbols, Roman numerals, dates, times, plurals of symbols, number ranges, telephone numbers and various other symbols.
Let us look at the treatment of one particular phenomenon which shows the use of these rules. Consider the expression "$12 million", which should be rendered as the words "twelve million dollars". Note that the word "dollars", which is introduced by the "$" sign, ends up after the end of the expression. There are two cases we need to deal with, as there are two tokens. The first clause of the cond checks that the current token name is a money value and that the following token is a magnitude (million, billion, trillion, zillion etc.). If that is the case, the "$" is removed and the remaining number is pronounced by calling the builtin token-to-word function. The second clause deals with the second token. It confirms that the previous token is a money value (using the same regular expression as before) and then returns the current word followed by the word "dollars". If the token matches neither of these forms, the builtin function is called.
(define (token_to_words token name)
"(token_to_words TOKEN NAME)
Returns a list of words for NAME from TOKEN."
 (cond
  ((and (string-matches name "\\$[0-9,]+\\(\\.[0-9]+\\)?")
        (string-matches (item.feat token "n.name") ".*illion.?"))
   (builtin_english_token_to_words token (string-after name "$")))
  ((and (string-matches (item.feat token "p.name")
                        "\\$[0-9,]+\\(\\.[0-9]+\\)?")
        (string-matches name ".*illion.?"))
   (list
    name
    "dollars"))
  (t
   (builtin_english_token_to_words token name))))
It is valid to make some conditions return no words, though some care should be taken with that, as punctuation information may no longer be available to later processing if there are no words related to a token.
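As an illustration of a condition that returns no words, the sketch below silences tokens made up only of "=" or "_" characters, such as a separator rule in plain text. The choice of tokens to silence and the regular expression are assumptions made for this example and are not taken from lib/token.scm.

```scheme
(define (token_to_words token name)
"(token_to_words TOKEN NAME)
Returns a list of words for NAME from TOKEN."
 (cond
  ;; Hypothetical rule: a token consisting only of "=" or "_"
  ;; characters is given no words at all.
  ((string-matches name "[=_]+")
   nil)
  ;; Everything else is left to the builtin function.
  (t
   (builtin_english_token_to_words token name))))
```

Because such a token has no related words, any punctuation attached to it is subject to the caveat above.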