A crucial stage in text processing is the initial tokenization of text. A token in Festival is an atom delimited by whitespace in a text file (or string). If punctuation is defined for the current language, characters matching that punctuation are stripped from the beginning and end of each token and held as features of the token. The default list of characters treated as whitespace is defined as
(defvar token.whitespace " \t\n\r")
The default sets of punctuation and prepunctuation characters are
(defvar token.punctuation "\"'`.,:;!?(){}[]")
(defvar token.prepunctuation "\"'`({[")
These are declared in lib/token.scm but may be changed for different languages, text modes, etc.
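As a minimal sketch of such a change, the fragment below extends the defaults so that angle brackets are also stripped from tokens. The choice of angle brackets is purely illustrative and not something Festival ships with. It assumes the variables have already been defined by lib/token.scm and that the new values are set before any text is tokenized.

```scheme
;; Hypothetical override, e.g. in a language or text-mode setup function.
;; Assumes lib/token.scm has already been loaded, so the defaults exist.

;; Also strip a trailing ">" from the end of a token
;; (illustrative addition, not a Festival default).
(set! token.punctuation (string-append token.punctuation ">"))

;; Also strip a leading "<" from the start of a token
;; (illustrative addition, not a Festival default).
(set! token.prepunctuation (string-append token.prepunctuation "<"))

;; Whitespace is left at its default value here.
(set! token.whitespace " \t\n\r")
```

Because these are global variables, a setup function that changes them for one language or text mode would normally restore the previous values when that language or mode is exited.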