

15.1 Tokenizing

A crucial stage in text processing is the initial tokenization of text. A token in Festival is an atom delimited by whitespace in a text file (or string). If punctuation is defined for the current language, characters matching that punctuation are removed from the beginning and end of each token and held as features of the token. The default list of characters to be treated as whitespace is defined as

(defvar token.whitespace " \t\n\r")

The default sets of punctuation characters are

(defvar token.punctuation "\"'`.,:;!?(){}[]")
(defvar token.prepunctuation "\"'`({[")

These are declared in lib/token.scm but may be changed for different languages, text modes, etc.
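As a minimal sketch of such a change, the variables can be reassigned with set! after lib/token.scm has been loaded, for example from a language setup function or a text mode's initialization. The choice of angle brackets here is purely illustrative, and the sketch assumes the tokenizer consults these globals at the time the text is tokenized.

```scheme
;; Illustrative override only: also treat angle brackets as punctuation.
;; Assumes lib/token.scm has already been loaded, so both variables exist.
(set! token.punctuation "\"'`.,:;!?(){}[]<>")   ; defaults plus < and >
(set! token.prepunctuation "\"'`({[<")          ; defaults plus <
```

Under these assumed settings, a token such as <word> would be expected to yield word, with the surrounding brackets held as features of the token rather than as part of its name. A mode that changes these values should normally restore the originals when it exits, so that later text is tokenized with the defaults.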