Lexicon requirements (Festival Speech Synthesis System)

13.6 Lexicon requirements

For English, a number of assumptions are made about the lexicon which are worthy of explicit mention. If you are going to use the existing token rules, you should include at least the following in any lexicon that is to work with them.

The letters of the alphabet. When a token is identified as an acronym it is spelled out, and the tokenization assumes that the individual letters of the alphabet are in the lexicon with their pronunciations. They should be identified as nouns. (This is to distinguish a as a determiner, which can be schwa’d, from a as a letter, which cannot.) The part of speech should be nn by default, but the value of the variable token.letter_pos is used, so it may be changed if nn is not what is required.
One-character symbols such as dollar, at-sign, percent, etc. It’s difficult to get a complete list and to know what the pronunciation of some of these is (e.g. hash or pound sign). But the letter-to-sound rules cannot deal with them, so they need to be explicitly listed. See the list in the function mrpa_addend in festival/lib/dicts/oald/oaldlex.scm. This list should also contain the control characters and eight-bit characters.
The possessive 's should be in your lexicon as schwa followed by a voiced fricative (z). It should be in twice, once with part of speech pos and once with n (used in plurals of numbers, acronyms, etc., e.g. 1950’s). 's is treated as a word and is separated from the tokens it appears with. The post-lexical rule (the function postlex_apos_s_check) will delete the schwa and devoice the z in appropriate contexts. Note this post-lexical rule brazenly assumes that the unvoiced fricative in the phoneset is s. If that is not the case in your phoneset, copy the function (it is in festival/lib/postlex.scm), change it for your phoneset, and use your version as a post-lexical rule.
Numbers as digits (e.g. "1", "2", "34", etc.) should normally not be in the lexicon. The number conversion routines convert numbers to words (i.e. "one", "two", "thirty four", etc.).
The word "unknown", or whatever is in the variable token.unknown_word_name. This is used in a few obscure cases when there just isn’t anything that can be said (e.g. single characters which aren’t in the lexicon). Some people have suggested it should be possible to make this a sound rather than a word. I agree, but Festival doesn’t support that yet.
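
The requirements above might be met with addenda entries along the following lines. This is only a sketch: the lexicon name example_lex is hypothetical, the mrpa phone strings and the n part of speech chosen for the symbol are illustrative assumptions rather than copies of the entries in oaldlex.scm, and the mrpa phoneset is assumed to have been defined already.

```scheme
;; Sketch only. The lexicon name, the phone strings and the part of
;; speech used for the symbol are assumptions made for illustration.
(lex.create "example_lex")          ; hypothetical lexicon name
(lex.set.phoneset "mrpa")           ; assumes the mrpa phoneset is defined

;; Letters of the alphabet, tagged with the letter part of speech.
(set! token.letter_pos 'nn)         ; nn is the default mentioned above
(lex.add.entry '("a" nn (((ei) 1))))
(lex.add.entry '("b" nn (((b ii) 1))))

;; A one-character symbol that letter-to-sound rules cannot handle.
(lex.add.entry '("%" n (((p @) 0) ((s e n t) 1))))

;; Possessive 's as schwa plus z, once as pos and once as n.
(lex.add.entry '("'s" pos (((@ z) 0))))
(lex.add.entry '("'s" n (((@ z) 0))))

;; The word spoken when nothing else can be said.
(set! token.unknown_word_name "unknown")

;; Run the apostrophe-s check as a post-lexical rule. This replaces any
;; hooks already set; substitute your own copy of the function if the
;; unvoiced fricative in your phoneset is not s.
(set! postlex_rules_hooks (list postlex_apos_s_check))
```

A real lexicon would of course list every letter and symbol, and the word named in token.unknown_word_name must itself have an entry in the lexicon.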