Next: Text modes, Up: TTS [Contents][Index]
Text to speech works by first tokenizing the file and then chunking the
tokens into utterances. Where utterances break is determined by the
decision tree held in the variable eou_tree. A default version is given
in lib/tts.scm. Blank lines are probably the most reliable sign of an
utterance break, followed by certain punctuation. Because periods are
used both to end sentences and to mark abbreviations, further heuristics
are needed to guess which use is intended. The following tree is
currently used, and it works better than simply using punctuation.
(defvar eou_tree
 '((n.whitespace matches ".*\n.*\n\\(.\\|\n\\)*") ;; 2 or more newlines
   ((1))
   ((punc in ("?" ":" "!"))
    ((1))
    ((punc is ".")
     ;; This is to distinguish abbreviations vs periods
     ;; These are heuristics
     ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
      ((n.whitespace is " ")
       ((0))                  ;; if abbrev single space isn't enough for break
       ((n.name matches "[A-Z].*")
        ((1))
        ((0))))
      ((n.whitespace is " ")  ;; if it doesn't look like an abbreviation
       ((n.name matches "[A-Z].*")  ;; single space and non-cap is no break
        ((1))
        ((0)))
       ((1))))
     ((0))))))
The token items this tree is applied to will always (except at the
end of the file) include one following token, so lookahead is
possible. The "n.", "p." and "p.p." prefixes give access to the
surrounding token context. The features name, whitespace and punc
give access to the contents of the token itself. At present there is
no way to access the lexicon from this tree, which is unfortunate, as
that might be useful if certain abbreviations were identified as such
there.
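As a rough illustration of how a token's own features and the lookahead features combine, a much simpler tree could be installed in place of the default. This is only a sketch and not the tree shipped in lib/tts.scm: the punctuation set and the decision to require a capitalized following token are assumptions made for the example.

```scheme
;; Illustrative only: a simplified replacement for the default eou_tree.
;; It has no abbreviation handling, so it will break after "Mr." etc.
(set! eou_tree
 '((n.whitespace matches ".*\n.*\n\\(.\\|\n\\)*") ;; blank line before the next token
   ((1))
   ((punc in ("." "?" "!"))         ;; punctuation on the current token
    ((n.name matches "[A-Z].*")     ;; lookahead: next token is capitalized
     ((1))
     ((0)))
    ((0)))))
```

Each question tests one feature; the first branch after it is taken when the test succeeds and the second when it fails, with (1) marking an utterance break and (0) no break.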
Note that these heuristics are written by hand rather than trained from data, though problems have been fixed as they have been observed in data. The above rules may make mistakes where abbreviations appear at the ends of lines, and where improper spacing or capitalization is used. This is probably worth changing for modes where more casual text appears, such as email messages and USENET news messages. A possible improvement would be to analyse a text to find its basic threshold for utterance breaks (i.e. if no sequences of a full stop, two spaces and a capitalized word appear, and the text is of a reasonable length, then look for other criteria for utterance breaks).
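One way such a change might be tried for casual text is to keep a second, looser tree and swap it in only while that kind of text is being read. The following is a hedged sketch: casual_eou_tree, casual_previous_eou_tree, casual_eou_on and casual_eou_off are hypothetical names invented here, and the rule itself (ignore spacing and capitalization after a period) is an assumption, not a tested heuristic.

```scheme
;; Hypothetical looser tree: after a period, spacing and capitalization
;; of the next token are ignored.
(defvar casual_eou_tree
 '((n.whitespace matches ".*\n.*\n\\(.\\|\n\\)*") ;; 2 or more newlines
   ((1))
   ((punc in ("?" ":" "!"))
    ((1))
    ((punc is ".")
     ((name matches "\\(.*\\..*\\|[A-Z][A-Za-z]?[A-Za-z]?\\|etc\\)")
      ((0))      ;; looks like an abbreviation: never break
      ((1)))     ;; otherwise always break, whatever follows
     ((0))))))

;; Holds the tree that was active before the casual one was installed.
(defvar casual_previous_eou_tree nil)

(define (casual_eou_on)
  "Install the casual tree, remembering the current one."
  (set! casual_previous_eou_tree eou_tree)
  (set! eou_tree casual_eou_tree))

(define (casual_eou_off)
  "Restore the tree that was active before casual_eou_on."
  (set! eou_tree casual_previous_eou_tree))
```

The trade-off is that a real abbreviation at the end of a sentence no longer ends the utterance, so this variant would need checking against real data before being relied on.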
Ultimately what we are trying to do is to chunk the text into utterances that can be synthesized quickly and start to play them quickly to minimise the time someone has to wait for the first sound when starting synthesis. Thus it would be better if this chunking were done on prosodic phrases rather than chunks more similar to linguistic sentences. Prosodic phrases are bounded in size, while sentences are not.