POS Example (Festival Speech Synthesis System)

29.1 POS Example

This example shows how we can use part of the standard synthesis process to tokenize and tag a file of text. This section does not cover training and setting up a part of speech tag set (see POS tagging), only how to use the standard POS tagger on text.

This example also shows how to use Festival as a simple scripting language, and how to modify various methods used during text to speech.

The file examples/text2pos contains an executable shell script which reads arbitrary ASCII text from standard input and writes each word with its part of speech (one per line) to standard output.

A Festival script, like any other UNIX script, must start with the characters #! followed by the name of the festival executable. For scripts the option -script is also required. Thus our first line looks like

#!/usr/local/bin/festival -script

Note that the pathname may need to be different on your system.

Following this we have copious comments, to keep our lawyers happy, before we get into the real script.

The basic idea is that the tts process segments text into utterances, and those utterances are then passed to a list of functions defined by the Scheme variable tts_hooks. Normally this variable contains a list of two functions, utt.synth and utt.play, which synthesize and play the resulting waveform. In this case we instead wish to predict the part of speech of each word and then print it out.

The first function we define basically replaces the normal synthesis function utt.synth. It runs the standard festival utterance modules used in the synthesis process, up to the point where POS is predicted. This function looks like

(define (find-pos utt)
"Main function for processing TTS utterances.  Predicts POS and
prints words with their POS"
  (Token utt)
  (POS utt)
)

The normal text-to-speech process first tokenizes the text, splitting it into “sentences”. The utterance type of these is Token. Then we call the Token utterance module, which converts the tokens to a stream of words. Then we call the POS module to predict part of speech tags for each word. Normally we would call further modules, ultimately generating a waveform, but in this case we need no further processing.

The second function we define is one that will print out the words and parts of speech

(define (output-pos utt)
"Output the word/pos for each word in utt"
 (mapcar
  (lambda (pair)
    (format t "%l/%l\n" (car pair) (car (cdr pair))))
  (utt.features utt 'Word '(name pos))))

This uses the utt.features function to extract features from the items in a named stream of an utterance. In this case we want the name and pos features for each item in the Word stream. Then for each pair we print out the word’s name, a slash and its part of speech followed by a newline.
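Before wiring these two functions into the tts process, it can be useful to try them by hand on a single utterance at the Festival prompt. The following is only a sketch: it assumes find-pos and output-pos have already been defined in the current session, and that the Text utterance type and the Initialize and Text modules behave as in the standard distribution. The variable name utt1 and the sample sentence are arbitrary.

```scheme
;; Sketch: exercise find-pos and output-pos on one utterance by hand.
;; Assumes both functions are already defined in this session.
;; A Text utterance holds a raw string, so it has to be tokenized
;; first; when reading a file, the tts process does this step for us.
(set! utt1 (Utterance Text "The old man sat down."))
(Initialize utt1)   ; prepare the utterance for processing
(Text utt1)         ; split the raw string into tokens
(find-pos utt1)     ; tokens -> words -> part of speech tags
(output-pos utt1)   ; print word/pos, one per line
```

The exact tags printed depend on the tagger and tag set of the currently selected voice.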

Our next job is to redefine the functions to be called during text to speech. The variable tts_hooks is defined in lib/tts.scm. Here we set it to our two newly-defined functions

(set! tts_hooks (list find-pos output-pos))
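The order of the list matters, and so does what each function returns. The sketch below illustrates the chaining, on the assumption that hooks are applied in list order and that each hook receives the value returned by the previous one. The name run-hooks-sketch is hypothetical and is not part of Festival; it is shown only to make the data flow explicit.

```scheme
;; Hypothetical helper, for illustration only (not part of Festival).
;; Applies each function in HOOKS to OBJ in turn, passing the result
;; of one function on as the argument of the next.
(define (run-hooks-sketch hooks obj)
  (cond
   ((null? hooks) obj)                       ; no hooks left: done
   ((consp hooks)                            ; a list of hooks
    (run-hooks-sketch (cdr hooks) ((car hooks) obj)))
   (t (hooks obj))))                         ; a single function
```

Under that assumption, find-pos must return the utterance so that output-pos has something to work on. It does, because its last expression is the call to the POS module, which returns the utterance it was given.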

To keep garbage collection messages from appearing on the screen, we turn them off with the following command

(gc-status nil)

The final stage is to start the tts process running on standard input. Because we have redefined what functions are to be run on the utterances, it will no longer generate speech but just predict part of speech and print it to standard output.

(tts_file "-")