Utterance types (Festival Speech Synthesis System)

14.2 Utterance types

The primary purpose of utterance types is to define which modules are to be applied to an utterance. UttTypes are defined in lib/synthesis.scm. The function defUttType defines the list of modules to be applied to an utterance of a given type. When the function utt.synth is called, it applies this list of modules to an utterance before waveform synthesis is called.

For example, when a Segments type utterance is synthesized, it needs only to have its values loaded into a Segment relation and a Target relation; then the low-level waveform synthesis module Wave_Synth is called. This is defined as follows:

(defUttType Segments
  (Initialize utt)
  (Wave_Synth utt))
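
As a rough illustration, a Segments utterance might be built and synthesized as sketched below. The phone names are illustrative and assume a voice whose phone set contains them. The entry format shown (segment name, duration in seconds, then optional F0 targets as position and Hz pairs) is our understanding of the input form, not something stated in this section.

```scheme
;; Sketch only: phone names (#, h, @, l, ou) are assumed to exist in the
;; phone set of the currently selected voice.
;; Each entry: (name duration-in-seconds (target-position F0-in-Hz) ...)
(set! utt1
      (Utterance Segments
                 ((# 0.19)
                  (h 0.055 (0 115))
                  (@ 0.037 (0.018 136))
                  (l 0.064)
                  (ou 0.208 (0.0 134) (0.100 135) (0.208 123))
                  (# 0.19))))

;; Apply the modules listed for the Segments type, then play the result.
(utt.synth utt1)
(utt.play utt1)
```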

A more complex type is the Text type utterance, which requires many more modules to be called before a waveform can be synthesized:

(defUttType Text
  (Initialize utt)
  (Text utt)
  (Token utt)
  (POS utt)
  (Phrasify utt)
  (Word utt)
  (Intonation utt)
  (Duration utt)
  (Int_Targets utt)
  (Wave_Synth utt)
)

The Initialize module should normally be called for all types. It loads the necessary relations from the input form and deletes all other relations (if any exist) ready for synthesis.

Modules may be defined directly as C/C++ functions and declared with a Lisp name, or as simple Lisp functions that check some global parameter before calling a specific module (e.g. choosing between different intonation modules).
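
A Lisp-level module of the second kind could look roughly like the sketch below. The function name My_Intonation is hypothetical. The parameter Int_Method and the underlying modules Intonation_Simple, Intonation_Tree and Intonation_Default are assumed to be available in the installation; none of them is defined in this section.

```scheme
;; Hypothetical dispatcher: choose an intonation module according to a
;; global parameter.  The parameter and module names are assumptions.
(define (My_Intonation utt)
  (let ((method (Parameter.get 'Int_Method)))
    (cond
     ((eq? method 'Simple) (Intonation_Simple utt))
     ((eq? method 'ToBI)   (Intonation_Tree utt))
     (t                    (Intonation_Default utt)))))
```

Such a function can then appear in a defUttType body in the same way as any C/C++ module.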

These types are used when calling the function utt.synth; individual modules may also be called explicitly by hand if required.
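
The two styles of use might be sketched as follows. The variable names utt1 and utt2 are illustrative, and the inspection helpers utt.relation.items and item.name are assumed to be available in the installation.

```scheme
;; Style 1: let utt.synth apply every module listed for the Text type.
(set! utt1 (Utterance Text "Hello world."))
(utt.synth utt1)

;; Style 2: apply the first few modules by hand, in the order given in
;; the Text type definition, stopping after Word.
(set! utt2 (Utterance Text "Hello world."))
(Initialize utt2)
(Text utt2)
(Token utt2)
(POS utt2)
(Phrasify utt2)
(Word utt2)

;; Inspect the words found so far (helper names are assumptions).
(mapcar item.name (utt.relation.items utt2 'Word))
```

Stopping part way like this can be handy when debugging a single stage; the remaining modules would be applied in the same manner.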

Because we expect waveform synthesis methods themselves to become complex, with a defined set of functions to select, join, and modify units, we now support an additional notion of SynthTypes. Like UttTypes, these define a set of functions to apply to an utterance. They may be defined using the defSynthType function. For example:

(defSynthType Festival
  (print "synth method Festival")

  (print "select")
  (simple_diphone_select utt)

  (print "join")
  (cut_unit_join utt)

  (print "impose")
  (simple_impose utt)
  (simple_power utt)

  (print "synthesis")
  (frames_lpc_synthesis utt)
  )

A SynthType is selected by naming it as the value of the parameter Synth_Method.
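
For instance, the SynthType called Festival in the example above might be selected as in this minimal sketch. It assumes that a SynthType of that name has been defined and that the functions in its body exist in the installation.

```scheme
;; Select the SynthType to be used for waveform synthesis.
(Parameter.set 'Synth_Method 'Festival)

;; Check which method is currently selected.
(Parameter.get 'Synth_Method)
```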

During the application of the function utt.synth, three hooks are applied. These allow additional control of the synthesis process. before_synth_hooks is applied before any modules are applied. after_analysis_hooks is applied at the start of Wave_Synth, when all text, linguistic and prosodic processing has been done. after_synth_hooks is applied after all modules have been applied. These are useful for things such as altering the volume of a voice that happens to be quieter than others, or outputting information for a talking head before waveform synthesis occurs, so that preparation of the facial frames and synthesis of the waveform may be done in parallel (see festival/examples/th-mode.scm for an example use of these hooks for a talking head text mode).
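
A hedged sketch of setting two of these hooks follows. It assumes that a hook variable may hold a list of one-argument functions, each taking and returning the utterance, and that utt.wave.rescale, utt.type and the %l directive of format are available. The factor 2.0 is an arbitrary illustrative value.

```scheme
;; Report the utterance type before any module runs.
(set! before_synth_hooks
      (list
       (lambda (utt)
         (format t "synthesizing utterance of type %l\n" (utt.type utt))
         utt)))

;; Make a quiet voice louder once the waveform has been generated.
(set! after_synth_hooks
      (list
       (lambda (utt)
         (utt.wave.rescale utt 2.0)   ; illustrative gain factor
         utt)))
```

Note that set! replaces any hooks already installed, for example by a voice definition, so in practice it may be preferable to add to the existing value rather than overwrite it.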