Part of speech tagging is a fairly well-defined process. Festival includes a part of speech tagger following the HMM-type taggers as found in the Xerox tagger and others (e.g. DeRose88). Part of speech tags are assigned based on two models: the probability distribution of tags given a word, and ngrams of tags. These models are externally specified, and a Viterbi decoder is used to assign part of speech tags at run time.
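The decoding step described above can be sketched roughly as follows. This is a minimal illustration only, not Festival's implementation: it assumes a bigram tag model rather than a longer ngram, uses a fixed probability floor in place of real smoothing, and every name and probability in it (viterbi_tag, EMIT, TRANS, TAGS) is hypothetical toy data.

```python
import math

# Hypothetical toy models, invented for illustration only.
TAGS = ["dt", "nn", "vb"]
# EMIT[word][tag] stands in for P(word | tag), indexed by word.
EMIT = {
    "the": {"dt": 0.9},
    "dog": {"nn": 0.05},
    "barks": {"vb": 0.02, "nn": 0.001},
}
# TRANS[(previous_tag, tag)] stands in for a bigram model P(tag | previous_tag).
TRANS = {
    ("punc", "dt"): 0.5, ("punc", "nn"): 0.3, ("punc", "vb"): 0.2,
    ("dt", "nn"): 0.8, ("dt", "vb"): 0.05, ("dt", "dt"): 0.01,
    ("nn", "vb"): 0.5, ("nn", "nn"): 0.3, ("nn", "dt"): 0.1,
    ("vb", "dt"): 0.5, ("vb", "nn"): 0.3, ("vb", "vb"): 0.05,
}


def viterbi_tag(words, emit, trans, tags, start_tag):
    """Return the highest-scoring tag sequence for words under the toy models."""
    if not words:
        return []
    floor = 1e-10  # crude stand-in for smoothing of unseen events

    def logp(p):
        return math.log(max(p, floor))

    def emission(word, tag):
        return logp(emit.get(word, {}).get(tag, 0.0))

    # best[i][tag] = (score of the best path ending in tag at word i, previous tag)
    best = [{
        t: (logp(trans.get((start_tag, t), 0.0)) + emission(words[0], t), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = emission(words[i], t)
            column[t] = max(
                (best[i - 1][p][0] + logp(trans.get((p, t), 0.0)) + e, p)
                for p in tags
            )
        best.append(column)

    # Follow the back-pointers from the best final tag to the first word.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    path.reverse()
    return path
```

With these toy numbers, viterbi_tag(["the", "dog", "barks"], EMIT, TRANS, TAGS, "punc") selects dt, nn, vb. Scores are summed as log probabilities so that long utterances do not underflow.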
So far this tagger has only been used for English but there
is nothing language specific about it. The module POS
assigns the tags. It accesses the following variables for
parameterization.
pos_lex_name
The name of a "lexicon" holding reverse probabilities of words
given a tag (indexed by word). If this is unset or has the
value NIL, no part of speech tagging takes place.
pos_ngram_name
The name of a loaded ngram model of part of speech tags (loaded
by ngram.load).
pos_p_start_tag
The name of the most likely tag before the start of an utterance. This is typically the tag for sentence final punctuation marks.
pos_pp_start_tag
The name of the most likely tag two before the start of an utterance. For English this is typically a simple noun, but for other languages it might be a verb. If the ngram model is of order greater than three, this tag is effectively repeated for the earlier left contexts.
pos_map
We have found that it is often better to use a rich tagset for
prediction of part of speech tags but that in later use (phrase breaks
and dictionary lookup) a much more constrained tagset is better. Thus
mapping of the predicted tagset to a different tagset is supported.
pos_map should be a list of pairs, each consisting of a list of tags
to be mapped and the new tag they are to be mapped to.
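Two of the behaviours described in the list above, the start-tag left context and the tagset mapping, can be sketched as follows. This is an illustration under assumptions, not Festival code: the function names, the sample tags, and the example map are hypothetical, and leaving a tag unchanged when no entry lists it is an assumption of this sketch rather than documented behaviour.

```python
def initial_context(order, p_start_tag, pp_start_tag):
    """Tags assumed to precede the first word, oldest first.

    order is the ngram order (2 for bigram, 3 for trigram, ...). The tag
    two before the start is repeated to fill any longer left context.
    """
    if order < 2:
        return []
    return [pp_start_tag] * (order - 2) + [p_start_tag]


def map_tag(tag, pos_map):
    """Map a predicted tag to the reduced tagset.

    pos_map is a list of (tags_to_map, new_tag) pairs. A tag listed in no
    pair is returned unchanged (an assumption of this sketch).
    """
    for source_tags, new_tag in pos_map:
        if tag in source_tags:
            return new_tag
    return tag


# Hypothetical map collapsing a rich tagset into a few coarse classes.
EXAMPLE_POS_MAP = [
    (["vb", "vbd", "vbg", "vbn", "vbp", "vbz"], "v"),
    (["nn", "nns", "nnp", "nnps"], "n"),
    (["jj", "jjr", "jjs"], "j"),
]
```

For example, initial_context(4, "punc", "nn") gives the left context nn, nn, punc, and map_tag("vbd", EXAMPLE_POS_MAP) gives v.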
Note that it is important for the part of speech tagger to produce tags that match those used in later parts of the system, particularly the lexicon. Only two of our lexicons used so far have (mappable) part of speech labels.
An example of the part of speech tagger for English can be found in lib/pos.scm.