

16 POS tagging

Part of speech tagging is a fairly well-defined process. Festival includes a part of speech tagger modelled on HMM-type taggers such as the Xerox tagger and others (e.g. DeRose88). Part of speech tags are assigned based on the probability distribution of tags given a word and on ngrams of tags. These models are externally specified, and a Viterbi decoder is used to assign part of speech tags at run time.
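The decoding step can be illustrated with a minimal sketch in Python. This is not Festival's implementation: the function name viterbi_tag, the toy lexicon and trigram tables, the probabilities, and the floor score for unseen trigrams are all assumptions made for illustration. It combines a per-word tag score with a tag trigram score and keeps the best path for each pair of adjacent tags, starting from two tags assumed to precede the utterance.

```python
import math


def viterbi_tag(words, lex, ngram, p_start, pp_start):
    """Return the most probable tag sequence for `words`.

    lex[word][tag]     -> log probability linking word and tag (toy lexicon)
    ngram[(t2, t1, t)] -> log P(t | t2, t1)                    (toy trigram model)
    p_start, pp_start  -> tags assumed one and two places before the utterance
    """
    floor = math.log(1e-6)  # assumed score for trigrams missing from the model
    # A state is the pair (previous tag, current tag); it stores (score, path).
    states = {(pp_start, p_start): (0.0, [])}
    for word in words:
        new_states = {}
        for (t2, t1), (score, path) in states.items():
            for tag, emit in lex[word].items():
                s = score + ngram.get((t2, t1, tag), floor) + emit
                key = (t1, tag)
                # Keep only the best-scoring path into each state.
                if key not in new_states or s > new_states[key][0]:
                    new_states[key] = (s, path + [tag])
        states = new_states
    return max(states.values(), key=lambda v: v[0])[1]


# Hypothetical toy models; the numbers are invented for the example.
lex = {
    "the": {"dt": math.log(0.9)},
    "can": {"md": math.log(0.6), "nn": math.log(0.1)},
    "rusts": {"vbz": math.log(0.3), "nns": math.log(0.2)},
}
ngram = {
    ("nn", "punc", "dt"): math.log(0.5),
    ("punc", "dt", "nn"): math.log(0.6),
    ("punc", "dt", "md"): math.log(0.01),
    ("dt", "nn", "vbz"): math.log(0.4),
    ("dt", "nn", "nns"): math.log(0.05),
    ("dt", "md", "vbz"): math.log(0.01),
}

tags = viterbi_tag(["the", "can", "rusts"], lex, ngram, "punc", "nn")
print(tags)
```

In this toy case "can" is more likely a modal in isolation, but the tag trigram context after a determiner makes the noun reading win, which is the behaviour the ngram model is there to provide.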

So far this tagger has only been used for English, but there is nothing language-specific about it. The module POS assigns the tags. It accesses the following variables for parameterization.

pos_lex_name

The name of a "lexicon" holding reverse probabilities of words given a tag (indexed by word). If this is unset or has the value NIL, no part of speech tagging takes place.

pos_ngram_name

The name of a loaded ngram model of part of speech tags (loaded by ngram.load).

pos_p_start_tag

The name of the most likely tag before the start of an utterance. This is typically the tag for sentence final punctuation marks.

pos_pp_start_tag

The name of the most likely tag two before the start of an utterance. For English this is typically a simple noun, but for other languages it might be a verb. If the ngram model has an order greater than three, this tag is effectively repeated for the earlier left contexts.

pos_map

We have found that it is often better to use a rich tagset for prediction of part of speech tags, but that in later use (phrase breaks and dictionary lookup) a much more constrained tagset is better. Thus mapping of the predicted tagset to a different tagset is supported. pos_map should be a list of pairs, each consisting of a list of tags to be mapped and the new tag they are to be mapped to.
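The shape of pos_map and its effect can be sketched in Python as follows. The function apply_pos_map and the particular tag groupings are hypothetical, and passing unmapped tags through unchanged is an assumption of this sketch rather than documented behaviour.

```python
def apply_pos_map(tags, pos_map):
    """Replace each predicted tag with its reduced tag.

    pos_map is a list of pairs: (tags to be mapped, new tag).
    Tags not listed in any pair are returned unchanged (an assumption).
    """
    lookup = {old: new for olds, new in pos_map for old in olds}
    return [lookup.get(tag, tag) for tag in tags]


# Hypothetical mapping from a rich tagset to a smaller one.
pos_map = [
    (("vb", "vbd", "vbg", "vbn", "vbp", "vbz"), "v"),
    (("nn", "nns", "nnp", "nnps"), "n"),
]

reduced = apply_pos_map(["dt", "nn", "vbz"], pos_map)
print(reduced)
```

Building a flat lookup table once keeps the mapping step linear in the number of tags, whatever the size of the map.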

Note that it is important for the tags produced by the part of speech tagger to match the tags used in later parts of the system, particularly the lexicon. Only two of our lexicons used so far have (mappable) part of speech labels.

An example of the part of speech tagger for English can be found in lib/pos.scm.

