Train n-gram language model
ngram_build offers basic ngram language model estimation.
Input data format:
Two input formats are supported. In sentence_per_line format, the program handles the start and end of each sentence (if required) by using special vocabulary items specified by -prev_tag, -prev_prev_tag and -last_tag. For example, the input sentence:
the cat sat on the mat
would be treated as
... prev_prev_tag prev_prev_tag prev_tag the cat sat on the mat last_tag
where prev_prev_tag is the argument to -prev_prev_tag, and so on. A default set of tag names is also available. This input format is only useful for sliding-window type applications (e.g. language modelling for speech recognition).
The second input format is ngram_per_line, which is useful either for non-sliding-window applications or where the user requires a different treatment of start/end of sentence from that provided above. In this format the input file simply contains one complete ngram per line. For the same example as above (to build a trigram model) this would be:
prev_prev_tag prev_tag the
prev_tag the cat
the cat sat
cat sat on
sat on the
on the mat
the mat last_tag
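The relationship between the two formats can be sketched in a few lines of Python. This is only an illustration of the padding and sliding window described above, not code taken from ngram_build: the function name sentence_to_ngrams is hypothetical, the tag names are used literally as placeholders, and the padding rule for orders other than 3 (N-2 copies of prev_prev_tag followed by one prev_tag) is an assumption extrapolated from the trigram example.

```python
def sentence_to_ngrams(sentence, n=3, prev_prev_tag="prev_prev_tag",
                       prev_tag="prev_tag", last_tag="last_tag"):
    """Pad one sentence with tags and slide a window of width n over it.

    Hypothetical helper: returns the lines an ngram_per_line file would
    contain for this sentence.
    """
    words = sentence.split()
    # Assumed left padding: n-1 context items, the last one being prev_tag.
    if n > 1:
        left = [prev_prev_tag] * (n - 2) + [prev_tag]
    else:
        left = []
    padded = left + words + [last_tag]
    # One window per position; each window is one complete ngram.
    return [" ".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]


trigram_lines = sentence_to_ngrams("the cat sat on the mat")
for line in trigram_lines:
    print(line)
```

With n=3 this prints the seven trigrams listed in the ngram_per_line example. Generating the ngrams yourself in this way is what allows a different treatment of sentence boundaries to be substituted.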
Representation:
The internal representation of the model becomes important for higher values of N where, if V is the vocabulary size, the number of possible ngrams (V^N) becomes very large. In such cases, we cannot explicitly hold probabilities for all possible ngrams, and a sparse representation must be used (i.e. only non-zero probabilities are stored).
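To make the size argument concrete, the following Python sketch compares the number of cells a dense table would need with the number of entries in a sparse table that stores only observed ngrams. It is a minimal illustration under stated assumptions, not the data structure ngram_build uses internally: the function names, the vocabulary size of 20000 and the toy data are all invented for the example, and the probabilities are plain relative frequencies.

```python
from collections import Counter


def dense_table_size(vocab_size, n):
    """Number of cells needed to hold every possible ngram: V**N."""
    return vocab_size ** n


def sparse_probabilities(ngrams):
    """Map each observed ngram to its relative frequency given its context.

    Ngrams that never occur get no entry, so they are implicitly zero.
    """
    ngrams = [tuple(g) for g in ngrams]
    ngram_counts = Counter(ngrams)
    # The context is the ngram without its final word.
    context_counts = Counter(g[:-1] for g in ngrams)
    return {g: c / context_counts[g[:-1]] for g, c in ngram_counts.items()}


# Toy data (hypothetical): four observed trigrams, three of them distinct.
observed = [
    ("the", "cat", "sat"),
    ("the", "cat", "ran"),
    ("the", "cat", "sat"),
    ("cat", "sat", "on"),
]
model = sparse_probabilities(observed)

print(dense_table_size(20000, 3))              # cells in a dense trigram table
print(len(model))                              # entries actually stored
print(model.get(("the", "cat", "flew"), 0.0))  # unseen ngram reads as zero
```

The dictionary grows with the number of distinct ngrams in the training data rather than with V^N, which is the point of a sparse representation.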
Getting more robust probability estimates:
The common techniques for getting better estimates of low- and zero-frequency ngrams are provided, namely smoothing and backing-off.
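As a rough illustration of what these two ideas mean, the Python sketch below applies them to a toy bigram model. The specific schemes are stand-ins chosen for brevity and are not claimed to be the ones ngram_build implements: the smoothing shown is add-one (Laplace) smoothing, and the backing-off shown is the simple unnormalised "stupid backoff" scheme with an assumed weight of 0.4, whose scores are not true probabilities. All function names are hypothetical.

```python
from collections import Counter


def train(tokens):
    """Collect unigram, bigram and context counts from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # How often each word appears as the first word of a bigram.
    contexts = Counter(tokens[:-1])
    return unigrams, bigrams, contexts


def add_one_prob(word, prev, unigrams, bigrams, contexts):
    """Add-one smoothing: every bigram count is incremented by one."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)


def backoff_score(word, prev, unigrams, bigrams, contexts, alpha=0.4):
    """Use the bigram estimate if seen, else a down-weighted unigram one."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / contexts[prev]
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total


tokens = "the cat sat on the mat".split()
unigrams, bigrams, contexts = train(tokens)

# "the sat" never occurs, yet both schemes give it a non-zero score.
print(add_one_prob("sat", "the", unigrams, bigrams, contexts))
print(backoff_score("sat", "the", unigrams, bigrams, contexts))
```

Smoothing redistributes probability mass within one ngram order, whereas backing-off falls back to a shorter context when the longer one has not been observed.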
Testing an ngram model:
Use the ngram_test program.