A speech recognizer converts the observed acoustic
signal into the corresponding orthographic
representation of the spoken
sentence. The recognizer chooses its guess from a finite vocabulary
of words that can be recognized. For simplicity, we assume
that a word is uniquely identified by its spelling.
Dramatic progress has been demonstrated in solving the speech
recognition problem via the use of a statistical
model of the joint distribution
of the sequence of spoken words W and the corresponding observed
sequence of acoustic information O. This approach, pioneered by the
IBM Continuous Speech Recognition group, is called the
source-channel model. In this
approach, the speech recognizer determines an estimate
of
the identity of the spoken word sequence from the observed acoustic
evidence O by using the a posteriori distribution
. To
minimize its error rate, the recognizer chooses that word sequence
that maximizes the a posteriori distribution:

where
is the
probability of the sequence of n-words W and
is the
probability of observing the acoustic evidence O when the sequence
W is spoken. The a priori distribution
of what words might
be spoken (the source) is referred to as a language
model (LM). The observation probability model
(the channel)
is called the acoustic model. We discuss in
this section, various approaches and issues for building the language
model.
The source-channel model has also been used in optical character recognition (OCR) where the observation sequence is the image of the printed characters, in handwriting recognition where the observation is the sequence of strokes on a tablet, or in machine translation (MT) where the observation is a sequence of words in one language and W represents the desired translation in another language. For all these applications, a language model is key. Therefore, the work on language modeling has a wide spectrum of applications.
For a given word sequence
of n words, we rewrite
the LM probability as:

where
is chosen
appropriately to handle the initial condition. The probability of
the next word
depends on the history
of words that have
been spoken so far. With this factorization the complexity of the
model grows exponentially with the length of the history. To have a
more practical and parsimonious model, only some aspects of the
history are used to affect the probability of the next word. One
way
to achieve this is to use a mapping
that
divides the space of histories into K equivalence classes. Then we
can use as a model:

Some of the most successful models of the past two decades are the simple n-gram models, particularly the trigram model (n=3) where only the most recent two words of the history are used to condition the probability of the next word. The probability of a word sequence becomes:

To estimate the trigram probabilities, one can use a large corpus of text, called the training corpus to estimate trigram frequencies:

where
is the number of times the sequence of words
is observed and
is the number of times the
sequence
is observed. For a vocabulary size V there
are
possible trigrams, which for 20,000 words translates to 8
trillion trigrams. Many of these trigrams will not be seen in the
training corpus. So these unseen trigrams will have zero probability
using the trigram frequency as an estimate of the trigram
probability. To solve this problem one needs a smooth estimate of the
probability of unseen events. This can be done by linear
interpolation of trigram,
bigram, and unigram frequencies and a
uniform distribution on the vocabulary.

where
and
are estimated by the ratio of the
appropriate bigram and unigram counts. The weights of the linear
interpolation are estimated by maximizing the probability of new
held-out data different from the data
used to estimate the n-gram frequencies. The
forward-backward algorithm can be
used to perform this maximum likelihood estimation problem.
In general, one uses more than one
vector; one may
want to rely more on the trigram frequencies for those histories that
have a high count as compared to those histories that have a low
count in the training data. To achieve this, one can use a bucketing
scheme on the bigram and unigram counts of the history
to determine the interpolation weight vector
. Typically, 100 to 1,000 buckets are
used. This method of smoothing is called deleted
interpolation [BJM83].
Other smoothing schemes have been proposed such as backing-off,
co-occurrence smoothing, and count
re-estimation. In the work on language
modeling, corpora varying in size from about a million to 500 million
words have been used to build trigram models. Vocabulary sizes
varying from 1,000 to 267,000 words have also been used. We discuss
in the following section the perplexity measure for
evaluating a language model.
Given two language models, one needs to compare them. One way is to use them in a recognizer and find the one that leads to the lower recognition error rate. This remains the best way of evaluating a language model. But to avoid this expensive approach one can use the information theory quantity of entropy to get an estimate of how good a LM might be. The basic idea is to average the log probability on per word basis for a piece of new text not used in building the language model.
Denote by p the true distribution, that is unknown to us, of a segment of new text x of k words. Then the entropy on a per word basis is defined

If every word in a vocabulary of size
|V| is equally likely then the entropy would be
; for
other distributions of the words
.
To determine the probability of this segment of text we will use our
language model denoted by
which is different from the
true unknown distribution p of the new text. We can compute the
average logprob on a per word basis defined as:

One can show that
; i.e., the
average logprob is no lower than the entropy of the test text.
Obviously our goal is to find that LM which has an average
logprob that is as close as possible to the entropy of the text.
A related measure to the average logprob called
perplexity is used to evaluate a
LM. Perplexity is defined as
. Perplexity is, crudely
speaking, a measure of the size of the set of words from which the
next word is chosen given that we observe the history of spoken
words. The perplexity of a LM depends on the domain of
discourse. For radiology reports, one expects less variation in the
sentences than in general English. Table
shows the
perplexity of several domains for large vocabulary (20,000 to 30,000
words) dictation systems. The lowest perplexity that
has been published on the standard Brown Corpus
of 1 million words of American English is about 247 which corresponds
to an entropy of 1.75 bits/character.
Table: Perplexity of trigram models for different domains.
The error rate of a speech recognizer is no less than the percentage
of spoken words that are not in its vocabulary V. So a major part
of building a language model is to select a vocabulary that will have
maximal coverage on new text spoken to the
recognizer. This remains a human intensive effort. A corpus of text
is used in conjunction with dictionaries to determine appropriate
vocabularies. A tokenizer
(a system that segments text
into words) is needed. Then a unigram count for all of
the spellings that occur in a corpus is determined. Those words that
also occur in the dictionary are included. In addition a human
screens the most frequent subset of new spellings to determine if
they are words.
Table
shows the coverage of new text using a fixed
vocabulary of a given size for English. For more inflectional
languages such as French or German
larger vocabulary sizes are required to achieve coverage similar to
that of English. For a user of a speech recognition system, a more
personalized vocabulary can be much more effective that a general
fixed vocabulary. Table
shows the coverage as new words
are added to a starting vocabulary of 20,000 words as more text is
observed. In addition, Table
indicates the size of text
recognized to add that many words. For many users, the dynamic
coverage will be much better than what is shown in Table
with coverage ranging from 98.4% to 99.6% after 800 words are
added.
Table: Static coverage of unseen text as a function of vocabulary size.
Table: Dynamic coverage of unseen text as a function of vocabulary
size and amount of new text.
A number of improvements have been proposed for the trigram LM. We give a brief overview of these models.
Class Models: Instead of using the actual
words, one can use a set of word classes (which may be overlapping,
i.e., a word may belong to many classes). Classes based on the part
of speech tags, or the morphological
analysis of words, or the semantic
information have been tried. Also,
automatically derived classes based on some statistical models of
co-occurrence have been tried (see [BDPdS
90]).
The general class model is:

If the classes are non-overlapping, then c(w) is unique and the probability is:

These tri-class models have had higher perplexities that the corresponding trigram model. However, they have led to a reduction in perplexity when linearly combined with the trigram model.
Dynamic Models: Another idea introduced in [DK90] is to take into account the document-long history to capture the burstiness of words. For example, in this section the probability that the word model will occur is much higher than its average frequency in general text. Using a cache of the recently observed words one can build a more dynamic LM using either the class model [DK90] or the trigram model [JMRS91]. Expanding on this idea, one can can also affect the probability of related words called triggered words (see [LRR93]).
Mixture Models: Another approach is based on clustering corpora into several clusters. The linear combination of cluster-specific trigram models is used for modeling new text:

where
is estimated from the
j-th cluster of text. Another type of mixture is to use a sentence
level mixture as in [IOR94].
Structure-based Models: instead of using the most recent words' identity to define the equivalence class of a history, the state of a parser has been used to define the conditioning event [GZ92]. Also, the use of link grammar to capture long distance bigrams has been proposed recently [LST92].
There are several areas of research that can be pursued for improved language modeling.