next up previous contents index
Next: 1.7: Speaker Recognition Up: Spoken Language Input Previous: 1.5: HMM Methods in Speech Recognition

1.6: Language Representation

Salim Roukos
IBM T. J. Watson Research Center, Yorktown Heights, New York, USA

A speech recognizer converts the observed acoustic signal into the corresponding orthographic representation of the spoken sentence. The recognizer chooses its guess from a finite vocabulary of words that can be recognized. For simplicity, we assume that a word is uniquely identified by its spelling.gif

Dramatic progress has been demonstrated in solving the speech recognition problem via the use of a statistical model of the joint distribution of the sequence of spoken words W and the corresponding observed sequence of acoustic information O. This approach, pioneered by the IBM Continuous Speech Recognition group, is called the source-channel model. In this approach, the speech recognizer determines an estimate of the identity of the spoken word sequence from the observed acoustic evidence O by using the a posteriori distribution . To minimize its error rate, the recognizer chooses that word sequence that maximizes the a posteriori distribution:

where is the probability of the sequence of n-words W and is the probability of observing the acoustic evidence O when the sequence W is spoken. The a priori distribution of what words might be spoken (the source) is referred to as a language model (LM). The observation probability model (the channel) is called the acoustic model. We discuss in this section, various approaches and issues for building the language model.

The source-channel model has also been used in optical character recognition (OCR) where the observation sequence is the image of the printed characters, in handwriting recognition where the observation is the sequence of strokes on a tablet, or in machine translation (MT) where the observation is a sequence of words in one language and W represents the desired translation in another language. For all these applications, a language model is key. Therefore, the work on language modeling has a wide spectrum of applications.

1.6.1: Trigram Language Model

For a given word sequence of n words, we rewrite the LM probability as:

where is chosen appropriately to handle the initial condition. The probability of the next word depends on the history of words that have been spoken so far. With this factorization the complexity of the model grows exponentially with the length of the history. To have a more practical and parsimonious model, only some aspects of the history are used to affect the probability of the next word. One waygif to achieve this is to use a mapping that divides the space of histories into K equivalence classes. Then we can use as a model:

Some of the most successful models of the past two decades are the simple n-gram models, particularly the trigram model (n=3) where only the most recent two words of the history are used to condition the probability of the next word. The probability of a word sequence becomes:

To estimate the trigram probabilities, one can use a large corpus of text, called the training corpus to estimate trigram frequencies:

where is the number of times the sequence of words is observed and is the number of times the sequence is observed. For a vocabulary size V there are possible trigrams, which for 20,000 words translates to 8 trillion trigrams. Many of these trigrams will not be seen in the training corpus. So these unseen trigrams will have zero probability using the trigram frequency as an estimate of the trigram probability. To solve this problem one needs a smooth estimate of the probability of unseen events. This can be done by linear interpolation of trigram, bigram, and unigram frequencies and a uniform distribution on the vocabulary.

where and are estimated by the ratio of the appropriate bigram and unigram counts. The weights of the linear interpolation are estimated by maximizing the probability of new held-out data different from the data used to estimate the n-gram frequencies. The forward-backward algorithm can be used to perform this maximum likelihood estimation problem.

In general, one uses more than one vector; one may want to rely more on the trigram frequencies for those histories that have a high count as compared to those histories that have a low count in the training data. To achieve this, one can use a bucketing scheme on the bigram and unigram counts of the history to determine the interpolation weight vector . Typically, 100 to 1,000 buckets are used. This method of smoothing is called deleted interpolation [BJM83]. Other smoothing schemes have been proposed such as backing-off, co-occurrence smoothing, and count re-estimation. In the work on language modeling, corpora varying in size from about a million to 500 million words have been used to build trigram models. Vocabulary sizes varying from 1,000 to 267,000 words have also been used. We discuss in the following section the perplexity measure for evaluating a language model.

1.6.2: Perplexity

Given two language models, one needs to compare them. One way is to use them in a recognizer and find the one that leads to the lower recognition error rate. This remains the best way of evaluating a language model. But to avoid this expensive approach one can use the information theory quantity of entropy to get an estimate of how good a LM might be. The basic idea is to average the log probability on per word basis for a piece of new text not used in building the language model.

Denote by p the true distribution, that is unknown to us, of a segment of new text x of k words. Then the entropy on a per word basis is defined

If every word in a vocabulary of size |V| is equally likely then the entropy would be ; for other distributions of the words .

To determine the probability of this segment of text we will use our language model denoted by which is different from the true unknown distribution p of the new text. We can compute the average logprob on a per word basis defined as:

One can show that ; i.e., the average logprob is no lower than the entropy of the test text. Obviously our goal is to find that LM which has an average logprob that is as close as possible to the entropy of the text.

A related measure to the average logprob called perplexity is used to evaluate a LM. Perplexity is defined as . Perplexity is, crudely speaking, a measure of the size of the set of words from which the next word is chosen given that we observe the history of spoken words. The perplexity of a LM depends on the domain of discourse. For radiology reports, one expects less variation in the sentences than in general English. Table gif shows the perplexity of several domains for large vocabulary (20,000 to 30,000 words) dictation systems. The lowest perplexity that has been published on the standard Brown Corpus of 1 million words of American English is about 247 which corresponds to an entropy of 1.75 bits/character.


Table: Perplexity of trigram models for different domains.

1.6.3: Vocabulary Size

The error rate of a speech recognizer is no less than the percentage of spoken words that are not in its vocabulary V. So a major part of building a language model is to select a vocabulary that will have maximal coverage on new text spoken to the recognizer. This remains a human intensive effort. A corpus of text is used in conjunction with dictionaries to determine appropriate vocabularies. A tokenizergif (a system that segments text into words) is needed. Then a unigram count for all of the spellings that occur in a corpus is determined. Those words that also occur in the dictionary are included. In addition a human screens the most frequent subset of new spellings to determine if they are words.

Table gif shows the coverage of new text using a fixed vocabulary of a given size for English. For more inflectional languages such as French or German larger vocabulary sizes are required to achieve coverage similar to that of English. For a user of a speech recognition system, a more personalized vocabulary can be much more effective that a general fixed vocabulary. Table gif shows the coverage as new words are added to a starting vocabulary of 20,000 words as more text is observed. In addition, Table gif indicates the size of text recognized to add that many words. For many users, the dynamic coverage will be much better than what is shown in Table gif with coverage ranging from 98.4% to 99.6% after 800 words are added.


Table: Static coverage of unseen text as a function of vocabulary size.


Table: Dynamic coverage of unseen text as a function of vocabulary size and amount of new text.

1.6.4: Improved Language Models

A number of improvements have been proposed for the trigram LM. We give a brief overview of these models.

Class Models: Instead of using the actual words, one can use a set of word classes (which may be overlapping, i.e., a word may belong to many classes). Classes based on the part of speech tags, or the morphological analysis of words, or the semantic information have been tried. Also, automatically derived classes based on some statistical models of co-occurrence have been tried (see [BDPdS90]). The general class model is:

If the classes are non-overlapping, then c(w) is unique and the probability is:

These tri-class models have had higher perplexities that the corresponding trigram model. However, they have led to a reduction in perplexity when linearly combined with the trigram model.

Dynamic Models: Another idea introduced in [DK90] is to take into account the document-long history to capture the burstiness of words. For example, in this section the probability that the word model will occur is much higher than its average frequency in general text. Using a cache of the recently observed words one can build a more dynamic LM using either the class model [DK90] or the trigram model [JMRS91]. Expanding on this idea, one can can also affect the probability of related words called triggered words (see [LRR93]).

Mixture Models: Another approach is based on clustering corpora into several clusters. The linear combination of cluster-specific trigram models is used for modeling new text:

where is estimated from the j-th cluster of text. Another type of mixture is to use a sentence level mixture as in [IOR94].

Structure-based Models: instead of using the most recent words' identity to define the equivalence class of a history, the state of a parser has been used to define the conditioning event [GZ92]. Also, the use of link grammar to capture long distance bigrams has been proposed recently [LST92].

1.6.5: Future Directions

There are several areas of research that can be pursued for improved language modeling.



next up previous contents
Next: 1.7: Speaker Recognition Up: Spoken Language Input Previous: 1.5: HMM Methods in Speech Recognition