Renato De Mori & Fabio Brugnara
McGill University, Montréal, Québéc, Canada
Istituto per la Ricerca Scientifica e Tecnologica, Trento, Italy
Modern architectures for Automatic Speech Recognition (ASR) are mostly software architectures generating a sequence of word hypotheses from an acoustic signal. The most popular algorithms implemented in these architectures are based on statistical methods. Other approaches can be found in [WL90] where a collection of papers describes a variety of systems with historical reviews and mathematical foundations.
A vector of acoustic features is computed every 10 to 30 msec. Details of this component can be found in section . Various possible choices of vectors together with their impact on recognition performance are discussed in [HUGN93].
Sequences of vectors of acoustic parameters are treated as observations of acoustic word models used to compute , the probability of observing a sequence of vectors when a word sequence W is pronounced. Given a sequence , a word sequence is generated by the ASR system with a search process based on the rule:
corresponds to the candidate having maximum a-posteriori probability (MAP). is computed by Acoustic Models (AM), while is computed by Language Models (LM).
For large vocabularies, search is performed in two steps. The first generates a word lattice of the n-best word sequences with simple models to compute approximate likelihoods in real-time. In the second step more accurate likelihoods are compared with a limited number of hypotheses. Some systems generate a single word sequence hypothesis with a single step. The search produces an hypothesized word sequence if the task is dictation. If the task is understanding then a conceptual structure is obtained with a process that may involve more than two steps. Ways for automatically learning and extracting these structures are described in [KDMM94].
In a statistical framework, an inventory of elementary probabilistic models of basic linguistic units (e.g., phonemes) is used to build word representations. A sequence of acoustic parameters, extracted from a spoken utterance, is seen as a realization of a concatenation of elementary processes described by hidden Markov models (HMMs). An HMM is a composition of two stochastic processes, a hidden Markov chain, which accounts for temporal variability, and an observable process, which accounts for spectral variability. This combination has proven to be powerful enough to cope with the most important sources of speech ambiguity, and flexible enough to allow the realization of recognition systems with dictionaries of tens of thousands of words.
A hidden Markov model is defined as a pair of stochastic processes . The process is a first order Markov chain, and is not directly observable, while the process is a sequence of random variables taking values in the space of acoustic parameters, or observations.
Two formal assumptions characterize HMMs as used in speech recognition. The first-order Markov hypothesis states that history has no influence on the chain's future evolution if the present is specified, and the output independence hypothesis states that neither chain evolution nor past observations influence the present observation if the last chain transition is specified.
Letting be a variable representing observations and be variables representing model states, the model can be represented by the following parameters:
with the following definitions:
A useful tutorial on the topic can be found in [Rab89].
HMMs can be classified according to the nature of the elements of the B matrix, which are distribution functions.
Distributions are defined on finite spaces in the so called discrete HMMs. In this case, observations are vectors of symbols in a finite alphabet of N different elements. For each one of the Q vector components, a discrete density is defined, and the distribution is obtained by multiplying the probabilities of each component. Notice that this definition assumes that the different components are independent. Figure shows an example of a discrete HMM with one-dimensional observations. Distributions are associated with model transitions.
Figure: Example of a discrete HMM. A transition probability and an
output distribution on the symbol set is associated with every
transition.
Another possibility is to define distributions as probability densities on continuous observation spaces. In this case, strong restrictions have to be imposed on the functional form of the distributions, in order to have a manageable number of statistical parameters to estimate. The most popular approach is to characterize the model transitions with mixtures of base densities g of a family G having a simple parametric form. The base densities are usually Gaussian or Laplacian, and can be parameterized by the mean vector and the covariance matrix. HMMs with these kinds of distributions are usually referred to as continuous HMMs. In order to model complex distributions in this way a rather large number of base densities has to be used in every mixture. This may require a very large training corpus of data for the estimation of the distribution parameters. Problems arising when the available corpus is not large enough can be alleviated by sharing distributions among transitions of different models. In semicontinuous HMMs [HAJ90], for example, all mixtures are expressed in terms of a common set of base densities. Different mixtures are characterized only by different weights.
A common generalization of semicontinuous modeling consists of interpreting the input vector y as composed of several components , each of which is associated with a different set of base distributions. The components are assumed to be statistically independent, hence the distributions associated with model transitions are products of the component density functions.
Computation of probabilities with discrete models is faster than with continuous models, nevertheless it is possible to speed up the mixture densities computation by applying vector quantization (VQ) on the gaussians of the mixtures [Boc93].
Parameters of statistical models are estimated by iterative learning algorithms [Rab89] in which the likelihood of a set of training data is guaranteed to increase at each step.
[BDFK92] propose a method for extracting additional acoustic parameters and performing transformations of all the extracted parameters using a Neural Network (NN) architecture whose weights are obtained by an algorithm that, at the same time, estimates the coefficients of the distributions of the acoustic models. Estimation is driven by an optimization criterion that tries to minimize the overall recognition error.
Words are usually represented by networks of phonemes. Each path in a word network represents a pronunciation of the word.
The same phoneme can have different acoustic distributions of observations if pronounced in different contexts. Allophone models of a phoneme are models of that phoneme in different contexts. The decision as to how many allophones should be considered for a given phoneme may depend on many factors, e.g., the availability of enough training data to infer the model parameters.
A conceptually interesting approach is that of polyphones [STNE92]. In principle, an allophone should be considered for every different word in which a phoneme appears. If the vocabulary is large, it is unlikely that there are enough data to train all these allophone models, so models for allophones of phonemes are considered at a different level of detail (word, syllable, triphone, diphone, context independent phoneme). Probability distributions for an allophone having a certain degree of generality can be obtained by mixing the distributions of more detailed allophone models. The loss in specificity is compensated by a more robust estimation of the statistical parameters due to the increasing of the ratio between training data and free parameters to estimate.
Another approach consists of choosing allophones by clustering possible contexts. This choice can be made automatically with Classification and Regression Trees (CART). A CART is a binary tree having a phoneme at the root and, associated with each node , a question about the context. Questions are of the type, ``Is the previous phoneme a nasal consonant?'' For each possible answer (YES or NO) there is a link to another node with which other questions are associated. There are algorithms for growing and pruning CARTs based on automatically assigning questions to a node from a manually determined pool of questions. The leaves of the tree may be simply labeled by an allophone symbol. Papers by [BdSG91] and [HL91] provide examples of the application of this concept and references to the description of a formalism for training and using CARTs.
Each allophone model is an HMM made of states, transitions and probability distributions. In order to improve the estimation of the statistical parameters of these models, some distributions can be the same or tied. For example, the distributions for the central portion of the allophones of a given phoneme can be tied reflecting the fact that they represent the stable (context-independent) physical realization of the central part of the phoneme, uttered with a stationary configuration of the vocal tract.
In general, all the models can be built by sharing distributions taken from a pool of, say, a few thousand cluster distributions called senones. Details on this approach can be found in [HH93].
Word models or allophone models can also be built by concatenation of basic structures made by states, transitions and distributions. These units, called fenones, were introduced by [BBdS93b]. Richer models of the same type but using more sophisticated building blocks, called multones, are described in [BBdS93a].
Another approach consists of having clusters of distributions characterized by the same set of Gaussian probability density functions. Allophone distributions are built by considering mixtures with the same components but with different weights [DM94].
The probability of a sequence of words is computed by a Language Model (LM). In general can be expressed as follows:
Motivations for this approach and methods for computing these probabilities are described in the following section.
H2>1.5.4: Generation of Word Hypotheses
Generation of word hypotheses can result in a single sequence of words, in a collection of the n-best word sequences, or in a lattice of partially overlapping word hypotheses.
This generation is a search process in which a sequence of vectors of acoustic features is compared with word models. In this section, some distinctive characteristics of the computations involved in speech recognition algorithms will be described, first focusing on the case of a single-word utterance, and then considering the extension to continuous speech recognition.
In general, the speech signal and its transformations do not exhibit clear indication of word boundaries, so word boundary detection is part of the hypothesization process carried out as a search. In this process, all the word models are compared with a sequence of acoustic features. In the probabilistic framework, ``comparison'' between an acoustic sequence and a model involves the computation of the probability that the model assigns to the given sequence. This is the key ingredient of the recognition process. In this computation, the following quantities are used:
and can be used to compute the total emission probability as
An approximation for computing this probability consists of following only the path of maximum probability. This can be done with the quantity:
The computations of all the above probabilities share a common framework, employing a matrix called a trellis, depicted in Figure . For the sake of simplicity, we can assume that the HMM in Figure represents a word and that the input signal corresponds to the pronunciation of an isolated word.
Every trellis column holds the values of one of the just introduced probabilities for a partial sequence ending at different time instants, and every interval between two columns corresponds to an input frame. The arrows in the trellis represent model transitions composing possible paths in the model from the initial time instant to the final one. The computation proceeds in a column-wise manner, at every time frame updating the scores of the nodes in a column by means of recursion formulas which involve the values of an adjacent column, the transition probabilities of the models, and the values of the output distributions for the corresponding frame. For and coefficients, the computation starts from the leftmost column, whose values are initialized with the values of , and ends at the opposite side, computing the final value with () or (). For the coefficients, the computation goes from right to left.
The algorithm for computing coefficients is known as the Viterbi algorithm, and can be seen as an application of dynamic programming for finding a maximum probability path in a graph with weighted arcs. The recursion formula for its computation is the following:
By keeping track of the state j giving the maximum value in the above recursion formula, it is possible, at the end of the input sequence, to retrieve the states visited by the best path, thus performing a sort of time-alignment of input frames with models states.
All these algorithms have a time complexity , where M is the number of transitions with non-zero probability and T is the length of the input sequence. M can be at most equal to , where S is the number of states in the model, but is usually much lower, since the transition probability matrix is generally sparse. In fact, a common choice in speech recognition is to impose severe constraints on the allowed state sequences, for example for j<i, j>i+2, as is the case of the model in Figure .
In general, recognition is based on a search process which takes into account all the possible segmentations of the input sequence into words, and the a-priori probabilities that the LM assigns to sequences of words.
Good results can be obtained with simple LMs based on bigram or trigram probabilities. As an example, let us consider a bigram language model. This model can be conveniently incorporated into a finite state automaton as shown in Figure , where dashed arcs correspond to transitions between words with probabilities of the LM.
Figure: Bigram LM represented as a weighted word graph.
stands for , stands for . The leftmost
node is the starting node, rightmost ones are finals.
After substitution of the word-labeled arcs with the corresponding HMMs, the resulting automaton becomes a big HMM itself, on which a Viterbi search for the most probable path, given an observation sequence, can be carried out. The dashed arcs are to be treated as empty transitions, i.e., transitions without an associated output distribution. This requires some generalization of the Viterbi algorithm. During the execution of the Viterbi algorithm, a minimum of backtracking information is kept to allow the reconstruction of the best path in terms of word labels. Note that the solution provided by this search is suboptimal in the sense that it gives the probability of a single state sequence of the composite model and not the total emission probability of the best word model sequence. In practice, however, it has been observed that the path probabilities computed with the above mentioned algorithms exhibit a dominance property, consisting of a single state sequence accounting for most of the total probability [ME91].
The composite model grows with the vocabulary, and can lead to large search spaces. Nevertheless the uneven distribution of probabilities among different paths can help. It turns out that, when the number of states is large, at every time instant, a large portion of states have an accumulated likelihood which is much less than the highest one, so that it is very unlikely that a path passing through one of these states would become the best path at the end of the utterance. This consideration leads to a complexity reduction technique called beam search [NMNP92], consisting of neglecting states whose accumulated score is lower than the best one minus a given threshold. In this way, computation needed to expand bad nodes is avoided. It is clear from the naivety of the pruning criterion that this reduction technique has the undesirable property of being not admissible, possibly causing the loss of the best path. In practice, good tuning of the beam threshold results in a gain in speed by an order of magnitude, while introducing a negligible amount of search errors.
When the dictionary is of the order of tens of thousands of words, the network becomes too big, and others methods have to be considered.
At present, different techniques exist for dealing with very large vocabularies. Most of them use multi-pass algorithms. Each pass prepares information for the next one, reducing the size of the search space. Details of these methods can be found in [AHH93,ADNS94,MBDW93,KAM94].
In a first phase a set of candidate interpretations is represented in an object called word lattice, whose structure varies in different systems: it may contain only hypotheses on the location of words, or it may carry a record of acoustic scores as well. The construction of the word lattice may involve only the execution of a Viterbi beam-search with memorization of word scoring and localization, as in [ADNS94], or may itself require multiple steps, as in [AHH93,MBDW93,KAM94]. Since the word lattice is only an intermediate result, to be inspected by other detailed methods, its generation is performed with a bigram language model, and often with simplified acoustic models.
The word hypotheses in the lattice are scored with a more accurate language model, and sometimes with more detailed acoustic models. Lattice rescoring may require new calculations of HMM probabilities [MBDW93], may proceed on the basis of precomputed probabilities only [ADNS94,AHH93], or even exploit acoustic models which are not HMMs [KAM94]. In [AHH93], the last step is based on an search [Nil71] on the word lattice, allowing the application of a long distance language model, i.e., a model where the probability of a word may not only depend on its immediate predecessor. In [ADNS94] a dynamic programming algorithm, using trigram probabilities, is performed.
A method which does not make use of the word lattice is presented in [Pau94]. Inspired by one of the first methods proposed for continuous speech recognition (CSR) [Jel69], it combines both powerful language modeling and detailed acoustic modeling in a single step, performing an based search.
Interesting software architectures for ASR have been recently developed. They provide acceptable recognition performance almost in real time for dictation of large vocabularies (more than 10,000 words). Pure software solutions require, at the moment, a considerable amount of central memory. Special boards make it possible to run interesting applications on PCs.
There are aspects of the best current systems that still need improvement. The best systems do not perform equally well with different speakers and different speaking environments. Two important aspects, namely recognition in noise and speaker adaptation, are discussed in section . They have difficulty in handling out of vocabulary words, hesitations, false starts and other phenomena typical of spontaneous speech. Rudimentary understanding capabilities are available for speech understanding in limited domains. Key research challenges for the future are acoustic robustness, use of better acoustic features and models, use of multiple word pronunciations and efficient constraints for the access of a very large lexicon, sophisticated and multiple language models capable of representing various types of contexts, rich methods for extracting conceptual representations from word hypotheses and automatic learning methods for extracting various types of knowledge from corpora.