
Word Models, Lexical trees and Grammars

In this chapter we explore various issues regarding word pronunciation models, lexical trees and grammar searches. We will look at keyword spotting using lexical trees and at continuous speech recognition using a finite-state grammar search. All C code referenced in this chapter can be found in the files months.c and time.c in the tutorial/csluc subdirectory of the distribution.

Word pronunciation models

Building a recognizer for a given set of words requires knowledge of how the words are pronounced. For example, the word ``greeting'' would be pronounced phonemically as follows:

[Table: phonemic pronunciation of the word ``greeting''.]

Here the pronunciation model takes the format

          a [ (b c) | d | e ] f

This represents a word pronounced with symbol a, optionally followed by either (1) the symbols b c (grouped with parentheses), (2) the symbol d, or (3) the symbol e, and ending with f. Thus [] means optional, () groups symbols, and | denotes alternatives within the specified grouping. A phoneme symbol may be formed as a concatenation of any valid characters, except the meta characters [ ] ( ). Meta characters may, however, be escaped using a single backslash. The above example would therefore expand into the following pronunciation models.

          a f
          a b c f
          a d f
          a e f

Given the word pronunciations for each of the words in our vocabulary, we create word model objects using the buildWord function. Word model objects describe the word pronunciations in a more machine-accessible form.

typedef struct {
  char *name;
  char *pro;
} pronunT;

static pronunT months[] = { 
  {"january",   "dZ @ n [j] u 3r i:"},
  {"february",  "f E [bc] b [9r] [j] u 3r i:"},
  {"march",     "m A 9r tSc tS"},
  {"april",     "ei [pc] ph 9r I l"},
  {"may",       "m ei"},
  {"june",      "dZ u n"},
  {"july",      "dZ u l aI"},
  {"august",    "A gc g ^ s tc th"},
  {"september", "s E pc [ph] tc th E m [bc] b 3r"},
  {"october",   "A kc [kh] tc th oU bc b 3r"},
  {"november",  "n oU v E m [bc] b 3r"},
  {"december",  "dc d i: s E m [bc] b 3r"},
};
static int num_months = sizeof(months)/sizeof(pronunT);

The following example procedure shows how to create a set of word model objects for the pronunciation models defined above.

 

static int initWords(GS_Result *Cres, pronunT *pronunciations, int nwords, 
                     wordT ***words)
{
  int i;
  wordT **wordlist;

  /* Allocate the word list and clear it, so that a partial
     failure can be cleaned up below.                        */
  *words = (wordT **)malloc(nwords * sizeof(wordT *));
  wordlist = *words;
  for(i=0; i<nwords; i++) wordlist[i] = NULL;

  /* Build one word model object per pronunciation. */
  for(i=0; i<nwords; i++) {
    if(!(wordlist[i] = buildWord(Cres, pronunciations[i].pro, 
                                 pronunciations[i].name)))
      goto word_err;
  }
  
  return GS_OK;
word_err:
  /* On error, free any word models already built. */
  for(i=0; i<nwords; i++) 
    if(wordlist[i]) freeWord(wordlist[i]);
  free((char *)wordlist);
  *words = NULL;
  return GS_ERROR;
}

Figure 5.1 graphically depicts the word model for the word ``april''.

Figure 5.1:   Graphical representation of a word model object for the word ``april''.

These word objects are, however, not directly useful when building a search engine such as a lexical tree or a grammar search. Before they can be used, the word pronunciations need to be described in terms of the phoneme probability estimator's symbol set. The expandWord function is used for this purpose. The input to expandWord is a list of word models created using the buildWord function, together with the recognizer description object. As you may recall, the recognizer description object (chapter 4) describes the relation between word pronunciations and its own symbol set. The following example shows how to generate a list of word model objects in terms of the symbol set of the general-purpose recognizer supplied with the development environment.

 

static int expandWords(GS_Result *Cres, probnameT *p, wordT **words, 
                       int nwords, wordT ***xwords)
{
  int i;
  wordT **wordlist;
  
  *xwords = (wordT **)malloc(nwords * sizeof(wordT *));
  wordlist = *xwords;
  for(i=0; i<nwords; i++) wordlist[i] = NULL;
  
  for(i=0; i<nwords; i++) {
    if(!(wordlist[i] = expandWord(Cres, words[i], p)))
      goto expand_err;
    printWord(wordlist[i]);
  }        
  
  return GS_OK;
expand_err:
  for(i=0; i<nwords; i++) 
    if(wordlist[i]) freeWord(wordlist[i]);
  free((char *)wordlist);
  *xwords = NULL;
  return GS_ERROR;
}

Word models created with the expandWord function are often referred to as context-expanded word models, since the base word pronunciation models are expanded into the phoneme probability estimator's phonemes, which are typically context dependent.

Figure 5.2:   Graphical representation of the context expanded word model object for the word ``april''.

Figure 5.2 depicts the expanded word model object for the word ``april''. Once the word models have been expanded such that the word pronunciations are in terms of phoneme probability estimator symbols, they can be used to build either a lexical tree or a finite state grammar search.

Lexical trees

In a lexical tree, nodes or states in the Viterbi search are shared among words. A lexical tree consisting of the words

 {"june"       "dZ u n"}
 {"july"       "dZ u l aI"}

would combine the common initial section of each word, so that the search is described as depicted in figure 5.3.

Figure 5.3:   Graphical representation of a lexical tree.

Since the number of unique word-initial states in a lexical tree is limited to a closed set of phonemes, independent of the number of words in the vocabulary, this greatly reduces both memory and computational requirements.

By themselves, lexical trees can only be used for isolated word recognition. To increase robustness and provide word-spotting capability we need some way of distinguishing out-of-vocabulary speech events. Given the above vocabulary description, we would like the ability to recognize a spoken word even if it was uttered in a continuously spoken sentence. For example, if a person said ``I was born in january'', then we would expect to recognize the word ``january'' within the context of the rest of the words spoken. For this purpose we use the following two methods of background speech modelling.

An Any word is a special type of word in which the states or phonemes that describe the word may transition to any of the other states or phonemes in the word. Figure 5.4 depicts an Any model consisting of the phonemes .pau, s and n. Since any state or phoneme can transition to any other state or phoneme within the Any model, this particular model is very effective in handling background silence, background hissing (which typically sounds like an s phoneme), and background droning (which is modelled using the n phoneme).

Figure 5.4:   Graphical representation of the Any Model used in a lexical tree search for keyword-spotting.

 

static anyModelT any[] = {
  {"<.pau>", 1.0},
  {"s>$sil", 0.8},
  {"$sil<s", 0.8},
  {"n>$mid", 0.8},
  {"$mid<n", 0.8},
};
static int num_any = sizeof(any)/sizeof(anyModelT);

The Any model is specified by a list of the phonemes in the model and, for each phoneme, an associated state transition penalty. Since some phonemes in the Any model may also be likely initial or final states in the vocabulary, it is necessary to penalize these transitions so that they do not have the same score as when they enter or exit a word in the vocabulary.

Robustness is also greatly increased by using a garbage model, which ensures that the score for the background is higher than the ``silence'' output of our phoneme probability estimator. Two simple garbage models are currently implemented: the median of the top N sorted phoneme probabilities, and the maximum over a collection of phonemes likely to have high probabilities in noise. The ideal garbage model will score better during noise and out-of-vocabulary speech events than any of the vocabulary words.
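The garbage score computation itself is internal to the search engine, but the two garbage models described above are simple to state. The following sketch is illustrative only: the function names, the descending-sort helper and the noise-phoneme index list are hypothetical and are not part of the CSLU-C API. It assumes a single frame of phoneme probabilities, as produced by the phoneme probability estimator.

#include <stdlib.h>   /* malloc, qsort, free */

/* Sketch only: none of these names belong to the CSLU-C API. */
static int cmpDescending(const void *a, const void *b)
{
  float fa = *(const float *)a, fb = *(const float *)b;
  return (fa < fb) - (fa > fb);          /* sort largest first */
}

/* Median of the top N phoneme probabilities in one frame. */
static float garbageMedianTopN(const float *prob, int num_prob, int N)
{
  float *sorted = (float *)malloc(num_prob * sizeof(float));
  float score;
  int i;

  for(i = 0; i < num_prob; i++) sorted[i] = prob[i];
  qsort(sorted, num_prob, sizeof(float), cmpDescending);
  if(N > num_prob) N = num_prob;
  score = sorted[N/2];                   /* middle of the N largest values */
  free(sorted);
  return score;
}

/* Maximum over a (hypothetical) set of phonemes that tend to score
   high in noise, given as indices into the probability frame.      */
static float garbageMaxNoise(const float *prob, const int *noise_idx, int n_noise)
{
  float best = prob[noise_idx[0]];
  int i;

  for(i = 1; i < n_noise; i++)
    if(prob[noise_idx[i]] > best) best = prob[noise_idx[i]];
  return best;
}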

Given the Any model and the list of expanded word models, we are now ready to create our search engine, using the buildTreeSearch function.

   

int buildSearch(GS_Result *result, recogT *recog) 
{
        .
        .
  if(!buildTreeSearch(result, &(recog->tree), xwords, nwords, any, num_any,
                      DEF_NBEST, DEF_FRAMESZ, recog->rr))
    goto build_err;
  initTreeSearch(recog->tree, DEF_SHORTPEN, DEF_LONGPEN,
                 1.0, DEF_NBEST);
        .
        .
}

The default values DEF_NBEST, DEF_FRAMESZ, DEF_SHORTPEN and DEF_LONGPEN are defined in the header file tree.h.

The lexical tree search functions as a word spotter whose only objective is to find the most likely word within a spoken utterance. Typically it is not necessary to know exactly where in time the word occurred, but rather whether a word in the specified vocabulary was spoken or not. There are, however, cases where we wish to know exactly where the word occurred in time and what the corresponding phoneme alignment was. This is especially useful for automatic phoneme labeling or for word confidence measurements.

The lexical tree search has the capability of remembering the phoneme alignment of the requested N-best words recognized. Each time a state in a word is entered, this information is stored on what is referred to as a backtrace heap. After recognition the best path is retrieved from the backtrace heap by tracing the state sequence back in time. A backtrace heap can be created as follows:

 

int buildSearch(GS_Result *result, recogT *recog) 
{
        .
        .
  recog->btheap = btHeapInit(20000,20000);
        .
        .
}

Here the first input parameter of the btHeapInit function (20000) specifies the initial heap size, and the second value the heap growth size. Whenever the backtrace heap reaches full capacity it grows by the number of cells specified by the second parameter.

This same backtrace heap may also be used when doing a grammar search to retrieve the best word and/or phoneme alignment; this will be explained later in the chapter. The updateTreeSearch function is used to perform a pipelined recognition update. The input is assumed to be log probabilities. In this example we use a wrapper function to convert the neural network output probabilities to log probabilities for the search.

 

static int SearchUpdate(recogT *recog, float *prob, int num_prob)
{
  int i;
  float *fbuf;

  fbuf = (float *)dbmalloc(num_prob * sizeof(float));

  for(i=0; i<num_prob; i++)
    fbuf[i] = log((prob[i] < 0.00001)? 0.00001:(double)prob[i]);
  updateTreeSearch(recog->tree,recog->btheap,fbuf);

  dbfree(fbuf);
  return 1;
}

After all frames of speech have been processed, the N best search answers and associated alignments may be retrieved through multiple calls to the searchTreeNextBest function (a sketch of retrieving several hypotheses in a loop is given after the example below). Note that the phoneme-level alignment will only be available if a backtrace heap was used during recognition.

static int PerformRecognition(recogT *recog, short *speech, int numsamples)
{
        .
        .
  best_word = searchTreeNextBest(recog->tree,recog->btheap, &phtokens,
                                 &n_phtokens, &score);
  printf("%s: %f\n", recog->tree->words[best_word], score);
  for(i=0; i<n_phtokens; i++) {
    printf("%10s %3d %3d %7.4f\n", phtokens[i].name, phtokens[i].begin,
           phtokens[i].end, phtokens[i].score);
  }
        .
        .
}
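The example above retrieves only the single best hypothesis. Since the N best answers are obtained by calling searchTreeNextBest repeatedly, a retrieval loop could look like the sketch below. This is illustrative only: it assumes DEF_NBEST hypotheses were requested when the tree search was initialized, and the declarations of best_word, phtokens, n_phtokens and score are elided, as in the example above.

  /* Sketch only: pull back the N best hypotheses by calling
     searchTreeNextBest repeatedly, assuming DEF_NBEST hypotheses
     were requested when the tree search was initialized.         */
  for(n = 0; n < DEF_NBEST; n++) {
    best_word = searchTreeNextBest(recog->tree, recog->btheap, &phtokens,
                                   &n_phtokens, &score);
    printf("%d: %s %f\n", n+1, recog->tree->words[best_word], score);
    for(i = 0; i < n_phtokens; i++)
      printf("  %10s %3d %3d %7.4f\n", phtokens[i].name, phtokens[i].begin,
             phtokens[i].end, phtokens[i].score);
  }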

Figure 5.5 depicts the spoken word ``january'', together with its best alignment.

Figure 5.5:   Phoneme alignment for the recognized word ``january''.

Building grammars

In the previous section we discussed keyword spotting using lexical trees. Keyword spotting, although useful, has limited capabilities. For example, scheduling an appointment would be very difficult to do using only a keyword spotting mechanism, since the number of possible phrases can grow exponentially, depending on the task requirements. To make keyword spotting work for such a task would require less natural communication between man and machine, and furthermore each phrase would have to be created as a unique word.

Instead we would like the user to respond in a much more natural form, e.g. ``seven thirty pm''. For this purpose we need to create a grammar which will direct the Viterbi search to recognize the expected sequence of words. Figure 5.6 depicts a typical state based grammar for scheduling an appointment.

Figure 5.6:   State based grammar for scheduling an appointment.

The syntax rules for the definition of the grammar network are as follows. A node in the network is represented by a model name corresponding to a word pronunciation model. Additionally each node holds an external name which is used during the search when printing the identity of the node in the search path.

          name  = char{char}
          model = name["%" ( "%" | name )]

Here char represents any character except one of the special characters {} [] <> | = $ () ;. These may, however, be escaped using a single backslash. The first part of the model definition is the internal name; this corresponds to the name of the word model used. The second, optional, part is the external name. If the external name is not specified then it is set to be the same as the internal name. If a second % symbol is specified then the external name is set to a zero-length string. This option may be used, for example, to specify silence and/or noise models so that they will not appear in the search output string.
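For example (a hypothetical fragment, not taken from the distribution, with placeholder model names), the three forms behave as follows:

          january          external name is ``january'' (same as the internal name)
          sil%silence      external name is ``silence''
          sil%%            external name is empty; the node is omitted from the output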

Network definitions may also contain variables.

          variable = $name
Variables are identified with a leading $ character. They are used to define sub-networks, and must be defined before they appear on the right-hand side of a rule using the form

          subnet = variable ``='' expr ``;''

An expression (expr) consists of a set of alternative sequences representing parallel branches of the network.

          expr = sequence{``|'' sequence} 
          sequence = factor{factor}

Each sequence is composed of a series of factors, where a factor is either a model name, a variable representing some sub-network, or an expression contained within various sorts of brackets.

          factor = ``('' expr ``)''   |
                   ``{'' expr ``}''   |
                   ``<'' expr ``>''   |
                   ``['' expr ``]''   |
                   model              |
                   variable

Ordinary parentheses () denote simple factoring, curly braces {} denote zero or more repetitions, and angle brackets <> denote one or more repetitions. Square brackets [] are used to enclose optional items. The complete network is defined as a list of sub-network definitions followed by a single expression within parentheses.

          network = {subnet} ``('' expr ``)''

C-style comments may be placed anywhere in the text of the network definition.
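As a small illustration of the bracket types and comments (a hypothetical fragment, not taken from the distribution, with placeholder model names):

          /* one or more digits, optionally preceded and followed by silence */
          $digit = one | two | three;
          ( [sil%%] < $digit > [sil%%] )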

The buildGrammar function creates a grammar object with nodes and transitions describing the desired grammar state transitions. The following example shows the steps needed to build a grammar object. The grammar specification for this example can be found in the file time.grammar.

$hour   =  one | two | three | four | five | six |  
           seven | eight | nine | ten | eleven | twelve; 
$ampm    =  am | pm; 
$halfday =  noon | midnight; 
$subhour =  thirty | oclock; 
$any     =  { justsil%% | sil%% };
$grammar =  $any (  
                  [twelve] [justsil%%] $halfday | 
                  $hour [justsil%%] [$subhour] [justsil%%] [$ampm]
                 ) $any;

The full source code for all code discussed in this section can be found in the example code time.c.

 

static int buildSearch(GS_Result *result, recogT *recog) 
{
        .
        .
  grammarT *g;
        .
        .
  nwords = num_time;
  if(initWords(Cres, time, nwords, &words) != GS_OK) 
    goto build_err;
  if(expandWords(Cres, recog->rr, words, nwords, &xwords) != GS_OK) 
    goto build_err;
  if(!(g=buildGrammar(Cres, buffer, "time"))) 
    goto build_err;
        .
        .
  freeGrammar(g);
        .
        .
}

Given the grammar object, we are now ready to build the finite state grammar search which will direct the Viterbi search according to the required word transitions. This is done using the createSearch function.

 

static int buildSearch(GS_Result *result, recogT *recog) 
{
        .
        .
  if(!(recog->search = createSearch(Cres,recog->rr, &g, 1, xwords, nwords,
                                    1.0, DEF_SHORTPEN, DEF_LONGPEN, 
                                    DEF_FRAMESZ, 1))) 
    goto build_err;
  recog->btheap = btHeapInit(20000,20000);
        .
        .
}

Note also that it is necessary to use a backtrace heap, since the grammar-based search needs it to store the word and/or phoneme level alignment information. Similar to the updateTreeSearch function, we can now perform an updateSearch on an example file. The updateSearch function updates the state scores using each phoneme probability frame in the input phoneme probability matrix.

 

static int SearchUpdate(recogT *recog, float *prob, int num_prob)
{
  int i;
  float *fbuf;

  fbuf = (float *)dbmalloc(num_prob * sizeof(float));

  for(i=0; i<num_prob; i++)
    fbuf[i] = log((prob[i] < 0.00001)? 0.00001:(double)prob[i]);
  updateSearch(recog->search,recog->btheap,fbuf);

  dbfree(fbuf);
  return 1;
}

After all frames of speech have been processed, the best search answer and the associated word and phoneme alignments may be retrieved using the findSearchBest function. Note that the phoneme-level alignment will only be available if it was requested during the search initialization phase. Since the grammar search may be created as a collection of a number of sub-grammars, the findSearchBest function also returns the name of the subgrammar corresponding to the best word path.

static int PerformRecognition(recogT *recog, short *speech, int numsamples)
{
        .
        .
  gname = findSearchBest(recog->search, recog->btheap, &wrdtokens,
                         &n_wrdtokens, &phtokens, &n_phtokens, &score);
  for(i=0; i<n_wrdtokens; i++) 
    printf("%s ", wrdtokens[i].name);
  printf("\n");
  printf("score: %f\n", score);
  printf("\n");
  
  for(i=0; i<n_wrdtokens; i++) {
    printf("%s %d %d  %f\n", wrdtokens[i].name, wrdtokens[i].begin,     
           wrdtokens[i].end, wrdtokens[i].score);
  }
  printf("\n");
        .
        .
}
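If the phoneme-level alignment was requested when the search was created, the phoneme tokens returned by findSearchBest can be printed in the same way as in the lexical tree example. The fragment below is only a sketch extending the loops above; it also prints the subgrammar name returned in gname.

  /* Sketch only: print the subgrammar name and the phoneme-level
     alignment, assuming it was requested at search creation time. */
  printf("subgrammar: %s\n", gname);
  for(i=0; i<n_phtokens; i++) {
    printf("%10s %3d %3d %7.4f\n", phtokens[i].name, phtokens[i].begin,
           phtokens[i].end, phtokens[i].score);
  }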

Figure 5.7:   Word and phoneme alignment for the recognized word string.

Figure 5.7 depicts the word level alignment for this test utterance.



Johan Schalkwyk
Wed Nov 27 10:08:24 PST 1996