
Feature Extraction

Speech recognition, at its most elementary level, comprises a collection of algorithms drawn from a wide variety of disciplines, including statistical pattern recognition, communication theory, signal processing, and linguistics. Although each of these areas is relied on to varying degrees in different recognizers, the most important common denominator of all recognition systems is the signal processing front-end, which converts the speech waveform to some type of parametric representation. This parametric representation is then used for further analysis and processing. This chapter is devoted to the signal processing algorithms currently available within the CSLU-C environment. These are power spectral analysis (FFT), linear predictive analysis (LPC), perceptual linear prediction (PLP), Mel scale cepstral analysis (MEL), first-order derivative computation, relative spectral filtering (RASTA), and energy normalization.

In this chapter we present examples of how to use each of these signal processing routines. The file analysis.c contains the complete source code for all examples discussed in this chapter.

Power spectral analysis (FFT)

One of the more common ways of studying a speech signal is through its power spectrum. The power spectrum of a speech signal describes the frequency content of the signal over time.

The first step in computing the power spectrum of the speech signal is to perform a Discrete Fourier Transform (DFT), which computes the frequency-domain representation of a time-domain signal. Because a speech signal contains only real-valued samples, we can use a real-input Fast Fourier Transform (FFT) for increased efficiency. The resulting output contains both the magnitude and phase information of the original time-domain signal.

fftDefault returns an fftParamT structure which contains the default parameters needed for the FFT analysis. This parameter structure may be altered directly if parameters other than the defaults are needed.

   

  {
    fftParamT *param;
    fftdataT *p;
    float **fft;
    int i,j,frames;

    /* create a fft data object using default parameters */
    param = fftDefault();
    p = fftInit(param);
    free((char *)param);
        .
        .
  }

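In this example we use the VECALLOC macro to allocate a two-dimensional array. The VECFREE macro is used to free this allocated memory. Note that VECALLOC uses a loop variable i, which must be declared as an int in the enclosing scope.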

   

#define VECALLOC(v,y,x) {v=(float **)dbmalloc((y)*sizeof(float *)); \
v[0]=(float *)dbmalloc((y) * (x) * sizeof(float)); \
memset((char *)v[0], 0, (y)*(x)*sizeof(float)); \
for(i=1; i<(y); i++) v[i] = v[i-1] + (x);}

#define VECFREE(v) {dbfree((char *)v[0]); dbfree((char *)v);}
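The same layout (one table of row pointers plus one contiguous, zeroed block of data) can also be written as a pair of functions, which avoids the macro's dependence on an outer loop variable and lets the caller check for allocation failure. This is a sketch only: vec_alloc and vec_free are hypothetical names, and the standard malloc, calloc and free are used here in place of the library's dbmalloc and dbfree.

```c
#include <stdlib.h>

/* Hypothetical function form of VECALLOC (not part of CSLU-C).
 * Returns a rows x cols array of zeroed floats, or NULL on failure. */
float **vec_alloc(size_t rows, size_t cols)
{
    float **v;
    size_t r;

    if (rows == 0 || cols == 0)
        return NULL;

    v = malloc(rows * sizeof *v);           /* table of row pointers */
    if (v == NULL)
        return NULL;

    v[0] = calloc(rows * cols, sizeof **v); /* one zeroed data block */
    if (v[0] == NULL) {
        free(v);
        return NULL;
    }

    /* each row pointer addresses the next cols floats in the block */
    for (r = 1; r < rows; r++)
        v[r] = v[r - 1] + cols;

    return v;
}

/* Hypothetical function form of VECFREE. */
void vec_free(float **v)
{
    if (v != NULL) {
        free(v[0]);
        free(v);
    }
}
```

Because the data is contiguous, v[0] can also be treated as a flat array of rows * cols floats.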

In all of the analysis routines defined in the analysis library the *BufferFrames and *BufferCompute functions are used to present the speech waveform according to the desired frame increment and analysis window size. It is assumed that the necessary memory has been allocated to store the output values.
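As a rough illustration of how a frame count follows from the analysis window size and the frame increment, consider the sketch below. The function count_frames and its convention of counting only complete windows are assumptions for this example; the library's own *BufferFrames functions may handle the final partial window differently.

```c
/* Hypothetical helper (not part of CSLU-C): number of complete
 * analysis frames in a buffer. window and increment are in samples. */
int count_frames(int numsamples, int window, int increment)
{
    if (window <= 0 || increment <= 0 || numsamples < window)
        return 0;

    /* one frame for the first window, plus one for every further
     * increment that still leaves room for a full window */
    return 1 + (numsamples - window) / increment;
}
```

With these assumptions, one second of speech sampled at 8 kHz, a 200-sample window and an 80-sample increment give count_frames(8000, 200, 80) = 98 frames.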

fftBufferFrames calculates the number of output frames based on the parameters of the fftdataT structure. The next step is then to allocate the necessary memory based on the total number of frames computed and then process the data using the fftBufferCompute function call, as shown below.

The logPowerSpectrum function converts this output to contain only log magnitude information, which can then be used to plot the familiar spectrogram.

     

    /* frame based fft analysis */
    frames = fftBufferFrames(p, numsamples);
    VECALLOC(fft, frames, p->fftlength);
    fftBufferCompute(fspeech, numsamples, p, fft);
    for(i=0; i<frames; i++)
      logPowerSpectrum(fft[i], fft[i], p->fftlength);

Figure 3.1 depicts the spectrogram plotted using the power spectral analysis of the speech waveform.

Figure 3.1:   Power spectral analysis of speech.

Linear predictive analysis (LPC)

One of the more powerful analysis techniques is the method of linear prediction. Linear predictive analysis of speech has become the predominant technique for estimating the basic parameters of speech. Linear predictive analysis provides both an accurate estimate of the speech parameters and also an efficient computational model of speech.

The basic idea behind linear predictive analysis is that a speech sample at the current time can be approximated as a linear combination of past speech samples. By minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the linearly predicted values, a unique set of parameters, or predictor coefficients, can be determined. These coefficients form the basis for linear predictive analysis of speech. Linear predictive coefficients are computed using the LPC functions defined in the analysis library.
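One standard way to carry out this minimization is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below illustrates that technique for a single frame; it is not taken from the CSLU-C sources, and the function name lpc_autocorrelation, the LPC_MAX_ORDER limit, and the absence of windowing and pre-emphasis are all assumptions made to keep the example short.

```c
#include <stddef.h>

#define LPC_MAX_ORDER 32   /* arbitrary limit chosen for this sketch */

/* Hypothetical helper (not part of CSLU-C). Estimates a[1..order] so
 * that s[n] is approximated by sum over k of a[k] * s[n-k].
 * a must hold order+1 doubles; a[0] is unused and set to 0.
 * Returns the residual prediction error energy, or -1.0 on bad input. */
double lpc_autocorrelation(const float *s, size_t n, int order, double *a)
{
    double r[LPC_MAX_ORDER + 1], tmp[LPC_MAX_ORDER + 1];
    double err;
    int i, j;
    size_t t;

    if (order < 1 || order > LPC_MAX_ORDER || n <= (size_t)order)
        return -1.0;

    /* autocorrelation of the frame for lags 0..order */
    for (i = 0; i <= order; i++) {
        r[i] = 0.0;
        for (t = (size_t)i; t < n; t++)
            r[i] += (double)s[t] * s[t - (size_t)i];
    }
    if (r[0] <= 0.0)
        return -1.0;   /* silent frame: nothing to predict */

    /* Levinson-Durbin recursion */
    for (i = 0; i <= order; i++)
        a[i] = 0.0;
    err = r[0];
    for (i = 1; i <= order; i++) {
        double k = r[i];   /* becomes the i-th reflection coefficient */

        for (j = 1; j < i; j++)
            k -= a[j] * r[i - j];
        k /= err;

        for (j = 1; j < i; j++)
            tmp[j] = a[j] - k * a[i - j];
        for (j = 1; j < i; j++)
            a[j] = tmp[j];
        a[i] = k;

        err *= (1.0 - k * k);
        if (err <= 0.0)
            break;         /* signal is perfectly predictable */
    }
    return err;
}
```

For a decaying exponential s[n] = 0.9^n, each sample is 0.9 times the previous one, so a second-order analysis yields a[1] close to 0.9 and a[2] close to 0.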

In practice, the predictor coefficients themselves are never used in recognition, since they typically show high variance. Instead, the predictor coefficients are transformed to a more robust set of parameters known as cepstral coefficients [].
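The transformation from predictor coefficients to cepstral coefficients can be done with a well-known recursion, sketched below for the all-pole model 1 / (1 - sum of a[k] z^-k). The function name lpc_to_cepstrum is hypothetical, the gain term c[0] is left out, and the library may scale or lifter its cepstra differently.

```c
/* Hypothetical helper (not part of CSLU-C). Converts predictor
 * coefficients a[1..p] into cepstral coefficients c[1..ncep] using
 *   c[n] = a[n] + sum_{k=1}^{n-1} (k/n) * c[k] * a[n-k]
 * where a[n] is taken as 0 for n > p. Index 0 of both arrays is
 * unused, so a needs p+1 elements and c needs ncep+1 elements. */
void lpc_to_cepstrum(const double *a, int p, double *c, int ncep)
{
    int n, k;

    for (n = 1; n <= ncep; n++) {
        double sum = (n <= p) ? a[n] : 0.0;

        for (k = 1; k < n; k++) {
            if (n - k <= p)
                sum += ((double)k / n) * c[k] * a[n - k];
        }
        c[n] = sum;
    }
}
```

As a check, a first-order model with a[1] = 0.5 gives c[n] = 0.5^n / n, which is the series expansion of -ln(1 - 0.5 z^-1).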

The lpcDefault function defines the parameters that govern the LPC computation process. lpcDefault returns an lpcParamT structure which contains the parameters needed for the frame-based LPC computation. This parameter structure may then be altered directly if values other than the defaults are needed.

   

  {
    lpcParamT *param;
    lpcdataT *p;
    float **lpc;
    int i, j, frames;

    /* create a lpc data object using default parameters */
    param = lpcDefault();
    param->doenergy = 1;
    p = lpcInit(param);
    free((char *)param);
       .
       .
  }

After the parameters have been defined, we can compute the frame-based LPC analysis of the speech signal. As in the FFT example, we use the lpcBufferFrames and lpcBufferCompute functions to process the speech, computing the LP cepstral coefficients for each frame of speech.

   

    /* frame based lpc analysis */
    frames = lpcBufferFrames(p, numsamples);
    VECALLOC(lpc, frames, p->lpc_output);
    lpcBufferCompute(fspeech, numsamples, p, lpc);

Figure 3.2 depicts the LPC spectrogram computed from a 12th-order LPC analysis of the speech waveform.

Figure 3.2:   Linear predictive analysis of speech.

Perceptual linear prediction (PLP)

Perceptual linear prediction, like LPC analysis, is based on the short-term spectrum of speech. In contrast to pure linear predictive analysis of speech, perceptual linear prediction (PLP) modifies the short-term spectrum of the speech by several psychophysically based transformations. The PLP cepstral coefficients are computed using the PLP functions defined in the analysis library.

Like most other techniques based on the short-term spectrum, this method is vulnerable when the short-term spectral values are modified by the frequency response of the communication channel. The PLP functions provide a limited capability for dealing with these distortions by employing a RASTA [] (Relative Spectral) filter, which makes PLP analysis more robust to linear spectral distortions.

Before we can compute the frame-based PLP analysis, we first need to define the parameters which govern the computation process. The default parameters are defined using the plpDefault function. plpDefault returns a plpParamT structure which contains the parameters needed for the frame-based PLP computation. This parameter structure may be altered directly if parameters other than the defaults are needed.

   

  {
    plpParamT *param;
    plpdataT *p;
    float **plp;
    int i, j, frames;

    /* create a plp data object using default parameters */
    param = plpDefault();
    param->rasta = 0.0;
    param->doenergy = 1; 
    if(!(p = plpInit(Cres, param))) {
      fprintf(stderr, "plpInit: %s\n", Cres->result);
      exit(1);
    }
    free((char *)param);
        .
        .
  }

The rasta coefficient, set to 0.0 in this example, indicates that no RASTA processing is done. This value may range from 0.0 (no RASTA) to 1.0 (full RASTA). For intermediate values, the output represents a mixture of RASTA-filtered and unfiltered PLP cepstral coefficients.

After the parameters have been defined, we can compute the frame-based PLP analysis of the speech signal. The plpBufferFrames and plpBufferCompute functions compute the PLP cepstral coefficients for each frame of speech.

   

    /* frame based plp analysis */
    frames = plpBufferFrames(p, numsamples);
    VECALLOC(plp, frames, p->plp_output);
    plpBufferCompute(fspeech, numsamples, p, plp);

Figure 3.3 depicts the PLP spectrogram computed from a 7th-order PLP analysis of the speech waveform.

Figure 3.3:   Perceptual linear predictive (PLP) analysis of speech.

Mel scale cepstral analysis (MEL)

Mel scale cepstral analysis is very similar to perceptual linear predictive analysis of speech, in that the short-term spectrum is modified by psychophysically based spectral transformations. In this method, however, the spectrum is warped according to the Mel scale, whereas in PLP the spectrum is warped according to the Bark scale. The main difference between Mel scale cepstral analysis and perceptual linear prediction lies in how the output cepstral coefficients are obtained. The PLP model discussed above uses an all-pole model to smooth the modified power spectrum, and the output cepstral coefficients are then computed from this model. In contrast, Mel scale cepstral analysis uses cepstral smoothing to smooth the modified power spectrum. This is done by direct transformation of the log power spectrum to the cepstral domain using an inverse Discrete Fourier Transform (DFT).

Similar to PLP, Mel scale analysis has the option of using a RASTA filter to compensate for linear channel distortions. The default Mel scale analysis parameters are defined using the melDefault function. melDefault returns a melParamT structure which contains the necessary parameters needed for the Mel scale frame based computation.

   

{
    melParamT *param;
    meldataT *p;
    float **mel;
    int i, j, frames;

    /* create a mel data object using default parameters */
    param = melDefault();
    param->mel_order = 10;
    param->rasta = 0.0;
    param->doenergy = 1;
    p = melInit(Cres, param);
    free((char *)param);
}

As in the examples above, the melBufferFrames and melBufferCompute functions compute the frame-based Mel scale cepstral analysis for each frame of speech.

   

    /* frame based mel cepstral analysis */
    frames = melBufferFrames(p, numsamples);
    VECALLOC(mel,frames,p->mel_output);
    melBufferCompute(fspeech,numsamples,p,mel);

Figure 3.4 depicts the MEL spectrogram computed from an 8th-order MEL analysis of the speech waveform.

Figure 3.4:   Mel scale filter bank analysis of speech.

Computing the first-order derivative

Another useful signal processing technique when studying robust features is the time derivative of the feature vector. The set of delta functions defined in the analysis library computes the first-order time derivative of an input feature vector sequence.

The deltaCompute function computes the first-order time derivative using a 1st-order difference approximation. Higher-order approximations may be selected by changing the order parameter of the deltaParamT structure. As with the other signal processing functions, the default parameters may be retrieved using the deltaDefault function.

   

  {
    lpcParamT *lpcparam;
    lpcdataT *p;
    deltaParamT *deltaparam;
    deltaT *dd;
    float **lpc;
    float **delta;
    int i,j, frames;

    /* create a lpc data object using default parameters */
    lpcparam = lpcDefault();
    p = lpcInit(lpcparam);

    /* create a delta data object */
    deltaparam = deltaDefault();
    deltaparam->fdim = lpcparam->lpc_output;
    dd = deltaInit(deltaparam);

    /* free lpcparam only after its last use above */
    free((char *)lpcparam);
        .
        .
  }

The deltaBufferCompute function implements the first-order time derivative. This function computes both the first order derivative of each feature coefficient and the magnitude of the derivative. Higher order derivatives can be obtained through successive calls of the deltaBufferCompute function.

 

    /* frames based delta lpc analysis */
    frames = lpcBufferFrames(p, numsamples);
    VECALLOC(lpc,frames,p->lpc_output);
    lpcBufferCompute(fspeech,numsamples,p,lpc);
    
    VECALLOC(delta, frames, p->lpc_output+1);
    deltaBufferCompute(lpc,frames,delta,dd,1);

Figure 3.5 depicts the magnitude of the first order derivative calculated from the LPC cepstrum obtained earlier in the chapter.

Figure 3.5:   Magnitude of first order derivative, calculated from the LPC cepstrum.

Relative spectral filtering (RASTA)

To compensate for linear channel distortions, the analysis library provides RASTA filtering. The RASTA filter can be used in either the log spectral or the cepstral domain. In effect, the RASTA filter band-pass filters the time trajectory of each feature coefficient. Linear channel distortions appear as an additive constant in both the log spectral and the cepstral domains. The high-pass portion of the equivalent band-pass filter alleviates the effect of convolutional noise introduced in the channel, while the low-pass portion smooths frame-to-frame spectral changes. The rasta functions are used for this purpose. The default RASTA filter parameters are defined using the rastaDefault function, which returns a rastaParamT data structure containing the parameters needed for frame-based RASTA processing.

   

  {
    lpcParamT *lpcparam;
    lpcdataT *p;
    rastaParamT *rastaparam;
    rastafilterT *rf;
    float **lpc;
    int i, j, frames;

    /* create a lpc data object using default parameters */
    lpcparam = lpcDefault();
    p = lpcInit(lpcparam);

    /* create a rasta filter object */
    rastaparam = rastaDefault();
    rastaparam->nfilts = lpcparam->lpc_output;
    rf = rastaInit(rastaparam);

    /* free the parameter structures only after their last use */
    free((char *)lpcparam);
    free((char *)rastaparam);
         .
         .
  }

The rastaBufferCompute function is used to filter the log domain coefficients.

 

    /* frame based rasta lpc analysis */
    frames = lpcBufferFrames(p, numsamples);
    VECALLOC(lpc,frames,p->lpc_output);
    lpcBufferCompute(fspeech,numsamples,p,lpc);
    rastaBufferCompute(lpc, frames, lpc, rf);
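
The band-pass behaviour described in this section can be illustrated with a small filter applied along time to each coefficient. The sketch below uses a commonly cited form of the RASTA filter, with numerator 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) and a single pole at 0.98, in a purely causal arrangement. The function name rasta_filter is hypothetical, and the coefficients actually used by the library's rasta functions are not given in this chapter, so treat the values as assumptions.

```c
/* Hypothetical helper (not part of CSLU-C).
 * Filters each of the dim coefficient trajectories over time.
 * in and out are frames x dim; out may be the same array as in. */
void rasta_filter(float **in, int frames, int dim, float **out)
{
    /* numerator taps sum to zero, so a constant offset (a linear
     * channel distortion in the log domain) is removed */
    static const double num[5] = { 0.2, 0.1, 0.0, -0.1, -0.2 };
    const double pole = 0.98;   /* assumed low-pass pole */
    int d, t, j;

    for (d = 0; d < dim; d++) {
        double x[5] = { 0.0, 0.0, 0.0, 0.0, 0.0 }; /* x[j]: input j frames ago */
        double y_prev = 0.0;

        for (t = 0; t < frames; t++) {
            double y;

            for (j = 4; j > 0; j--)
                x[j] = x[j - 1];
            x[0] = in[t][d];

            y = pole * y_prev;
            for (j = 0; j < 5; j++)
                y += num[j] * x[j];

            out[t][d] = (float)y;
            y_prev = y;
        }
    }
}
```

Feeding a constant trajectory through this filter produces a short start-up transient that decays towards zero, which is the sense in which an additive channel constant is suppressed.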

Energy normalization

One of the problems of working with the energy in a speech signal is that it typically has a great deal of variance. Sources include variation in loudness, variation in the recording, and variation in signal energy between different phonemes. An energy normalization algorithm is used to reduce this variance, so that loud or soft speech recordings do not affect the underlying recognition technology.

The energy coefficient is normalized using an automatic gain control (AGC) filter with a look-ahead buffer of 160 ms. The normalization is performed using a variable gain amplifier in which the gain is controlled by a peak detector on the energy feature. The peak detector has a decay factor of 0.999 and also includes a limiter to prevent excessive gain during silence. Currently the energy normalization introduces an inherent delay of 160 ms within a pipelined recognition process.

The energyNorm function implements the algorithm described above. This routine operates on the 0th coefficient of the input feature object. The energyNormDefault and energyNormInit functions are used to create an enormT data object, which contains the default parameters that govern the energy normalization process.

   

  {
    plpParamT *plpparam;
    plpdataT *p;
    enormParamT *enormparam;
    enormT *en;
    float **plp;
    int i, j, frames;
    
    /* create a plp data object using default parameters */
    plpparam = plpDefault();
    p = plpInit(Cres, plpparam);
    free((char *)plpparam);

    /* create energy normalization data object using default parameters */
    enormparam = energyNormDefault();
    en = energyNormInit(enormparam);
    free((char *)enormparam);
  }

The energyNorm function operates on a buffer of feature vectors, as shown below.

 

    /* frame based plp analysis, with energy normalization */
    frames = plpBufferFrames(p,numsamples);
    VECALLOC(plp,frames,p->plp_output);
    plpBufferCompute(fspeech,numsamples,p,plp);
    energyNorm(plp, frames, plp, en, 1);
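
The idea of a peak detector with decay, a look-ahead buffer and a gain limiter can be sketched as follows. This is a simplified illustration and not the energyNorm algorithm itself: the function name agc_normalize, the use of linear non-negative energies, the look-ahead expressed in frames, and the max_gain limiter parameter are all assumptions. With a 10 ms frame increment, for instance, a 160 ms look-ahead would correspond to 16 frames.

```c
/* Hypothetical helper (not part of CSLU-C).
 * energy:    per-frame energies, assumed linear and non-negative
 * lookahead: number of future frames the peak detector may inspect
 * decay:     per-frame decay of the tracked peak (for example 0.999)
 * max_gain:  limiter, upper bound on the gain applied during silence
 * out:       normalized energies, one per frame */
void agc_normalize(const float *energy, int frames, int lookahead,
                   float decay, float max_gain, float *out)
{
    float peak = 0.0f;
    int t, j;

    for (t = 0; t < frames; t++) {
        int last = (t + lookahead < frames) ? t + lookahead : frames - 1;
        float gain;

        /* the tracked peak decays slowly, then is raised to any
         * larger energy seen in the current look-ahead window */
        peak *= decay;
        for (j = t; j <= last; j++)
            if (energy[j] > peak)
                peak = energy[j];

        /* gain is 1/peak, limited so that silence is not amplified
         * without bound (this also avoids dividing by zero) */
        gain = (peak * max_gain > 1.0f) ? 1.0f / peak : max_gain;
        out[t] = energy[t] * gain;
    }
}
```

Under these assumptions a recording and a ten times louder copy of it normalize to essentially the same values, which is the variance reduction this section describes.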




Johan Schalkwyk
Wed Nov 27 10:08:24 PST 1996