

Maximum-likelihood Estimation and Dual Kalman Filtering

A better-motivated approach is to find the maximum-likelihood estimates of the speech and the model parameters given the noisy data. However, even for linear models, this is a difficult nonlinear optimization problem. Lim and Oppenheim [16] proposed an approximate maximum a posteriori solution that iterates between Wiener filtering the signal and fitting an LPC model by least squares. Since then, a number of researchers have proposed variations on this method, including the use of the Expectation-Maximization (EM) algorithm [44,45,46,47], the accommodation of colored noise [48], and perceptual constraints on the iterative search [49].

Wan and Nelson [17] have proposed a related approach that uses neural autoregressive models. The speech model is the same nonlinear autoregression as in Equations 4 and 5 of Section 14.2.2. However, to avoid using a single model to describe the entire nonstationary speech signal (or requiring the complexity of model-switching methods), the speech is windowed into approximately stationary segments, with a different model used for each segment. With a state-space representation of the speech model, the EKF method discussed in Section 14.4.2 gives the maximum-likelihood estimate of the speech assuming the model is known. However, as no clean data set is used, the model parameters themselves must now be learned on-line from the noisy data for each window of speech. To allow the simultaneous estimation of the speech model and the speech signal, a separate set of state equations for the parameters of the neural network (weight vector $\bw$) is formulated:
$$ \bw_k = \bw_{k-1} + \alpha_k \tag{17} $$
$$ y_k = f(\bx_{k-1},\bw_k) + v_k + n_k, \tag{18} $$

where the state transition is simply an identity matrix and the covariance of $\alpha_k$ is selected to improve convergence. The neural network $f(\bx_{k-1},\bw_k)$ plays the role of a time-varying nonlinear observation on $\bw$. An EKF can now be written to compute the maximum-likelihood estimate of the model, assuming the state $\bx$ is known. The use of the EKF for weight estimation can also be related to Recursive Least Squares (RLS), where the covariance of $\alpha_k$ plays the role of the "forgetting factor" in RLS [50]. Hence, the weight EKF is an efficient second-order on-line optimization method.
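The weight update described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in [17] or [51]: the names `tiny_net`, `num_grad`, and `weight_ekf_step` are hypothetical, the network is a toy with one hidden unit, the observation is scalar, and the linearization uses central finite differences where a real implementation would more likely use backpropagated derivatives. The parameter `r` stands for the combined variance of $v_k + n_k$ in Equation 18, and `q` for the variance of $\alpha_k$ in Equation 17.

```python
import math

def tiny_net(x, w):
    """Hypothetical 2-input network with one tanh hidden unit (illustration only)."""
    return w[2] * math.tanh(w[0] * x[0] + w[1] * x[1])

def num_grad(f, x, w, eps=1e-6):
    """Central-difference gradient of the scalar f(x, w) with respect to w."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((f(x, w_plus) - f(x, w_minus)) / (2.0 * eps))
    return grad

def weight_ekf_step(f, w, P, x_prev, y, r, q):
    """One EKF update of the weight vector w from a scalar observation y.

    P is the weight covariance (list of lists), r the observation-noise
    variance, and q the variance of alpha_k added to the diagonal of P.
    """
    n = len(w)
    # Time update: identity state transition, so only the covariance grows.
    P = [[P[i][j] + (q if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Linearize the observation f(x_prev, w) around the current weights.
    h = num_grad(f, x_prev, w)
    Ph = [sum(P[i][j] * h[j] for j in range(n)) for i in range(n)]
    s = sum(h[i] * Ph[i] for i in range(n)) + r      # innovation variance
    K = [v / s for v in Ph]                          # Kalman gain
    err = y - f(x_prev, w)                           # prediction error
    w_new = [w[i] + K[i] * err for i in range(n)]
    # P is symmetric, so h^T P equals (P h)^T.
    P_new = [[P[i][j] - K[i] * Ph[j] for j in range(n)] for i in range(n)]
    return w_new, P_new

# Usage: adapt the toy network toward targets produced by fixed "true" weights.
w_true = [0.8, -0.5, 1.2]
w = [0.1, 0.1, 0.1]
P = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
for k in range(500):
    x = [math.sin(0.3 * k), math.cos(0.7 * k)]
    w, P = weight_ekf_step(tiny_net, w, P, x, tiny_net(x, w_true), r=0.01, q=1e-6)
print("adapted weights:", [round(v, 3) for v in w])
```

When `f` is linear in the weights, this update has the form of a regularized RLS recursion, which is the connection to RLS noted above; a larger `q` keeps the gain from decaying and so discounts older data more strongly.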

Figure 14.9: The Dual Extended Kalman Filter (Dual EKF). EKFx and EKFw represent the filters for the states and the weights, respectively.

Figure 14.10: Cleaning Noisy Speech With The Dual EKF. Nonstationary white noise was generated artificially and added to the speech to create the noisy signal $y$. The SNR improvement is 9.94 dB.

This weight EKF can be run in parallel with the EKF for state estimation, resulting in the Dual Extended Kalman Filter (Dual EKF) [51], shown in Figure 14.9. At each time step, the current estimate of $\bx$ is used by the weight filter, and the current estimate of $\bw$ is used by the state filter. This provides an effective method for computing maximum-likelihood estimates of the speech signal given only the noisy signal. Additional issues related to recurrent training, error coupling, the relationship of the algorithm to EM, as well as a two-observation form of the weight EKF are discussed in [51,52].
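The coupling between the two filters can be illustrated on a deliberately simplified problem. The sketch below is an assumption-laden toy, not the Dual EKF of [51]: the neural network is replaced by a scalar linear model $x_k = w\,x_{k-1} + v_k$ observed as $y_k = x_k + n_k$, so both filters reduce to scalar Kalman updates. The function names, initial values, and the choice to feed each filter the other's estimate from the previous time step are all hypothetical choices made for the illustration.

```python
import math
import random

def simulate_ar1(n, w, var_v, var_n, seed=0):
    """Clean AR(1) signal x and noisy observation y = x + white noise."""
    rng = random.Random(seed)
    x, xs, ys = 0.0, [], []
    for _ in range(n):
        x = w * x + rng.gauss(0.0, math.sqrt(var_v))
        xs.append(x)
        ys.append(x + rng.gauss(0.0, math.sqrt(var_n)))
    return xs, ys

def dual_kf_ar1(ys, var_v, var_n, w0=0.5, p_w0=0.1, var_alpha=1e-6):
    """Run a state filter and a weight filter side by side on the noisy data."""
    x_hat, p_x = 0.0, 1.0        # state estimate and its variance
    w_hat, p_w = w0, p_w0        # weight estimate and its variance
    x_out, w_out = [], []
    for y in ys:
        x_prev, p_prev, w_prev = x_hat, p_x, w_hat

        # Weight filter: y_k = w * x_{k-1} + v_k + n_k, with x_{k-1}
        # replaced by the state filter's current estimate.
        p_w += var_alpha                      # random-walk time update (Eq. 17)
        h = x_prev                            # derivative of w * x_prev w.r.t. w
        # Innovation variance; the middle term accounts for uncertainty in x_prev.
        s_w = h * h * p_w + w_prev * w_prev * p_prev + var_v + var_n
        k_w = p_w * h / s_w
        w_hat = w_prev + k_w * (y - w_prev * x_prev)
        p_w -= k_w * h * p_w

        # State filter: x_k = w * x_{k-1} + v_k, y_k = x_k + n_k, with w
        # replaced by the weight filter's current estimate.
        x_pred = w_prev * x_prev
        p_pred = w_prev * w_prev * p_prev + var_v
        k_x = p_pred / (p_pred + var_n)
        x_hat = x_pred + k_x * (y - x_pred)
        p_x = (1.0 - k_x) * p_pred

        x_out.append(x_hat)
        w_out.append(w_hat)
    return x_out, w_out

def mse(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

xs, ys = simulate_ar1(4000, w=0.9, var_v=0.19, var_n=1.0)
x_hat, w_hat = dual_kf_ar1(ys, var_v=0.19, var_n=1.0)
print("final weight estimate:", round(w_hat[-1], 3))
print("MSE noisy:", round(mse(ys, xs), 3), "MSE filtered:", round(mse(x_hat, xs), 3))
```

Neither filter ever sees the clean signal `xs`; it is used only afterwards to score the result. In the nonlinear case the two scalar updates become the state EKF and the weight EKF, with the network and its derivatives taking the place of `w * x_prev`.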

The result of applying the Dual EKF to a speech signal corrupted with simulated nonstationary bursting noise is shown in Figure 14.10. The method was applied to successive 64 ms (512-point) windows of the signal, with a new window starting every 8 ms (64 points). A normalized Hamming window was used to emphasize data in the center of the window and de-emphasize data in the periphery. Feedforward networks with 10 inputs, 4 hidden units, and 1 output were used. Weights typically converged in fewer than 20 epochs. The results in the figure were computed assuming both $\sigma^2_v$ and $\sigma^2_n$ were known. The average SNR is improved by 9.94 dB, with little resultant distortion. When $\sigma^2_n$ and $\sigma^2_v$ are estimated using only the noisy signal, similar results are achieved, with an SNR improvement of 8.50 dB. In comparison, the classical techniques of spectral subtraction [19] and adaptive RASTA processing [53] achieve SNR improvements of only 0.65 and 1.26 dB, respectively. Experiments in which real-world colored noise is added to the signal have also been performed. An advantage of the Kalman framework is that colored noise can be elegantly addressed by incorporating an additional state-space representation of the noise process. This modification affects both the state-estimation and the weight-estimation equations.
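The windowing scheme can be sketched as follows. The text gives the window length (512 points) and hop (64 points) but not how the per-window estimates were recombined, so the Hamming-weighted average below is one plausible reading of "normalized" and should be treated as an assumption; `hamming`, `split_windows`, and `merge_windows` are hypothetical helper names.

```python
import math

def hamming(n):
    """Standard n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def split_windows(signal, length=512, hop=64):
    """Return (start, segment) pairs for successive overlapping windows."""
    return [(start, signal[start:start + length])
            for start in range(0, len(signal) - length + 1, hop)]

def merge_windows(segments, total_len, length=512):
    """Combine per-window estimates by a Hamming-weighted average.

    Samples near the center of a window receive more weight than samples
    near its edges; dividing by the summed weights normalizes the overlap.
    """
    win = hamming(length)
    acc = [0.0] * total_len
    weight_sum = [0.0] * total_len
    for start, seg in segments:
        for i, value in enumerate(seg):
            acc[start + i] += win[i] * value
            weight_sum[start + i] += win[i]
    return [a / s if s > 0.0 else 0.0 for a, s in zip(acc, weight_sum)]

# Usage: each segment would normally be replaced by its cleaned estimate
# before merging; here the segments are passed through unchanged.
signal = [math.sin(0.01 * i) for i in range(2048)]
segments = split_windows(signal)
merged = merge_windows(segments, len(signal))
```

With a 64-point hop, each interior sample is covered by eight overlapping windows, so every output sample is a weighted combination of up to eight separate estimates.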

In principle, this method can accommodate any speaker, noise, or noise level encountered. In this sense, it is more in the spirit of spectral subtraction, which works independently of the type of signal it is estimating. However, like spectral subtraction, the Dual EKF algorithm requires estimation of noise statistics.

While the approach does away with the need for a training set, there is considerable computational cost in training the neural networks on-line. Furthermore, the windowing of the data, which addresses the nonstationarity issue, also limits the size of the network structures that can be used. Although compact models are sufficient for small windows of speech (e.g., vocoder technology), this raises the question of whether the approach fully utilizes the flexibility of neural modeling.

A possible direction of research that addresses some of these issues is an intermediate approach that makes some use of pre-trained models. This would be a state-dependent approach that selects among pre-trained class-based models using an HMM (see Section 14.4.2), and then adapts the selected model on-line to the noisy data. This could produce faster convergence, avoid the need to explicitly window the data, and allow larger networks to be used.

