In this chapter we have provided an overview of several neural network approaches to speech enhancement. We can summarize the techniques and their assumptions as follows:
The extended Kalman filtering (EKF) approach uses a predictive neural model, trained on clean speech, in a state-space framework to produce approximate maximum-likelihood estimates of the speech. The assumptions are that the signals are stationary and that the statistics of the speech to be enhanced match those of the training set. The speech signal is assumed to be well modeled by an autoregressive process driven by white Gaussian noise, and the additive noise is also assumed to be Gaussian (though possibly colored); estimates of both the process and additive noise variances are assumed available. Compensation for channel effects is possible only when a model of the channel is available.
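To make the state-space formulation concrete, the following minimal NumPy sketch shows one possible realization of the filtering loop. The predictor f and its gradient jac_f are hypothetical stand-ins for the trained neural model, and q_var and r_var correspond to the assumed-available process and additive noise variances; this illustrates the idea, not the chapter's exact implementation.

```python
import numpy as np

def ekf_enhance(y, f, jac_f, M, q_var, r_var):
    """Sketch of EKF speech enhancement with a predictor f (trained on
    clean speech) and its gradient jac_f -- both hypothetical stand-ins.
    The state x holds the last M speech samples; y is the noisy signal.
    q_var (process noise) and r_var (additive noise) are assumed known."""
    x = np.zeros(M)                     # state estimate [s_t, ..., s_{t-M+1}]
    P = np.eye(M)                       # state error covariance
    H = np.zeros(M); H[0] = 1.0         # only the newest sample is observed
    s_hat = np.empty(len(y))
    for t, y_t in enumerate(y):
        # Time update: predict the next sample with the neural model.
        A = np.eye(M, k=-1)             # shift matrix for delayed samples
        A[0, :] = jac_f(x)              # linearize f around current estimate
        x_pred = np.concatenate(([f(x)], x[:-1]))
        P = A @ P @ A.T
        P[0, 0] += q_var
        # Measurement update: fold in the noisy observation.
        K = P @ H / (H @ P @ H + r_var)          # Kalman gain
        x = x_pred + K * (y_t - x_pred[0])
        P = P - np.outer(K, H) @ P
        s_hat[t] = x[0]
    return s_hat
```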
The simplest approach to training a speech enhancer on-line is to build an adaptive predictor of the speech. However, model-order constraints must be introduced to prevent the predictor from modeling the noise as well; the underlying assumption is that the correlation length of the noise is shorter than that of the speech signal.
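A minimal sketch of this idea, using the classic adaptive-line-enhancer structure as an assumed concrete form: the filter input is delayed so that, if the noise decorrelates within `delay` samples while the speech does not, the predictor output can track only the speech. All parameter values are illustrative.

```python
import numpy as np

def adaptive_line_enhancer(y, order=16, delay=5, mu=0.05):
    """On-line adaptive predictor sketch.  Predictions are formed from
    samples at least `delay` steps in the past, so short-correlation
    noise cannot be predicted and the output estimates the speech."""
    w = np.zeros(order)
    s_hat = np.zeros(len(y))
    for t in range(delay + order - 1, len(y)):
        u = y[t - delay - order + 1 : t - delay + 1][::-1]  # delayed taps
        s_hat[t] = w @ u                  # prediction = speech estimate
        e = y[t] - s_hat[t]               # prediction error
        w += mu * e * u / (u @ u + 1e-8)  # normalized LMS update
    return s_hat
```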
The next approach uses the EKF speech estimator in conjunction with a second EKF that estimates the model parameters. Running these two estimators in parallel over short windows of the noisy speech yields the Dual EKF algorithm, which provides efficient approximate maximum-likelihood estimates of both the speech and the model parameters. The assumptions are the same as in the basic EKF approach, except that stationarity is assumed only over the short-term windows.
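The coupled structure can be illustrated with a toy linear AR(M) version, in which both estimators reduce to ordinary Kalman updates: one filter tracks the signal state using the current coefficient estimates, while the second treats the coefficients as a random walk and updates them from the same observations. The noise variances here are illustrative placeholders, not values from the chapter.

```python
import numpy as np

def dual_ekf(y, M=8, q_x=1e-2, q_w=1e-5, r=1e-1):
    """Toy sketch of the Dual (E)KF idea with a linear AR(M) speech
    model.  One filter estimates the signal state, the other the AR
    coefficients; each uses the other's latest estimate."""
    x = np.zeros(M); Px = np.eye(M)        # signal state and covariance
    w = np.zeros(M); Pw = np.eye(M)        # AR coefficients and covariance
    Hx = np.zeros(M); Hx[0] = 1.0          # only the newest sample is observed
    s_hat = np.empty(len(y))
    for t, y_t in enumerate(y):
        x_prev = x.copy()
        # Signal filter: predict with the current coefficient estimate.
        A = np.eye(M, k=-1); A[0, :] = w
        x_pred = A @ x
        Px = A @ Px @ A.T; Px[0, 0] += q_x
        Kx = Px @ Hx / (Hx @ Px @ Hx + r)
        x = x_pred + Kx * (y_t - x_pred[0])
        Px = Px - np.outer(Kx, Hx) @ Px
        # Weight filter: random-walk model for the coefficients.
        Pw = Pw + q_w * np.eye(M)
        Hw = x_prev                         # d(prediction)/d(w)
        Kw = Pw @ Hw / (Hw @ Pw @ Hw + r)
        w = w + Kw * (y_t - w @ x_prev)
        Pw = Pw - np.outer(Kw, Hw) @ Pw
        s_hat[t] = x[0]
    return s_hat, w
```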
The last approach, referred to as Noise-Regularized Adaptive Filtering, utilizes a novel expansion of the MSE cost function to allow direct time-domain filters to be trained on-line; in the linear case, this approach is closely related to a time-domain implementation of spectral subtraction and signal subspace embedding. This approach does not assume a model of speech production. However, it is assumed that a second-order approximation to the necessary noise-regularization term is sufficient to allow correct MSE training of the network. The other assumptions are the same as in the Dual EKF approach.
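The flavor of the cost-function expansion can be seen in the linear, white-noise special case. Using the noisy sample y_t as its own training target, the expected MSE differs from the true cost E[(s_t - h'u)^2] by sigma^2 * (1 - 2*h[0]) when the noise is white with known variance sigma^2 and uncorrelated with the speech; correcting the gradient for this term makes on-line LMS converge to the Wiener solution for the clean target. The sketch below illustrates only this special case, not the chapter's nonlinear second-order formulation.

```python
import numpy as np

def noise_compensated_lms(y, order=32, mu=1e-3, noise_var=0.01):
    """Linear special-case sketch of training a direct time-domain
    filter on noisy data only: the noisy sample serves as its own
    target, and the gradient is corrected by the known noise variance
    so that the filter converges to the clean-target Wiener solution."""
    h = np.zeros(order)
    e0 = np.zeros(order); e0[0] = 1.0        # unit vector selecting h[0]
    s_hat = np.zeros(len(y))
    for t in range(order, len(y)):
        u = y[t - order + 1 : t + 1][::-1]   # current and past noisy samples
        s_hat[t] = h @ u
        e = y[t] - s_hat[t]                  # error against the noisy target
        h += mu * (e * u - noise_var * e0)   # gradient with noise correction
    return s_hat
```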
While considerable progress has been made with these techniques, a number of key areas must still be addressed before we can expect widespread acceptance. Most important is the establishment of consistent evaluations to allow proper benchmarking of the different approaches. Standardized databases should be used, with a variety of noise sources that include real-world examples and go beyond the simple white Gaussian noise assumption. Performance should be determined using established metrics (improvement in SNR, segmental SNR, Itakura distance, weighted spectral slope measures, mean opinion scores, recognition accuracy, etc.). In addition, the basic techniques presented here must evolve to better incorporate perceptually relevant metrics for optimization; this is an area where neural network research still lags considerably behind the traditional speech-processing community.
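As one example of these metrics, segmental SNR averages per-frame SNRs rather than computing a single global ratio, which correlates better with perceived quality. A minimal sketch follows; the frame length and the customary -10 to 35 dB clamping range are illustrative choices, not values from the text.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame=256, limits=(-10.0, 35.0)):
    """Segmental SNR in dB: the mean of per-frame SNRs, with each
    frame clamped to a perceptually motivated range."""
    n = (len(clean) // frame) * frame
    s = clean[:n].reshape(-1, frame)                   # clean frames
    e = (clean[:n] - enhanced[:n]).reshape(-1, frame)  # error frames
    snr = 10 * np.log10((s ** 2).sum(1) / ((e ** 2).sum(1) + 1e-12))
    return float(np.mean(np.clip(snr, *limits)))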
Finally, the accurate estimation of the corrupting noise statistics remains a weak link in those algorithms that require these estimates as inputs. Research must be conducted to improve these estimates, or new techniques must be developed that avoid the need for explicit knowledge of the noise statistics.
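For illustration, one common heuristic estimates the noise variance from the lowest-energy frames of the noisy signal, on the assumption that they contain little or no speech. This is a sketch of that workaround, not a method proposed in the chapter; the frame length and quantile are illustrative.

```python
import numpy as np

def estimate_noise_var(y, frame=256, quantile=0.1):
    """Crude noise-variance estimate: average the energy of the
    lowest-energy frames, assumed to be (nearly) speech-free."""
    n = (len(y) // frame) * frame
    energies = (y[:n].reshape(-1, frame) ** 2).mean(axis=1)
    k = max(1, int(quantile * len(energies)))
    return float(np.sort(energies)[:k].mean())
```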
In spite of the decades of work that have gone into understanding speech signals and speech enhancement, the seemingly simple task of removing noise remains a formidable challenge. While it is still too early to draw definitive conclusions, neural networks appear to offer an appropriate and powerful tool for further progress on this problem.