While the Fourier transform domain is computationally convenient, other domains such as the log power spectral, cepstral, and LPC domains are often more desirable to work in. For example, the log spectral domain is thought to offer a more perceptually relevant measure of distortion than the spectral domain, and it also allows homomorphic filtering to better compensate for channel distortion. If the goal of the enhancement is to improve perceptual quality, then there are clear advantages to minimizing distortion in a perceptually relevant domain. Furthermore, if the enhancement serves as a front end to an ASR system, the transform domain chosen for recognition usually dictates the domain in which speech enhancement occurs.
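The homomorphic filtering argument can be illustrated with a small numerical sketch. It assumes a fixed linear channel that multiplies the magnitude spectrum of every frame, so that the channel becomes an additive constant in the log spectral domain and can be removed by subtracting the per-bin mean over frames. The arrays `clean_mag` and `channel_mag` and the helper `remove_log_mean` are hypothetical names introduced only for this illustration, and the random spectra stand in for real speech.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_bins = 50, 8

# Hypothetical clean magnitude spectra (frames x bins) and a fixed channel response.
clean_mag = rng.uniform(0.5, 2.0, size=(n_frames, n_bins))
channel_mag = rng.uniform(0.2, 1.5, size=n_bins)

# A linear channel multiplies the spectrum; after the log it is an additive offset.
observed_log = np.log(clean_mag * channel_mag)


def remove_log_mean(log_spec):
    """Subtract the per-bin mean over frames (a simple homomorphic compensation)."""
    return log_spec - log_spec.mean(axis=0, keepdims=True)


# The channel offset is constant over frames, so mean removal cancels it.
compensated = remove_log_mean(observed_log)
reference = remove_log_mean(np.log(clean_mag))
print(np.allclose(compensated, reference))
```

Note that mean removal also discards the long-term average of the clean log spectrum, so this sketch shows only why the log domain makes channel compensation simple, not a complete compensation scheme.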
Unfortunately, removing noise in these alternative transform domains can no longer be done with a simple linear subtraction of the noise from the speech. Because the transformed noise and speech are combined in a nonlinear way (e.g., through a log function), a nonlinear form of "subtraction" is required. The necessary nonlinear function depends on the transform domain and can be approximated by a neural network. This neural network "subtraction", or transform-domain mapping, is illustrated in Figure 14.5. As in the time-domain approach, a representative training set of clean speech and corresponding noisy speech is used to train the network. However, one advantage of the transform domain is that time alignment of the inputs and targets is not as critical as with a time-domain mapping. This allows training data to be generated in more realistic settings using less controlled recording devices.
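To make the nonlinearity concrete, the following sketch works in the log power spectral domain under the idealized assumption that the speech and noise power spectra add exactly (uncorrelated signals, cross terms ignored). Under that assumption the required "subtraction" has a closed form, and a plain difference of log spectra gives the wrong answer. The function names `log_add` and `log_subtract` are illustrative only; a trained network would approximate a mapping of this kind when the exact relation is unknown or the noise estimate is imperfect.

```python
import numpy as np


def log_add(x_log, n_log):
    """Log power of the sum, assuming the powers add: log(exp(x) + exp(n))."""
    return np.logaddexp(x_log, n_log)


def log_subtract(y_log, n_log):
    """Invert log_add: x = y + log(1 - exp(n - y)). Valid only where n < y."""
    return y_log + np.log1p(-np.exp(n_log - y_log))


# Three frequency bins with clean powers 4, 1, 0.25 and unit noise power.
x_log = np.log(np.array([4.0, 1.0, 0.25]))
n_log = np.log(np.array([1.0, 1.0, 1.0]))

y_log = log_add(x_log, n_log)            # noisy log power spectrum
x_rec = log_subtract(y_log, n_log)       # nonlinear "subtraction"
naive = y_log - n_log                    # linear subtraction in the log domain

print(np.allclose(x_rec, x_log))   # the nonlinear form recovers the clean values
print(np.allclose(naive, x_log))   # the linear difference does not
```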
Although neural network "subtraction" allows for the use of nonlinear transforms, this flexibility comes at considerable cost. In classic spectral subtraction, the "mapping" is simply a subtraction and is functionally independent of the noise level and noise spectrum. This is precisely due to the linearity of the Fourier transform. If another transform is used, however, the neural network must be able to provide different mappings for different noise types and levels. This can be attempted by giving the network additional inputs that encode estimates of the SNR and/or the noise distribution in some way. Clearly, this requires that the training set include a representative sample of the noise levels and distributions that the final system is expected to encounter.
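The dependence of the mapping on the noise level, and one way of supplying a noise estimate as extra network input, can be sketched as follows. The example again assumes additive power spectra in the log power spectral domain. The helpers `make_input` and `mlp_forward`, the layer sizes, and the random (untrained) weights are all hypothetical; they show only the shape of such a network, not a trained enhancer.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_bins, n_hidden = 4, 16, 32


def make_input(noisy_log, noise_log_estimate):
    """Concatenate noisy features with a noise estimate so the mapping can adapt."""
    return np.concatenate([noisy_log, noise_log_estimate], axis=-1)


def mlp_forward(inputs, w1, b1, w2, b2):
    """One-hidden-layer network: tanh hidden units, linear outputs."""
    hidden = np.tanh(inputs @ w1 + b1)
    return hidden @ w2 + b2


# Untrained random weights, used only to illustrate the input and output shapes.
w1 = rng.normal(scale=0.1, size=(2 * n_bins, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_bins))
b2 = np.zeros(n_bins)

clean_log = rng.normal(size=(n_frames, n_bins))
noise_low = np.full((n_frames, n_bins), np.log(0.1))
noise_high = np.full((n_frames, n_bins), np.log(10.0))

# The correction the network must apply differs with the noise level.
correction_low = np.logaddexp(clean_log, noise_low) - clean_log
correction_high = np.logaddexp(clean_log, noise_high) - clean_log
print(np.all(correction_high > correction_low))

# Forward pass with the noise estimate appended to the noisy features.
noisy_high = np.logaddexp(clean_log, noise_high)
net_input = make_input(noisy_high, noise_high)
estimate = mlp_forward(net_input, w1, b1, w2, b2)
print(net_input.shape, estimate.shape)
```

Training such a network would require noisy and clean pairs covering the range of noise levels expected in use, as the text notes.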
During operation, errors in the noise spectrum estimates will degrade performance. The problems associated with estimating the noise statistics are fundamentally the same as in classic spectral subtraction. However, artifacts such as musical noise may be less severe due to properties of the chosen transform domain. For example, the variance of the short-term spectral estimate obtained from an LPC analysis will be lower than that obtained from a direct DFT approach. Also, the nonlinear mapping is less likely to produce values that require the kind of strict truncation used in linear spectral subtraction. Finally, the use of a nonlinear mapping allows a fair amount of freedom in the choice of transform domain, which can therefore be chosen for its inherent robustness and perceptual qualities. The associated enhancement technique might then be less affected by variations in the noise spectral estimate.
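For comparison, the truncation used in linear spectral subtraction can be sketched in a few lines. The numbers below are made up for illustration: the per-frame noise power fluctuates around a long-term average of 1.0, and the function name `spectral_subtract` and its `floor` parameter are hypothetical. Subtracting the average leaves a negative value in one bin, which must be truncated, and an isolated residual peak in another bin where no speech is present, which is the kind of artifact perceived as musical noise.

```python
import numpy as np


def spectral_subtract(noisy_power, noise_power_estimate, floor=0.0):
    """Power spectral subtraction with truncation of values below the floor."""
    return np.maximum(noisy_power - noise_power_estimate, floor)


# Illustrative values: speech is present only in the first two bins.
clean_power = np.array([2.0, 0.5, 0.0, 0.0])
noise_power = np.array([0.4, 1.6, 0.3, 2.1])   # short-term noise fluctuates
noisy_power = clean_power + noise_power
noise_estimate = np.full(4, 1.0)               # long-term average noise power

raw = noisy_power - noise_estimate             # can go negative
enhanced = spectral_subtract(noisy_power, noise_estimate)

print(raw)        # bin 2 is negative before truncation
print(enhanced)   # bin 2 is clipped to the floor; bin 3 keeps a residual peak
```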
Like the noise source, the distribution of the speech signal also affects the neural network mapping. Training a single network averages across all speech signals and effectively assumes stationarity of speech (both within a speech signal and between different speakers). This assumption is the same as in the direct time-domain approaches. However, it is not fully understood whether this assumption is more severe in the time domain than in a transform domain. Approaches that skirt this issue through the use of multiple networks will be discussed in the next sections.
A number of researchers have performed preliminary investigations based on neural transform domain mappings. We summarize some of this work here.
| F16 Noise (SNR) | -6 dB | 0 dB | 6 dB | 12 dB | 18 dB |
| Car Noise (SNR) | -6 dB | 0 dB | 6 dB | 12 dB | 18 dB |
| F16 Noise (SNR) | 0 dB | 3 dB | 9 dB | 15 dB | 21 dB | dB |