The earliest and most straightforward use of neural networks for speech enhancement is as a direct nonlinear time-domain filter. This is illustrated in Figure 14.1, in which a multi-layer network is used to map a windowed segment of the noisy speech to an estimate of the clean speech. The number of inputs depends on the sampling rate of the speech signal and is typically set to cover 5 to 10 ms of data. The number of outputs is usually equal to the number of inputs. To train the network, clean speech is artificially corrupted to create noisy input data. The clean speech signal is used as a target that is time-aligned with the inputs. Standard backpropagation, or any of a variety of other training methods, can be employed to minimize the mean-squared error (MSE) between the target and the output of the network.
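As a rough illustration of the data preparation described above, the sketch below computes a window length from a sampling rate and slices a noisy signal and its clean counterpart into time-aligned input/target windows. The 8 kHz rate, the sinusoid standing in for speech, the white noise, and all function names are assumptions made for the example, not details of the experiments reported in this chapter.

```python
import numpy as np

def window_length(sample_rate_hz, duration_ms):
    """Number of samples needed to cover duration_ms at sample_rate_hz."""
    return int(round(sample_rate_hz * duration_ms / 1000.0))

def make_training_pairs(clean, noisy, n_window, step=1):
    """Slice time-aligned noisy inputs and clean targets into windows.

    Returns two arrays of shape (n_windows, n_window): row i of the first
    is a noisy input window, row i of the second is the matching target.
    """
    clean = np.asarray(clean, dtype=float)
    noisy = np.asarray(noisy, dtype=float)
    if clean.shape != noisy.shape:
        raise ValueError("clean and noisy signals must be time-aligned")
    starts = np.arange(0, len(clean) - n_window + 1, step)
    idx = starts[:, None] + np.arange(n_window)[None, :]
    return noisy[idx], clean[idx]

def mse(targets, outputs):
    """Mean-squared error, the training criterion named in the text."""
    diff = np.asarray(targets, dtype=float) - np.asarray(outputs, dtype=float)
    return float(np.mean(diff ** 2))

# Hypothetical data: a sinusoid stands in for clean speech (assumption).
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 200 * np.arange(800) / 8000.0)
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

n_window = window_length(8000, 5)          # 5 ms at an assumed 8 kHz rate
inputs, targets = make_training_pairs(clean, noisy, n_window)
baseline = mse(targets, inputs)            # error of doing no filtering at all
```

A trained network would be expected to bring the MSE on held-out windows below `baseline`, which is simply the error obtained by passing the noisy input through unchanged.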
Data is presented by sliding the input window across the noisy speech. At each step, the window is shifted by an increment between one sample and the window length. When the increment equals the window length, the estimation window moves along without overlap. For smaller increments, the resultant overlapping windows provide redundant estimates. In the case of a single-sample increment (which most closely corresponds to traditional filter implementations), the network topology could be simplified to have only a single output. However, it has generally been recognized that using multiple outputs aids the training process. The extra outputs balance the forward and backward flow of signals during training and allow a greater number of shared hidden units to be used, with improved generalization. After training, the estimate can be taken from one of the center-most outputs, discarding the rest.
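The two ways of using the sliding window can be sketched as follows. Here `net` is a hypothetical callable that maps one input window to an output window of the same length; an identity function stands in for a trained network so that the example runs. Averaging the redundant estimates is one plausible way to combine overlapping windows and is an assumption of this sketch, not a procedure stated in the text.

```python
import numpy as np

def enhance_center(noisy, net, n_window):
    """Shift by one sample and keep only the center output of each window."""
    noisy = np.asarray(noisy, dtype=float)
    center = n_window // 2
    estimate = np.zeros_like(noisy)
    for start in range(len(noisy) - n_window + 1):
        out = net(noisy[start:start + n_window])
        estimate[start + center] = out[center]
    # Samples near the two edges never sit at a window center and stay zero.
    return estimate

def enhance_overlap_average(noisy, net, n_window, step):
    """Shift by `step` samples and average the overlapping estimates."""
    noisy = np.asarray(noisy, dtype=float)
    if not 1 <= step <= n_window:
        raise ValueError("step must lie between 1 and the window length")
    total = np.zeros_like(noisy)
    count = np.zeros_like(noisy)
    for start in range(0, len(noisy) - n_window + 1, step):
        total[start:start + n_window] += net(noisy[start:start + n_window])
        count[start:start + n_window] += 1
    covered = count > 0
    total[covered] /= count[covered]
    return total

def identity_net(window):
    """Placeholder for a trained network (assumption for illustration)."""
    return window

ramp = np.arange(20.0)
center_est = enhance_center(ramp, identity_net, n_window=5)
avg_est = enhance_overlap_average(ramp, identity_net, n_window=5, step=2)
```

With a step equal to the window length, `enhance_overlap_average` reduces to the non-overlapping case, since every sample is then covered by exactly one window.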
A sample experimental result using the direct time-domain approach is
shown in Figure 14.2. A neural network with 41 inputs, two
hidden layers of 41 units each, and 41 outputs was used
(41:41:41:41). The network was trained on 35 different TIMIT sentences
from different speakers. To produce the noisy inputs, pink noise was
added to the speech at randomly selected SNRs, chosen uniformly
between 0 and 6 dB. The network was tested using a
TIMIT speaker not in the training set. The shifting increment was
, and the center output was used to generate the
estimates. Figure 14.3 shows the network's
performance on this sentence over a range of initial SNRs, both on
pink noise (characteristic of the training data) and white noise (not
characteristic of the training data). While impressive performance is
achieved for SNRs within the training set range, note the fall-off in
improvement for other SNRs (as well as for white noise).
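For readers who want to reproduce this kind of corruption, the sketch below scales a noise signal so that the mixture has a chosen SNR and measures the SNR improvement of an estimate. It assumes SNR is computed as a whole-signal energy ratio, which may differ from the measure used for the figures. White noise and a sinusoid stand in for pink noise and TIMIT speech, and all function names are hypothetical.

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB as the ratio of total signal energy to total noise energy."""
    signal = np.asarray(signal, dtype=float)
    noise = np.asarray(noise, dtype=float)
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

def mix_at_snr(clean, noise, target_snr_db):
    """Scale `noise` so that clean + gain * noise has the requested SNR."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    ratio = 10.0 ** (target_snr_db / 10.0)
    gain = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * ratio))
    return clean + gain * noise

def snr_improvement_db(clean, noisy, estimate):
    """Output SNR minus input SNR, both measured against the clean signal."""
    clean = np.asarray(clean, dtype=float)
    out_snr = snr_db(clean, np.asarray(estimate, dtype=float) - clean)
    in_snr = snr_db(clean, np.asarray(noisy, dtype=float) - clean)
    return out_snr - in_snr

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 200 * np.arange(800) / 8000.0)  # stand-in for speech
noise = rng.standard_normal(clean.shape)                   # stand-in for pink noise
target_snr = rng.uniform(0.0, 6.0)                         # uniform over 0 to 6 dB
noisy = mix_at_snr(clean, noise, target_snr)

# A pretend estimate that halves the noise amplitude (about 6 dB improvement).
halved = clean + 0.5 * (noisy - clean)
gain_db = snr_improvement_db(clean, noisy, halved)
```

Drawing a fresh `target_snr` for each training sentence mirrors the random SNR selection described above.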
A number of researchers have reported superior results over linear filtering by using methods similar to the one described above [6,7,8,9]. Tamura [9] gives a detailed analysis of the role of the different layers in the networks, suggesting that the hidden layers provide a transformed representation of the signal and noise which facilitates their separation. Use of a neural network also allows for compensation of nonlinear channel effects, and some relaxation of the requirement that the additive noise be independent of the signal. In one application, Le and Mason report results on the method for noise introduced by a low bit-rate CELP encoder [7].
An additional variation on the filtering method results from restricting the number of units in a hidden layer to be fewer than the number of input or output units. This can provide noise suppression through dimensionality reduction, similar to Ephraim's method based on signal subspace embedding [10]. Ephraim's idea is that the clean speech resides in a low-dimensional subspace of the noisy speech space; after first removing the dimensions that contain only noise, enhancement is performed in the subspace. Discussion of four-layer networks used for dimensionality reduction can be found in [11,12] (although results for speech enhancement have not been reported).
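The intuition behind noise suppression by dimensionality reduction can be sketched with a purely linear stand-in: project each noisy window onto a few principal directions and map it back. This is neither Ephraim's estimator nor the bottleneck networks of [11,12]; it is a simplified linear analogue, and the synthetic data, dimensions, and function names are all assumptions made for illustration.

```python
import numpy as np

def fit_subspace(windows, n_dims):
    """Estimate the mean and the top n_dims principal directions of the data."""
    mean = windows.mean(axis=0)
    _, _, vt = np.linalg.svd(windows - mean, full_matrices=False)
    return mean, vt[:n_dims]          # basis rows are orthonormal

def project_onto_subspace(windows, mean, basis):
    """Encode into the low-dimensional subspace, then decode back."""
    coords = (windows - mean) @ basis.T   # plays the role of the narrow layer
    return mean + coords @ basis

# Synthetic "clean" windows confined to a 2-D subspace of an 8-D space.
rng = np.random.default_rng(0)
true_basis = np.linalg.qr(rng.standard_normal((8, 2)))[0].T   # shape (2, 8)
clean_windows = rng.standard_normal((500, 2)) @ true_basis
noisy_windows = clean_windows + 0.1 * rng.standard_normal((500, 8))

mean, basis = fit_subspace(noisy_windows, n_dims=2)
denoised = project_onto_subspace(noisy_windows, mean, basis)

err_before = np.mean((noisy_windows - clean_windows) ** 2)
err_after = np.mean((denoised - clean_windows) ** 2)
```

In this toy setting the projection discards the noise components that lie outside the signal subspace, so `err_after` falls below `err_before`. A network with a narrow hidden layer can learn a nonlinear counterpart of the same encode/decode mapping.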
In still another variation, researchers at Defense Group
Incorporated have implemented a recurrent structure by feeding back
the network filter outputs.
Their system is a hybrid which uses traditional enhancement methods to
preprocess the data before feeding it into the neural
network. Reported results appear favorable in comparisons to a number
of other traditional methods.
The advantage of time-domain filtering is the ease and efficiency of
implementation.
Effectively, the neural network approximates the conditional expectation of the clean speech given the windowed noisy input. Note that this conditional expectation corresponds to a linear estimator only when all signal statistics are Gaussian (clearly unrealistic for speech signals and real-world noise sources). This is a strong motivation for the use of nonlinear neural networks.
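The Gaussian remark can be made concrete with a standard result. The notation is chosen here for illustration only: $x$ denotes the clean speech, $y$ the windowed noisy input, and $\Sigma$ the covariance matrices. Among all functions of $y$, the minimizer of the mean-squared error is the conditional expectation:

$$
\arg\min_{f}\; E\!\left[\,\lVert x - f(y) \rVert^{2}\right] \;=\; E[\,x \mid y\,].
$$

If $x$ and $y$ are zero-mean and jointly Gaussian, this expectation is linear in $y$:

$$
E[\,x \mid y\,] \;=\; \Sigma_{xy}\,\Sigma_{yy}^{-1}\,y .
$$

Under the further assumption of additive noise $n$ independent of the speech, $y = x + n$, so that $\Sigma_{xy} = \Sigma_{xx}$ and $\Sigma_{yy} = \Sigma_{xx} + \Sigma_{nn}$, which gives the familiar Wiener form $\Sigma_{xx}\,(\Sigma_{xx} + \Sigma_{nn})^{-1}\,y$. When the statistics are not Gaussian, $E[\,x \mid y\,]$ is in general a nonlinear function of $y$, which a linear filter cannot represent.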
Once trained, a single fixed neural network is used to provide speech estimates. However, this also underscores the disadvantages of the approach. Using a fixed network implies a single, fixed expectation inferred from the entire training set. The corresponding conditional probability density can be written as
[Equation (3)]
In general, the direct time-domain filtering approach is most applicable for reducing fixed noise types, or for compensating a distortion that is associated with a specific recording or communication channel. For example, the latter case was considered by Dahl and Claesson [6], who trained a neural network on a specific speaker and noise environment for a car-phone application.