

Direct Time-Domain Mapping


The earliest and most straightforward use of neural networks for speech enhancement is as a direct nonlinear time-domain filter. This is illustrated in Figure 14.1, in which a multi-layer network is used to map a windowed segment of the noisy speech to an estimate of the clean speech. The number of inputs depends on the sampling rate of the speech signal and is typically set to cover 5 to 10 ms of data. The number of outputs is usually equal to the number of inputs. To train the network, clean speech is artificially corrupted to create noisy input data. The clean speech signal is used as a target that is time-aligned with the inputs. Standard backpropagation, or any of a variety of other training methods, can be employed to minimize the mean-squared error (MSE) between the target and the output of the network.
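The mapping and its training criterion can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the `tanh` hidden units, the linear output layer, the weight initialization, the small window length `M = 8`, and the Gaussian white noise are all assumptions made for the example, and the helper names (`init_layer`, `forward`, `mse`) are hypothetical.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Return (weights, biases); weights[j] holds the input weights of unit j."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layers, x):
    """Map one noisy window x to an estimate of the clean window."""
    h = list(x)
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        # Assumption of this sketch: tanh hidden units, linear output units.
        h = z if i == len(layers) - 1 else [math.tanh(v) for v in z]
    return h

def mse(target, output):
    """Mean-squared error between the clean target and the network output."""
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)

rng = random.Random(0)
M = 8  # window length in samples (illustrative only)
layers = [init_layer(M, M, rng), init_layer(M, M, rng)]  # one hidden layer
clean = [math.sin(0.3 * n) for n in range(M)]            # time-aligned target
noisy = [c + rng.gauss(0.0, 0.1) for c in clean]         # corrupted input
estimate = forward(layers, noisy)
loss = mse(clean, estimate)  # the quantity a training method would minimize
```

Training would adjust the weights in `layers` to reduce `loss` averaged over many such window pairs; the update rule itself is omitted here.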

Figure 14.1: Illustration of a neural network filter which maps an input data vector to an output vector.

Data is presented by sliding the input window across the noisy speech. At each step, the window is shifted by an increment, $L$, between $1$ and the window length $M$. When $L=M$, the estimation window moves along without overlap. For increments $L < M$, the resultant overlapping windows provide redundant estimates.
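The windowing scheme can be sketched as follows. The text does not say how the redundant estimates are combined; averaging them per sample is one plausible choice and is an assumption of this sketch, as are the helper names `window_starts` and `overlap_average`. An identity mapping stands in for the trained network.

```python
def window_starts(n_samples, M, L):
    """Start indices of length-M windows shifted by an increment L (1 <= L <= M)."""
    if not 1 <= L <= M:
        raise ValueError("increment L must satisfy 1 <= L <= M")
    return list(range(0, n_samples - M + 1, L))

def overlap_average(window_estimates, starts, n_samples):
    """Average all window estimates that cover each sample."""
    total = [0.0] * n_samples
    count = [0] * n_samples
    for start, est in zip(starts, window_estimates):
        for i, value in enumerate(est):
            total[start + i] += value
            count[start + i] += 1
    # Samples not covered by any window are left at 0.0.
    return [t / c if c else 0.0 for t, c in zip(total, count)]

noisy = [float(n) for n in range(10)]
M, L = 4, 2
starts = window_starts(len(noisy), M, L)
# Identity mapping used as a stand-in for the trained network.
estimates = [noisy[s:s + M] for s in starts]
combined = overlap_average(estimates, starts, len(noisy))
```

With $L=M$ every sample is covered exactly once and the averaging step has no effect.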

In the case of a single time step increment, $L=1$ (which most closely corresponds to traditional filter implementations), the network topology could be simplified to have only a single output. However, it has generally been recognized that using multiple outputs aids in the training process. The extra outputs balance the forward and backward flow of signals during training, and allow for a greater number of shared hidden units to be used with improved generalization. After training, the estimate can be taken from one of the center-most outputs, discarding the rest.
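A sketch of this center-output scheme is given below. The function name `filter_center_output` and the stand-in `toy_net` are hypothetical; `net` is assumed to be any callable that maps a length-$M$ window to $M$ outputs. Leaving the edge samples unfiltered is a choice made for the example, not something specified in the text.

```python
def filter_center_output(net, noisy, M):
    """Slide a length-M window one sample at a time; keep only the center output."""
    center = M // 2
    estimate = list(noisy)  # edge samples without a full window stay unfiltered
    for start in range(len(noisy) - M + 1):
        outputs = net(noisy[start:start + M])
        estimate[start + center] = outputs[center]  # discard the other outputs
    return estimate

def toy_net(window):
    """Stand-in for a trained network: simply halves every input sample."""
    return [0.5 * v for v in window]

noisy = [2.0] * 7
result = filter_center_output(toy_net, noisy, M=5)
```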

Figure 14.2: Experimental test set results using a direct time-domain neural filter. Original SNR was 3.0 dB with pink noise. Improvement was 7.04 dB.

Figure 14.3: SNR improvement of a direct time-domain neural filter trained on 35 different sentences mixed with pink noise. During training, SNRs were varied between 0 and 6 dB.

A sample experimental result using the direct time-domain approach is shown in Figure 14.2. A neural network with 41 inputs, two hidden layers of 41 units each, and 41 outputs was used (41:41:41:41). The network was trained on 35 different TIMIT sentences from different speakers. To produce the noisy inputs, pink noise was added to the speech at randomly selected SNRs, chosen uniformly between 0 and 6 dB. The network was tested using a TIMIT speaker not in the training set. The shifting increment was $L=1$, and the center output was used to generate the estimates. Figure 14.3 shows the network's performance on this sentence over a range of initial SNRs, both on pink noise (characteristic of the training data) and white noise (not characteristic of the training data). While impressive performance is achieved for SNRs within the training set range, note the fall-off in improvement for other SNRs (as well as for white noise).
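The data preparation and the SNR figures quoted above can be illustrated with a short sketch. It assumes the usual power-ratio definition of SNR, which the text does not spell out, and it substitutes Gaussian white noise and a sinusoid for the pink noise and TIMIT speech used in the experiment. The names `power`, `mix_at_snr`, and `snr_db` are hypothetical.

```python
import math
import random

def power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(clean, noise, snr_db_target):
    """Scale the noise so that clean + noise has the requested SNR in dB."""
    gain = math.sqrt(power(clean) / (power(noise) * 10.0 ** (snr_db_target / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]

def snr_db(clean, signal):
    """SNR of signal relative to clean, treating (signal - clean) as the noise."""
    error = [s - c for s, c in zip(signal, clean)]
    return 10.0 * math.log10(power(clean) / power(error))

rng = random.Random(0)
clean = [math.sin(0.05 * n) for n in range(2000)]   # stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in clean]        # white stand-in for pink noise
target = rng.uniform(0.0, 6.0)                      # SNR drawn uniformly from 0 to 6 dB
noisy = mix_at_snr(clean, noise, target)
input_snr = snr_db(clean, noisy)
# For a network output `estimate`, the improvement would be
# snr_db(clean, estimate) - input_snr.
```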

A number of researchers have reported superior results over linear filtering by using methods similar to the one described above [6,7,8,9]. Tamura [9] gives a detailed analysis of the role of the different layers in the networks, suggesting that the hidden layers provide a transformed representation of the signal and noise which facilitates their separation. Use of a neural network also allows for compensation of nonlinear channel effects, and some relaxation of the requirement that the additive noise be independent of the signal. In one application, Le and Mason report results on the method for noise introduced by a low bit-rate CELP encoder [7].

An additional variation on the filtering method results from restricting the number of units at a hidden layer to be less than the number of input or output units. This can provide noise suppression through dimensionality reduction similar to Ephraim's method based on signal subspace embedding [10]. Ephraim's idea is that the clean speech resides in a low-dimensional subspace of the noisy speech space; after first removing the dimensions which contain only noise, enhancement is performed in the subspace. Discussion of four-layer networks used for dimensionality reduction can be found in [11,12] (although results for speech enhancement have not been reported).

In still another variation, researchers at Defense Group Incorporated have implemented a recurrent structure by feeding back the network filter outputs. Their system is a hybrid which uses traditional enhancement methods to preprocess the data before feeding it into the neural network. Reported results appear favorable in comparison with a number of other traditional methods.

The advantage of time-domain filtering is the ease and efficiency of implementation. Effectively, the neural network approximates the conditional expectation $E[x_k\vert{\bf y}_k]$, where $x_k$ is the clean speech and ${\bf y}_k =[y_{k-M/2}, y_{k-M/2+1}, \cdots , y_{k+M/2}]$ is the windowed noisy input. Note that the conditional expectation corresponds to a linear estimator only when all signal statistics are Gaussian (clearly unrealistic for speech signals and real-world noise sources). This is a strong motivation for the use of nonlinear neural networks.
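As a worked illustration of the Gaussian case, suppose $x_k$ and ${\bf y}_k$ are jointly Gaussian with means $\mu_x$ and ${\boldsymbol\mu}_{\bf y}$, cross-covariance ${\bf \Sigma}_{x{\bf y}}$, and input covariance ${\bf \Sigma}_{{\bf y}{\bf y}}$ (these symbols are introduced here for the example and are not part of the original notation). The standard result for the conditional mean is then

$\displaystyle E[x_k\vert{\bf y}_k] = \mu_x + {\bf \Sigma}_{x{\bf y}} {\bf \Sigma}_{{\bf y}{\bf y}}^{-1} ({\bf y}_k - {\boldsymbol\mu}_{\bf y}),$

which is an affine function of ${\bf y}_k$, and a linear one for zero-mean signals. Without the Gaussian assumption the same expression still gives the best linear estimator in the MSE sense, but it need no longer coincide with the conditional expectation.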

Once trained, a single fixed neural network is used to provide speech estimates. However, this also underscores the disadvantages of the approach. Using a fixed network implies a single, fixed expectation inferred from the entire training set. The corresponding conditional probability density can be written as

$\displaystyle \rho(x_k\vert{\bf y}_k) = \frac{\rho({\bf y}_k\vert x_k) \rho(x_k)}{\rho({\bf y}_k)}.$ (3) 

Thus, assuming the expectation is constant is equivalent to assuming that both $x_k$ and ${\bf y}_k$ have constant density functions. In other words, both the noise and the speech signals would have to be stationary processes. This is clearly not the case. While training on a variety of different SNRs and speakers can greatly improve generalization (as seen in the experiment in this section), this does not explicitly account for the nonstationarity. Some researchers have incorporated pitch information as additional inputs to attempt to account somewhat for the nonstationarity of the speech. Moakes reported on this in the context of radial basis function networks [8]. Variations in the speech and noise statistics can also be addressed by using an estimate of the time-specific SNR as an additional input to the network. Related "switching"-based methods are addressed in Section 14.4.
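One way the SNR could be supplied as an extra input is sketched below. The text does not specify how the time-specific SNR is estimated; the power-subtraction estimate used here, the assumption that a noise power value `noise_power` is available (for instance from speech pauses), and the function names are all illustrative choices.

```python
import math

def frame_snr_db(window, noise_power, floor=1e-12):
    """Power-subtraction SNR estimate for one noisy window, in dB."""
    noisy_power = sum(v * v for v in window) / len(window)
    # Clamp so the logarithm stays defined when the noise estimate is too large.
    speech_power = max(noisy_power - noise_power, floor)
    return 10.0 * math.log10(speech_power / noise_power)

def augment_with_snr(window, noise_power):
    """Append the frame SNR estimate to the network input vector."""
    return list(window) + [frame_snr_db(window, noise_power)]

window = [1.0, -1.0, 1.0, -1.0]            # noisy window with power 1.0
inputs = augment_with_snr(window, noise_power=0.5)
```

The network would then have one more input than outputs, with the final input carrying the SNR estimate for the current window.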

In general, the direct time-domain filtering approach is most applicable for reducing fixed noise types, or for compensating a distortion that is associated with a specific recording or communication channel. For example, the latter case was considered by Dahl and Claesson [6], who trained a neural network on a specific speaker and noise environment for a car-phone application.

