
Spectral Subtraction

Figure: Spectral Subtraction Enhancement. The estimated noise spectrum is subtracted from the spectrum of a window of noisy speech. The noisy phase is combined with the result before computing the inverse DFT to produce the enhanced speech waveform.

The popularity of spectral subtraction is largely due to its relative simplicity and ease of implementation. As shown in Figure 14.4, the short-term power spectrum $ \Ph_y$ (magnitude squared of the short-term Fourier transform) of the noisy signal is computed, and an estimate of the short-term noise spectrum $ \Ph_n$ is subtracted out to produce the estimated spectrum $ \Ph_x$ of the clean speech. Explicitly,

$\displaystyle \Ph_x = [\Ph_y^\gamma - \alpha \Ph_n^\gamma]^{1/\gamma}$ (14) 

where the scaling factor $\alpha$ allows for emphasis or de-emphasis of the noise estimate, and the exponent $\gamma$ selects among several variants, including power subtraction ($\gamma = 1$) and magnitude subtraction ($\gamma = 0.5$, since the square root of the power spectrum is the magnitude spectrum).
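As a rough illustration of Equation (14), the following sketch applies the rule to one frame, bin by bin. The function name `spectral_subtract` and its argument names are hypothetical, and clipping negative differences to zero before the outer root is an assumption made here so that the result is a valid power spectrum.

```python
def spectral_subtract(p_y, p_n, alpha=1.0, gamma=1.0):
    """Apply P_x = [P_y^gamma - alpha * P_n^gamma]^(1/gamma) to each bin.

    p_y, p_n: equal-length sequences of non-negative power-spectrum values
    for the noisy signal and the noise estimate (hypothetical names).
    """
    if len(p_y) != len(p_n):
        raise ValueError("p_y and p_n must have the same number of bins")
    p_x = []
    for y, n in zip(p_y, p_n):
        diff = y ** gamma - alpha * n ** gamma
        # Assumption: negative differences are clipped to zero.
        p_x.append(max(diff, 0.0) ** (1.0 / gamma))
    return p_x


# Power subtraction (gamma = 1) and magnitude subtraction (gamma = 0.5)
power_variant = spectral_subtract([4.0, 9.0, 1.0], [1.0, 1.0, 4.0], gamma=1.0)
magnitude_variant = spectral_subtract([4.0, 9.0, 1.0], [1.0, 1.0, 4.0], gamma=0.5)
```

In the last bin the noise estimate exceeds the noisy power, so both variants return zero there.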

The square root of the estimate $ \Ph_x$ (the estimated magnitude spectrum) is combined with the phase of the original noisy signal to produce an estimate of the Fourier transform of $ x$. Finally, the inverse Fourier transform is applied, using the overlap-and-add method, to construct a time-domain estimate $ \hat{x}$ of the speech waveform. The assumption is that phase information is perceptually unimportant, so only an estimate of the magnitude of the speech is required. Central to the linear spectral subtraction method is the additivity of the speech and noise spectra in the Fourier transform domain, which allows for simple linear subtraction of the noise spectrum estimate.$^{5}$
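The processing chain just described can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `dft`, `idft`, and `enhance`, the frame length of 16, the 50% overlap, and the periodic Hann window are all assumptions chosen for brevity, and a naive $O(N^2)$ DFT is used only to keep the example free of external libraries. Only power subtraction ($\gamma = 1$) is shown, and `noise_power` is assumed to have been measured on frames windowed in the same way.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT; only the real part is returned."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def enhance(noisy, noise_power, frame_len=16, alpha=1.0):
    """Power spectral subtraction with noisy phase and 50% overlap-add."""
    if len(noise_power) != frame_len:
        raise ValueError("noise_power needs one value per DFT bin")
    hop = frame_len // 2
    # Periodic Hann window: copies shifted by half a frame sum to one.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / frame_len)
              for t in range(frame_len)]
    out = [0.0] * len(noisy)
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = [noisy[start + t] * window[t] for t in range(frame_len)]
        spec = dft(frame)
        new_spec = []
        for y_k, n_k in zip(spec, noise_power):
            # Subtract the noise power and clip negative results to zero.
            p_x = max(abs(y_k) ** 2 - alpha * n_k, 0.0)
            # Estimated magnitude combined with the phase of the noisy bin.
            new_spec.append(cmath.rect(math.sqrt(p_x), cmath.phase(y_k)))
        frame_out = idft(new_spec)
        for t in range(frame_len):
            out[start + t] += frame_out[t]  # overlap-and-add
    return out
```

With a noise estimate of zero in every bin, the function reproduces the input wherever two frames overlap, which is a convenient sanity check on the windowing and overlap-add steps.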

Typically, the noise spectrum $ \Ph_n$ is approximated from a window of the signal in which no speech is present. This requires that the signal be accurately segmented into speech and non-speech parts. A related approach, developed by Hirsch [20], estimates the noise level within a frequency subband by taking a histogram of spectral magnitudes over several successive time windows. The assumption is that the most frequently occurring value represents the magnitude of the noise in that band. Hence, care must be taken to compute the histogram over a span that contains a sufficient number of non-speech windows. Another approach, based on the assumption that the total histogram of the logarithmic spectral energies has a bimodal distribution, is presented in [21,22].
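The histogram idea can be illustrated with a short sketch. It is only a simplified reading of the description above, not the procedure of [20]: the function name `histogram_noise_estimate`, the equal-width binning, and the default of 40 bins are assumptions made for this example.

```python
def histogram_noise_estimate(magnitudes, num_bins=40):
    """Estimate the noise magnitude in each band as the histogram mode.

    magnitudes: list of time windows, each a list of per-band spectral
    magnitudes. Returns, for every band, the centre of the most populated
    histogram bin taken over all windows.
    """
    num_bands = len(magnitudes[0])
    estimate = []
    for band in range(num_bands):
        values = [window[band] for window in magnitudes]
        lo, hi = min(values), max(values)
        if hi == lo:
            # All windows agree, so the histogram has a single value.
            estimate.append(lo)
            continue
        width = (hi - lo) / num_bins
        counts = [0] * num_bins
        for v in values:
            # The maximum value is assigned to the last bin.
            idx = min(int((v - lo) / width), num_bins - 1)
            counts[idx] += 1
        peak = counts.index(max(counts))
        estimate.append(lo + (peak + 0.5) * width)
    return estimate


# Band 0: eight noise-only windows (magnitude 1) and two speech windows (10).
# Band 1: constant magnitude 2 in every window.
windows = [[1.0, 2.0]] * 8 + [[10.0, 2.0]] * 2
noise_levels = histogram_noise_estimate(windows, num_bins=9)
```

The mode follows the noise level only when non-speech windows dominate the span, which is the caveat raised above.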

Regardless of how the noise statistics are estimated, the true short-term spectrum of the noise in the specific segment being processed will always have nonzero variance (this is true even for a stationary signal). Thus the noise estimate $ \Ph_n$ will always overestimate or underestimate the true noise level in that segment. This represents a fundamental problem with spectral subtraction and other transform-based methods. The consequence is that when $ \Ph_y$ (which includes the true short-term noise spectrum) is near the level of the estimated noise spectrum, spectral subtraction produces randomly located negative values of the quantity $ \Ph_y -\Ph_n$. These negative values are clipped at zero to give a valid power spectrum, resulting in a series of annoying low-level tones ("musical noise") throughout the estimated signal $ \hat{x}$.
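A small simulation gives a feel for how often the difference goes negative. The sketch below assumes stationary white Gaussian noise with no speech present, estimates the noise spectrum by averaging the periodograms of all frames, and counts the bins in which a single frame falls below that average. The names `periodogram` and `negative_fraction`, the frame length, and the number of frames are hypothetical choices.

```python
import cmath
import math
import random


def periodogram(frame):
    """Squared DFT magnitude for bins 1 .. N/2 - 1 (DC and Nyquist omitted)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(1, n // 2)]


def negative_fraction(num_frames=200, frame_len=32, seed=0):
    """Fraction of (frame, bin) pairs where P_y - P_n is negative."""
    rng = random.Random(seed)
    frames = [[rng.gauss(0.0, 1.0) for _ in range(frame_len)]
              for _ in range(num_frames)]
    spectra = [periodogram(f) for f in frames]
    num_bins = len(spectra[0])
    # Noise estimate: average short-term power in each bin over all frames.
    p_n = [sum(s[k] for s in spectra) / num_frames for k in range(num_bins)]
    negatives = sum(1 for s in spectra for k in range(num_bins)
                    if s[k] - p_n[k] < 0.0)
    return negatives / (num_frames * num_bins)


fraction = negative_fraction()
```

Under these assumptions each periodogram bin is approximately exponentially distributed, so well over half of the noise-only bins fall below their own average and would be clipped to zero; the bins that survive form the isolated peaks heard as tones.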

While a great deal of work has been done to try to reduce these effects [23,24,25,5,26,1], they can be eliminated completely only if the analysis window is made infinitely long. This would allow the "short-term" noise spectrum to converge to the true spectrum. On the other hand, the nonstationarity of speech dictates that a finite short-term window be used.

Note also that spectral subtraction achieves a maximum-likelihood estimate of the (stationary) speech signal only in the limit as the analysis window length goes to infinity. This is in contrast to the Kalman filter (KF) approach, which can provide maximum-likelihood estimates using only finite-length input windows. However, to be optimal, the KF requires prior knowledge of the true autoregressive model $ f(\cdot)$. In practice, this model can only be estimated. Spectral subtraction, by comparison, is model-free, but it sacrifices maximum-likelihood estimates in order to accommodate the nonstationarity of the signals.
