Spectral Subtraction

The popularity of spectral subtraction is largely due to its relative
simplicity and ease of implementation. As shown in
Figure 14.4, the short-term power spectrum
(magnitude squared of the short-term Fourier transform) of the noisy
signal is computed, and an *estimate* of the short-term noise
spectrum is subtracted out to produce the estimated spectrum
of the clean speech. Explicitly,

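$$|\hat{X}(\omega)|^{a} = |Y(\omega)|^{a} - \alpha\,|\hat{N}(\omega)|^{a} \qquad (14)$$

where $Y(\omega)$ is the short-term Fourier transform of the noisy signal, $\hat{N}(\omega)$ is the noise estimate, and $\hat{X}(\omega)$ is the estimate for the clean speech. The scaling factor $\alpha$ allows for emphasis or deemphasis of the noise estimate, and the exponent $a$ allows for several variants, including power subtraction ($a = 2$) and magnitude subtraction ($a = 1$).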

The resulting magnitude estimate is combined with the phase from the
original noisy signal to produce an estimate of the Fourier
transform of the clean speech. Finally, the inverse Fourier transform
is applied with the overlap-and-add method to construct a time-domain
estimate of the speech waveform. The assumption is that the phase
information is not perceptually important, so only an estimate
of the magnitude of the speech is required.
Central to the linear spectral subtraction method is the additivity of
the speech and noise spectra in the Fourier transform domain, allowing
for simple linear subtraction of the noise spectrum estimate^{5}.
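
The processing chain described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation behind Figure 14.4: the function names, the 50%-overlap Hann framing, and the parameter names `alpha` (noise scaling factor) and `a` (exponent, 2 for power subtraction, 1 for magnitude subtraction) are all choices made for this example.

```python
import numpy as np


def _hann(frame_len):
    # Periodic Hann window: copies shifted by frame_len / 2 sum to exactly 1,
    # so plain overlap-and-add reconstructs the signal without extra scaling.
    return 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)


def estimate_noise_psd(noise, frame_len=256):
    """Average short-term power spectrum of a speech-free segment."""
    hop = frame_len // 2
    window = _hann(frame_len)
    frames = np.array([noise[s:s + frame_len] * window
                       for s in range(0, len(noise) - frame_len + 1, hop)])
    return np.mean(np.abs(np.fft.rfft(frames, axis=1)) ** 2, axis=0)


def spectral_subtract(y, noise_psd, frame_len=256, alpha=1.0, a=2.0):
    """Subtract a scaled noise estimate from each short-term spectrum.

    y         : 1-D noisy signal
    noise_psd : estimated noise power spectrum (frame_len // 2 + 1 bins)
    alpha, a  : assumed names for the scaling factor and the exponent
    """
    hop = frame_len // 2
    window = _hann(frame_len)
    noise_a = noise_psd ** (a / 2.0)            # convert |N|^2 to |N|^a
    out = np.zeros(len(y))
    for start in range(0, len(y) - frame_len + 1, hop):
        Y = np.fft.rfft(y[start:start + frame_len] * window)
        mag_a = np.abs(Y) ** a - alpha * noise_a
        mag_a = np.maximum(mag_a, 0.0)          # clip negative values at zero
        # Keep the phase of the noisy signal; only the magnitude is modified.
        X_hat = mag_a ** (1.0 / a) * np.exp(1j * np.angle(Y))
        out[start:start + frame_len] += np.fft.irfft(X_hat, n=frame_len)
    return out


# Usage sketch: a sinusoid standing in for speech, buried in white noise.
rng = np.random.default_rng(0)
n = 8192
clean = np.sin(2 * np.pi * 0.05 * np.arange(n))
noisy = clean + 0.3 * rng.standard_normal(n)
noise_psd = estimate_noise_psd(0.3 * rng.standard_normal(n))
enhanced = spectral_subtract(noisy, noise_psd)
print("error power before:", np.mean((noisy - clean) ** 2))
print("error power after: ", np.mean((enhanced - clean) ** 2))
```

Setting `alpha=0.0` disables the subtraction, in which case the overlap-and-add output matches the input everywhere except the first and last half frame, which is a convenient sanity check on the framing.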

Typically, the noise spectrum is approximated from a window of the signal where no speech is present. This requires that the signal be accurately segmented into speech and non-speech parts. A related approach developed by Hirsch [20] estimates the noise level within a frequency subband by taking a histogram of spectral magnitudes over several successive time windows. The assumption is that the most frequently occurring value represents the magnitude of the noise in that band. Hence, care must be taken to compute the histogram over a span of the signal that contains a sufficient number of non-speech windows. Another approach, based on the assumption of a bimodal distribution in the total histogram of the logarithmic spectral energies, is presented in [21,22].
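
The histogram idea can be illustrated with a short sketch. This is a simplified reading of the approach, not a reproduction of the algorithm in [20]: the function name `histogram_noise_estimate`, the number of histogram bins, and the synthetic data are assumptions made for the example.

```python
import numpy as np


def histogram_noise_estimate(mags, n_bins=40):
    """Per-band noise magnitude taken as the mode of a magnitude histogram.

    mags : array of shape (n_frames, n_bands) holding spectral magnitudes
           from successive time windows.
    Returns an array of n_bands values, each the centre of the most
    populated histogram bin for that band.
    """
    n_bands = mags.shape[1]
    noise = np.empty(n_bands)
    for k in range(n_bands):
        counts, edges = np.histogram(mags[:, k], bins=n_bins)
        peak = np.argmax(counts)
        noise[k] = 0.5 * (edges[peak] + edges[peak + 1])
    return noise


# Synthetic check: 70% of the frames hold only noise (magnitude near 1.0),
# the remaining 30% hold much larger "speech" magnitudes.
rng = np.random.default_rng(0)
n_frames, n_bands = 400, 8
mags = 1.0 + 0.05 * rng.standard_normal((n_frames, n_bands))
speech_frames = rng.random(n_frames) < 0.3
mags[speech_frames] = rng.uniform(3.0, 10.0, size=(speech_frames.sum(), n_bands))
print(histogram_noise_estimate(mags))
```

The estimate is only meaningful when noise-only frames dominate the histogram; if speech is present in most of the windows, the mode shifts toward the speech level, which is the caveat raised above.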

Regardless of how the noise statistics are estimated, the true
short-term spectrum of the noise for the specific segment being
processed will always have finite variance (this is true even for a
stationary signal). Thus the noise estimate will *always*
over- or underestimate the true noise level. This represents a
fundamental problem with spectral subtraction and other
transform-based methods. The consequence is that when the noisy
spectrum (which includes the true short-term noise signal) is near
the level of the estimated noise spectrum, spectral subtraction
results in some randomly located negative values of the difference
between the two. These negative values are clipped at zero to give a
valid power spectrum, resulting in a series of annoying low-level tones
("musical noise") throughout the estimated signal.
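
A small numerical experiment makes the effect concrete. The sketch below assumes stationary white Gaussian noise with no speech at all, Hann-windowed frames of 256 samples, and power subtraction with the long-term average spectrum as the noise estimate; all of these are choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_len, n_frames = 256, 2000

# Noise-only input: the "clean speech" is identically zero.
frames = rng.standard_normal((n_frames, frame_len)) * np.hanning(frame_len)
power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # short-term power spectra
noise_est = power.mean(axis=0)                     # long-term average per bin

diff = power - noise_est                 # power subtraction, frame by frame
frac_negative = np.mean(diff < 0)        # share of bins that must be clipped
clipped = np.maximum(diff, 0.0)          # what survives the clipping
residual_ratio = clipped.mean() / noise_est.mean()

print("fraction of negative bins:", frac_negative)
print("residual power / noise power:", residual_ratio)
```

Even though the noise estimate equals the true average spectrum here, well over half of the time-frequency bins come out negative and are set to zero, while the rest survive as isolated peaks at random locations. After resynthesis, such isolated peaks are heard as the short tonal bursts described above.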

While a great deal of work has been done to try to reduce these effects [23,24,25,5,26,1], they can only be eliminated completely if the analysis window is increased to be of infinite length. This would allow the "short-term" noise spectrum to converge to the true spectrum. On the other hand, speech dictates that a finite short-term window be used to account for nonstationarities in the signal.

Note also that spectral subtraction only achieves a maximum-likelihood estimate of the (stationary) speech signal when the analysis window goes to infinity. This is in contrast to the Kalman filter (KF) approach, which can provide maximum-likelihood estimates using only finite-length input windows. However, to be optimal, the KF requires prior knowledge of the true autoregressive model. In practice, this model can only be estimated. Spectral subtraction, on the other hand, is model-free, but sacrifices maximum-likelihood estimates in order to accommodate the nonstationarity of the signals.