Figure 1 - Spectrogram of the word "compute." The vertical axis represents frequencies up to 8000 Hz, and time increases toward the right along the horizontal axis. The colors mark the most important acoustic peaks for a given time frame: red represents the highest energies, followed in decreasing order of importance by orange, yellow, green, cyan, blue, and magenta. Gray areas have even less energy, and white areas fall below a threshold decibel level.
An experienced spectrogram reader has no trouble identifying the word "compute" from the visually salient patterns in the image above. To give one example, the vertical burst of energy at the extreme right of the spectrogram, followed by a red area at the bottom with lesser energy above it, is a typical pattern for the sound 't' at the end of a syllable or word. The other speech sounds, or phonemes, in the word "compute" have equally distinct shapes: the initial unstressed syllable /kh ^ m/; the silence and bilabial burst of /pc ph/; and the stressed vowel /ju/, in which the falling F2 (the second formant) reflects the passage from a high front vowel to a high back vowel, and the subsequent rise in F2 toward the alveolar locus of 1800 Hz signals the proximity of the alveolar plosive.
Speech consists of vibrations produced in the vocal tract. The vibrations themselves can be represented by speech waveforms. It is not possible to read the phonemes in a waveform, but if we analyze the waveform into its frequency components, we obtain a spectrogram which can be deciphered.
The notion of frequency is very important in many branches of science. When physical events are cyclical, or nearly cyclical, frequency is a measure of how many times the cycle repeats per unit of time. The unit of frequency is the hertz, abbreviated as Hz, which means one cycle per second. If something happens 10 times per second, it has a frequency of 10 Hz.
The audible frequency range in human beings extends from about 20 Hz to 20,000 Hz (20 kHz). We cannot hear vibrations which occur fewer than about 20 times per second, nor can we detect frequencies greater than 20 kHz. Dolphin vocalizations and bat hunting cries, however, have frequency components up to 100 kHz, and members of these species have a correspondingly more extensive audible range than the primates. Speech sounds contain energy at all frequencies in the audible range, although it is thought that most phonetic information is concentrated below 8000 Hz. Telephone speech cuts off frequencies above 3500 Hz, yet we are able to communicate over the telephone without major difficulty, although with more need for clear speech and more requests for repetition.
We apply a mathematical technique called Fourier analysis to the speech waveform in order to discover what frequencies are present at any given moment in the speech signal. The result of Fourier analysis is a spectrum. After we have computed the spectrum for one short section or window (typically 5 to 20 milliseconds) of speech, we compute the spectrum for the adjoining window, and so on to the end of the waveform. In general, neighboring spectra vary slowly and smoothly, reflecting the slow movements of the vocal tract relative to the length of the analysis window.
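The window-by-window analysis described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used to produce Figure 1: the function names, the 16000 Hz sample rate, the synthetic 1000 Hz test tone, and the Hann taper are all assumptions made for the example, and a plain discrete Fourier transform is used instead of a fast Fourier transform to keep the code short.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Return the magnitudes of DFT bins 0..N/2 for one window of samples."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n))
        mags.append(abs(total))
    return mags

def short_time_spectra(samples, window_len, hop):
    """Compute one spectrum per window, stepping `hop` samples at a time."""
    spectra = []
    for start in range(0, len(samples) - window_len + 1, hop):
        frame = samples[start:start + window_len]
        # Hann taper to soften the window edges (an assumption; the text
        # does not say which window shape is used).
        tapered = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (window_len - 1)))
                   for i, x in enumerate(frame)]
        spectra.append(dft_magnitudes(tapered))
    return spectra

# Hypothetical test signal: a 1000 Hz tone sampled at 16000 Hz for 50 ms.
SAMPLE_RATE = 16000
WINDOW_LEN = 160  # 10 ms, within the 5 to 20 ms range mentioned above
tone = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(800)]

# Adjoining windows: the hop equals the window length.
spectra = short_time_spectra(tone, window_len=WINDOW_LEN, hop=WINDOW_LEN)

# Each bin is SAMPLE_RATE / WINDOW_LEN = 100 Hz wide.
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
peak_hz = peak_bin * SAMPLE_RATE / WINDOW_LEN
print(len(spectra), "spectra; strongest component near", peak_hz, "Hz")
```

With a steady tone every window yields essentially the same spectrum. In real speech, neighboring spectra differ slightly from one another as the vocal tract moves.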
A spectrogram such as the one at the top of this page is created by displaying all of the spectra computed from the speech waveform side by side. The vertical axis in a spectrogram represents frequency, with 0 Hz at the bottom. The horizontal lines visible in the spectrogram on this page are spaced 1000 Hz apart along the frequency axis, so that the spectrogram spans 8000 Hz in total. Each spectrum computed by the Fourier transform is displayed parallel to this vertical or y-axis. The horizontal axis represents time; as we move right along the x-axis we shift forward in time, traversing one spectrum after another. Spectrograms are normally computed and kept in computer memory as a two-dimensional array of acoustic energy values. For a given spectrogram S, the strength of a given frequency component f at a given time t in the speech signal is represented by the darkness or color of the corresponding point S(t,f).
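As a rough illustration of the two-dimensional array described above, the sketch below stores a tiny made-up spectrogram as a list of spectra and looks up S(t,f) by converting a time and a frequency into array indices. The function names, the 10 ms frame step, the 1000 Hz bin width, and the energy values are hypothetical and chosen only to keep the example readable.

```python
def energy_at(S, time_s, freq_hz, frame_step_s, bin_width_hz):
    """Look up S(t, f) in a spectrogram stored as S[frame][bin]."""
    frame = int(time_s / frame_step_s)              # which spectrum (x-axis)
    bin_index = int(round(freq_hz / bin_width_hz))  # which frequency (y-axis)
    return S[frame][bin_index]

def to_display_rows(S):
    """Rearrange S[frame][bin] into image rows, top row first.

    The top row holds the highest frequency and the bottom row holds 0 Hz,
    matching the usual spectrogram layout.
    """
    n_frames = len(S)
    n_bins = len(S[0])
    return [[S[t][f] for t in range(n_frames)]
            for f in reversed(range(n_bins))]

# Made-up data: 3 spectra taken 10 ms apart, 9 bins covering 0..8000 Hz.
S = [
    [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.7, 0.5, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.3, 0.8, 0.2, 0.0, 0.0, 0.0, 0.0],
]

# Energy at t = 15 ms, f = 2000 Hz: second spectrum, third bin.
value = energy_at(S, time_s=0.015, freq_hz=2000,
                  frame_step_s=0.010, bin_width_hz=1000)
rows = to_display_rows(S)
print(value, rows[-1])
```

Storing one spectrum per row of the array and transposing only for display keeps the computation order (one window after another) separate from the drawing order.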
Spectrograms have traditionally been displayed in a gray-scale rendition, where the darkness of a given point is proportional to the energy at that time and frequency. Click here to see a gray-scale version of Figure 1 above. At the CSLU we are experimenting with the use of color to highlight the important features of a spectrogram. In the spectrogram in Figure 1 we use shades of red to mean increasing energy along the frequency axis, blue to mean decreasing energy, and yellow and green to mean an energy maximum. Areas which are white do not have enough energy to be of interest to us.
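The traditional gray-scale rendition mentioned above can be sketched as a mapping from energy to a gray level. The sketch below is an assumption-laden illustration rather than the display code used at the CSLU: the function name, the decibel scaling, the 50 dB dynamic range, and the 8-bit gray values (0 for black, 255 for white) are all choices made for the example.

```python
import math

def gray_level(energy, max_energy, dynamic_range_db=50.0):
    """Map an energy value to an 8-bit gray level.

    0 is black (the strongest energy) and 255 is white. Points more than
    `dynamic_range_db` below the maximum are left white.
    """
    if energy <= 0 or max_energy <= 0:
        return 255
    # How far this point lies below the strongest point, in decibels.
    db_below_peak = 10.0 * math.log10(max_energy / energy)
    if db_below_peak >= dynamic_range_db:
        return 255  # below the threshold: not enough energy to draw
    if db_below_peak <= 0:
        return 0    # at or above the reference maximum: fully black
    return int(round(255 * db_below_peak / dynamic_range_db))

# Hypothetical energies relative to a maximum of 1.0.
strongest = gray_level(1.0, 1.0)    # black
weaker = gray_level(0.1, 1.0)       # 10 dB down: dark gray
too_weak = gray_level(1e-6, 1.0)    # 60 dB down: white
print(strongest, weaker, too_weak)
```

A color display replaces the single gray value with a color chosen by some rule for each point, while the underlying array of energy values stays the same.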
We often display the waveform and spectrogram for the same segment of speech one on top of the other. In such displays, it is easy to see the relation between patterns in the waveform and the corresponding patterns in the spectrogram. Click here to see the combined waveform and spectrogram for the word shown in Figure 1.
To understand how to read the patterns in spectrograms, you first need to know something about phonemes. What are phonemes?