Each phoneme is distinguished by its own unique pattern in the spectrogram. For voiced phonemes, the signature involves large concentrations of energy called formants. Within each formant, and typically across all active formants, there is a characteristic waxing and waning of energy at all frequencies, which is the most salient characteristic of what we call the human voice. This cyclic pattern is caused by the repetitive opening and closing of the vocal cords, which occurs on average about 125 times per second (125 Hz) in the adult male and approximately twice as fast (250 Hz) in the adult female, giving rise to the sensation of pitch. Voicing is a relatively long-lasting phenomenon in speech; during voicing, the spectral or frequency characteristics of a formant evolve as phonemes unfold and succeed one another. Formants that are relatively unchanging over time are found in the monophthong vowels and the nasals; formants that are more variable over time are found in the diphthong vowels and the approximants, but in all cases the rate of change is relatively slow. In addition, there is good spectral continuity, the exceptions being the singularities at the beginning and end of the nasals (/m/, /n/, and /N/) and the lateral /l/.
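To make the link between vocal cord cycles and pitch concrete, here is a minimal sketch in Python. It stands in for the voice with an idealized train of pulses (one pulse per vocal cord cycle) and recovers the repetition rate by autocorrelation. The function names, the 8000 Hz sample rate, and the 60 to 400 Hz search range are assumptions chosen for illustration; real speech is far messier than a clean pulse train.

```python
def pulse_train(f0_hz, sample_rate_hz, duration_s):
    """Idealized voicing: one unit pulse per vocal cord cycle (an assumption)."""
    period = round(sample_rate_hz / f0_hz)  # samples per cycle
    n = int(sample_rate_hz * duration_s)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def estimate_f0(signal, sample_rate_hz, fmin_hz=60.0, fmax_hz=400.0):
    """Estimate the repetition rate as the lag with the largest autocorrelation."""
    min_lag = int(sample_rate_hz / fmax_hz)
    max_lag = int(sample_rate_hz / fmin_hz)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if score > best_score:  # strict '>' keeps the shortest best lag
            best_lag, best_score = lag, score
    return sample_rate_hz / best_lag


SAMPLE_RATE_HZ = 8000  # assumed for this sketch
for label, f0 in (("adult male", 125), ("adult female", 250)):
    voiced = pulse_train(f0, SAMPLE_RATE_HZ, 0.25)
    estimate = estimate_f0(voiced, SAMPLE_RATE_HZ)
    print(f"{label}: about {estimate:.0f} Hz, one cycle every {1000 / estimate:.0f} ms")
```

At 125 Hz the cycles are 8 ms apart and at 250 Hz they are 4 ms apart; this spacing is what appears as the regular waxing and waning of energy along the time axis of the spectrogram.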
Formant values can vary widely from person to person, but the spectrogram reader learns to recognize patterns which are independent of particular frequencies and which identify the various phonemes with a high degree of reliability.
The monophthong vowels have strong stable formants; in addition, these vowels can usually be easily distinguished by the frequency values of the first two or three formants, which are called F1, F2, and F3. For these reasons the monophthong vowels are often used to illustrate the concept of formants, but it is important to remember that all voiced phonemes have formants, even if they are not as easy to recognize and classify as the monophthong vowel formants. Voiceless sounds are not usually said to have formants; instead, the plosives should be visualized as a great burst of energy across all frequencies occurring after relative silence, while the aspirates and fricatives are better considered as clouds or oceans of relatively smooth energy along both the time and frequency axes.
In the vowels, F1 can vary from 300 Hz to 1000 Hz. The lower it is, the closer the tongue is to the roof of the mouth. The vowel /i:/ as in the word 'beet' has one of the lowest F1 values - about 300 Hz; in contrast, the vowel /A/ as in the word 'bought' (or 'Bob' in speakers who distinguish the vowels in the two words) has the highest F1 value - about 950 Hz. Pronounce these two vowels and try to determine how your tongue is configured for each.
F2 can vary from 850 Hz to 2500 Hz; the F2 value tracks the frontness or backness of the highest part of the tongue during the production of the vowel: the farther forward the tongue, the higher the F2. In addition, lip rounding causes a lower F2 than unrounded lips do. For example, /i:/ as in the word 'beet' has an F2 of 2200 Hz, the highest F2 of any vowel. In the production of this vowel the tongue tip is quite far forward and the lips are unrounded. At the opposite extreme, /u/ as in the word 'boot' has an F2 of 850 Hz; in this vowel the tongue tip is very far back, and the lips are rounded.
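The relationship between the first two formants and tongue position can be sketched as a small Python function. It rescales F1 and F2 into the ranges quoted above and reads them as rough height and frontness scores. The linear mapping and the function names are assumptions made for illustration only; note also that because lip rounding lowers F2, the frontness score cannot separate a backed tongue from rounded lips.

```python
# Formant ranges quoted in the text above (Hz).
F1_MIN, F1_MAX = 300.0, 1000.0
F2_MIN, F2_MAX = 850.0, 2500.0


def _unit(value, low, high):
    """Clamp value to [low, high] and rescale it to the range 0..1."""
    clamped = min(max(value, low), high)
    return (clamped - low) / (high - low)


def tongue_position(f1_hz, f2_hz):
    """Return (height, frontness), each between 0 and 1.

    height:    1.0 means the tongue is close to the roof of the mouth (low F1).
    frontness: 1.0 means the tongue is far forward with unrounded lips (high F2).
    The linear mapping is an illustrative assumption, not a measured relation.
    """
    height = 1.0 - _unit(f1_hz, F1_MIN, F1_MAX)
    frontness = _unit(f2_hz, F2_MIN, F2_MAX)
    return height, frontness


# /i:/ as in 'beet': F1 about 300 Hz, F2 about 2200 Hz (values from the text).
height, frontness = tongue_position(300, 2200)
print(f"/i:/ height={height:.2f} frontness={frontness:.2f}")
```

Both scores come out high for /i:/, matching the description of a tongue that is close to the roof of the mouth and far forward.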
F3 is also important in determining the phonemic quality of a given speech sound, and the higher formants such as F4 and F5 are thought to be significant in determining voice quality.
Click here to look at the formants and other acoustic markers, which we call spectral cues, for the phonemes of American English.
Now you know about waveforms, spectrograms, phonemes, and formants. It's time to look at some spectrograms.
Figure 1 - The formants for the vowel /@/ as in the word 'bat', which has nearly equal frequency separation between F1 and F2, on the one hand, and F2 and F3, on the other. In our color spectrograms, the formants can be traced by following the green and yellow bands between the increasing red and decreasing blue bands. In this vowel, F1 is at about 900 Hz, F2 at about 1600 Hz, and F3 at about 2400 Hz. There is also an F4 at 3600 Hz.
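The "nearly equal separation" described in the caption can be checked with a few lines of Python. This is only arithmetic on the approximate values quoted for Figure 1; the function name is a hypothetical helper, not part of any tool.

```python
def formant_spacings(formants_hz):
    """Return the gap in Hz between each pair of neighbouring formants."""
    names = list(formants_hz)
    return {
        f"{low}-{high}": formants_hz[high] - formants_hz[low]
        for low, high in zip(names, names[1:])
    }


# Approximate values read from the Figure 1 caption.
figure_1 = {"F1": 900, "F2": 1600, "F3": 2400, "F4": 3600}
for pair, gap in formant_spacings(figure_1).items():
    print(f"{pair}: {gap} Hz")
```

The F1 to F2 gap (700 Hz) and the F2 to F3 gap (800 Hz) are close to one another, while F4 sits farther away (1200 Hz above F3).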