22 Language v1.3

General Description

The 22 Language corpus consists of telephone speech from 22 languages: Eastern Arabic, Cantonese, Czech, Farsi, French, German, Hindi, Hungarian, Japanese, Korean, Malay, Mandarin, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Swahili, Tamil, Vietnamese, and English. Unfortunately, French is not available. The corpus contains fixed-vocabulary utterances (e.g. days of the week) as well as fluent continuous speech. We were expecting at least 300 callers in each language. Each utterance is verified by a native speaker to determine whether the caller followed instructions when answering the prompts. Some of the calls in each language are transcribed orthographically.

Recording Details

All of the data in this corpus were collected over digital telephone lines. The digital data were recorded with the CSLU T1 digital data collection system. These files were sampled at 8 kHz with 8 bits per sample and stored as µ-law (ulaw) files. All of the wave files were then converted to RIFF format with 16-bit linear coding.

Directory Structure

There are several top-level directories in this distribution: docs, labels, misc, speech, and trans. The speech directory contains the speech data files. Each speech filename has the following structure:
For example:

This utterance is from English speaker 105 and contains the answer to the question "What is your native language?". As a participant proceeds through the data collection protocol, he is asked a series of questions. Each response is stored as a separate speech file. The utterance type code relates the recorded utterance to the protocol questions. The description of the protocol shows all of the utterance codes. These audio and text files are subdivided into directories based on their call number mod 10. So, these files would be found in /speech/10.

Verification

Each utterance included in the 22 Language Corpus has gone through a process of verification, performed by native speakers of each language. The verifiers were asked to listen to each utterance and decide whether the speaker responded appropriately to the prompt. In addition, the verifiers made judgements about the age, gender, and dialect of each speaker. Two native speakers verified the utterances in each language independently. Subsequently, they reexamined each utterance on which they disagreed and produced an info file containing the 'resolved' judgements.

Note: we resolved differences in Spanish, Vietnamese, and Swahili by choosing the verifier whose responses were overwhelmingly correct. For the other languages in the corpus, we resolved every disagreement by hand.

Initially we asked the verifiers to make two judgements that are not now included in the release:
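Returning to the Recording Details section: the conversion from 8-bit µ-law samples to 16-bit linear RIFF files can be illustrated with a short Python sketch. It is not the tool CSLU used, which this document does not describe. It assumes the standard G.711 µ-law expansion and a headerless, single-channel input sampled at 8 kHz. The function names `ulaw_byte_to_linear` and `ulaw_to_wav_bytes` are hypothetical.

```python
import io
import struct
import wave

BIAS = 0x84  # bias constant used by the standard G.711 mu-law expansion


def ulaw_byte_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample."""
    u = ~byte & 0xFF               # mu-law codes are stored bit-inverted
    exponent = (u >> 4) & 0x07     # 3-bit segment number
    mantissa = u & 0x0F            # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if (u & 0x80) else magnitude


def ulaw_to_wav_bytes(ulaw_data: bytes, sample_rate: int = 8000) -> bytes:
    """Return a mono 16-bit linear RIFF/WAVE file built from raw mu-law bytes.

    Assumption: the input has no header and holds one channel.
    """
    samples = [ulaw_byte_to_linear(b) for b in ulaw_data]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)        # 2 bytes = 16-bit linear coding
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))
    return buffer.getvalue()
```

Because the distributed wave files are already in 16-bit linear RIFF format, a sketch like this would only matter when working with the original µ-law recordings.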