Multilanguage Telephone Speech v1.2
The OGI Multi-language Telephone Speech Corpus consists of telephone speech from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. The corpus contains fixed-vocabulary utterances (e.g., days of the week) as well as fluent continuous speech. The current release includes recorded utterances from about 2052 speakers, totaling about 38.5 hours of speech.
Each subject called the CSLU data collection system by dialing a toll-free number. An analog telephone line was connected to a Gradient Technologies box, which recorded the data from incoming calls. The sampling rate was 8 kHz, and the files were stored in 16-bit linear format on a UNIX file system. Each utterance was recorded as a separate file.
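To make the recording format concrete, the sketch below converts between duration and raw sample size for 8 kHz, 16-bit linear audio. It assumes single-channel recordings and counts only the sample payload, ignoring any file header; neither detail is stated in this description, and the function names are illustrative only.

```python
SAMPLE_RATE_HZ = 8_000   # sampling rate given in the corpus description
BYTES_PER_SAMPLE = 2     # 16-bit linear samples


def payload_bytes(seconds: float, channels: int = 1) -> int:
    """Raw sample bytes for a recording of the given length.

    Assumption: mono audio by default, no header bytes counted.
    """
    return round(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels)


def duration_seconds(num_payload_bytes: int, channels: int = 1) -> float:
    """Length in seconds of a raw sample payload of the given size."""
    return num_payload_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels)


# A one-minute utterance under these assumptions.
one_minute_bytes = payload_bytes(60)

# Rough payload size of about 38.5 hours of speech.
corpus_bytes = payload_bytes(38.5 * 3600)

print(one_minute_bytes, corpus_bytes)
```

Under these assumptions, one minute of speech occupies 960,000 bytes of samples, and 38.5 hours comes to roughly 2.2 GB before any headers or compression.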
This corpus was collected and developed in 1992.
Most subjects were respondents to postings on Usenet newsgroups. Subjects were asked to contribute their voices to science to help with the research.
As per the protocol (see below), each caller was asked to speak for one minute about any topic. In six of the languages, some of these files, referred to as "stories", were selected for hand-generated fine phonetic transcription. The languages were English (208), German (101), Hindi (68), Japanese (64), Mandarin (70), and Spanish (108). The numbers in parentheses indicate the number of "stories" transcribed for each language.
Y. K. Muthusamy, "A Segmental Approach to Automatic Language Identification," Ph.D. Thesis, OGI Technical Report No. CSLU 93-002, Nov. 24, 1993.
Y. K. Muthusamy, R. A. Cole, and B. T. Oshika, "The OGI Multi-language Telephone Speech Corpus," Proceedings of the International Conference on Spoken Language Processing, Banff, Alberta, Canada, October 1992.