Current Corpora

Pricing and ordering details can be found online at OHSU's Office of Technology & Research Collaborations. Follow the menu options under Industrial Opportunities to Speech & Language or contact Michele Gunness, Licensing Associate & Compliance Manager at 503-494-4184.

Other questions can be directed to our corpus group manager.

22 Language v1.5 * (sample)
The 22 Language corpus consists of telephone speech from 21 languages. Some of the calls in each language are transcribed orthographically.
Alphadigit v1.3 (sample)
The Alphadigit Corpus is a collection of 78,044 examples from 3,025 speakers saying 6 digit strings of letters and digits over the telephone.
Apple Words and Phrases v1.3 (sample)
Developed with support from Apple Computer, Inc. A total of 3008 calls were collected; each caller repeated a list of command phrases as prompted.
Cellular Words and Phrases v1.3 (sample)
Consists of utterances gathered from 336 callers who were using cellular telephones. Each caller listened and responded to a series of pre-recorded prompts.
The clearspeechjph corpus contains microphone speech from a single speaker (JPH), who spoke 140 sentences (70 sentences each from Material A and Material B) in both "clear" and "conversational" speaking styles.
Foreign Accented English v1.2 (sample)
The corpus contains 4925 telephone quality utterances from native speakers of 23 languages speaking English.
Isolet v1.3 (sample)
Isolet is a corpus of letters of the English alphabet spoken in isolation. The database consists of 7800 spoken letters, 2 productions of each letter by 150 speakers.
Kids' Speech v1.1 (sample)
This final release of the Kids' Speech Corpus comprises spontaneous and scripted utterances from children in grades K through 10. All children read approximately 60 items from a total list of 319 phonetically-balanced but simple words, sentences, or digit strings. Each utterance of spontaneous speech begins with a recitation of the alphabet and contains a monologue of about one minute in duration. Orthographic transcriptions of each spontaneous utterance are included, and transcriptions of the scripted utterances are available via table lookup. All files have been verified for accuracy.
Multi Channel Overlapping Numbers Corpus (MONC) v1.0 (sample)
A portion of the Numbers corpus played through loudspeakers, re-recorded on a 12-channel table-top microphone array in a meeting room.
NOTE: This corpus is currently not available for commercial use.
Multi-Language Telephone Speech v1.2 (sample)
The OGI Multi-language Telephone Speech Corpus consists of telephone speech from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese. Time-aligned phonetic transcriptions are available for some of the utterances.
Names v1.3 (sample)
The Names Corpus is a collection of first and last name utterances. All of the utterances have been phonetically transcribed.
National Cellular v2.3 * (sample)
Consists of cellular telephone speech from 2336 callers from locations throughout the United States.
Numbers v1.3 (sample)
The Numbers Corpus is a collection of naturally produced numbers. The utterances were taken from other CSLU telephone speech data collections, and include isolated digit strings, continuous digit strings, and ordinal/cardinal numbers.
Portland Cellular v1.3 (sample)
The Portland Cellular Corpus consists of utterances gathered from callers who were using cellular telephones in the Portland, Oregon area.
Speaker Recognition v1.1 * (sample)
The Speaker Recognition corpus (formerly known as Speaker Verification) consists of telephone speech from 91 participants. Each participant recorded speech in twelve sessions over a two-year period.
Spelled and Spoken Words v1.2 (sample)
The Spelled and Spoken Words corpus consists of spelled and spoken words from over 4000 callers. 1000 callers also recited the English alphabet with pauses between the letters. In addition, a subset of the calls has been phonetically labeled.
SR4X v1.2 (sample)
This corpus is a collection of 36 speakers saying 11 words 6 times on 4 different channels.
Stories v1.2 (sample)
The Stories Corpus is made up of extemporaneous speech collected from English speakers in the CSLU Multi-language Telephone Speech data collection.
The Spoltech Brazilian Portuguese v1.0 (sample)
The Spoltech Brazilian Portuguese corpus consists of prompted sentences and answers to questions, recorded in a number of regions of Brazil. The speech data (8080 utterances) from 477 speakers have been recorded at 44.1 kHz, and there are 2572 orthographic transcriptions and 5507 time-aligned phoneme-level transcriptions. The acoustic environment was not controlled, in order to provide realistic background conditions.
VOICES v1.0 (sample)
The VOICES Corpus contains 12 speakers reading 50 phonetically rich sentences. The recording procedure involved a "mimicking" approach which resulted in a high degree of natural time-alignment between different speakers.
NOTE: The VOICES corpus is available for commercial use via a special licensing agreement, not as part of the standard membership agreement.
Yes/No v1.2 (sample)
The Yes/No Corpus is a collection of answers to yes/no questions from other CSLU corpora.
Corpus Name | Number of Speakers | Number of Speech Files | Duration of Speech Files | Number of Orthographic Non-Time-Aligned Transcriptions | Duration of Orthographic Non-Time-Aligned Transcriptions | Number of Time-Aligned Phonetic Transcription Files | Duration of Time-Aligned Phonetic Transcription Files
22 language | 2068 | 50191 | 98:48:03 | 19758 | 40:23:36 | - | -
* The marked corpora were developed with support from the National Science Foundation.