Foreign Accented English v1.2

The Foreign Accented English (FAE) corpus consists of American English utterances by non-native speakers. The corpus contains 4925 telephone-quality utterances from native speakers of 23 languages. Native speakers of American English made three independent judgements of accent on each utterance.

Recording Conditions

The data were collected with the CSLU T1 digital data collection system described in "Digital Data Collection at CSLU". The sampling rate was 8 kHz and the files were stored in 8-bit μ-law format.

File Naming Conventions

Each utterance is stored in an individual file, whose name indicates the language and session number of the caller. For example:
  • FAR00100.wav
The leading 'F' specifies that the file is part of the FAE corpus. The next two letters, "AR" in this case, indicate the native language of the speaker. The final 5 digits represent the session number that was assigned during recording. The "wav" extension indicates that this is a speech file. If an utterance has a corresponding information file (see the Verification section below), that file has the same name but with an "inf" extension instead of "wav".

Code  Language
AR    Arabic
BP    Brazilian Portuguese
CA    Cantonese
CZ    Czech
FA    Farsi
FR    French
GE    German
HI    Hindi
HU    Hungarian
IN    Indonesian
IT    Italian
JA    Japanese
KO    Korean
MA    Mandarin
MY    Malay
PO    Polish
PP    Iberian Portuguese
RU    Russian
SD    Swedish
SP    Spanish
SW    Swahili
TA    Tamil
VI    Vietnamese
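
As an illustration of the naming convention, the following Python sketch splits a file name into its language code and session number. The names parse_fae_name and FaeName are hypothetical helpers, not part of the corpus distribution, and the sketch assumes upper-case names exactly as in the example above.

```python
import re
from typing import NamedTuple

# Two-letter language codes, copied from the table above.
LANGUAGE_CODES = {
    "AR": "Arabic", "BP": "Brazilian Portuguese", "CA": "Cantonese",
    "CZ": "Czech", "FA": "Farsi", "FR": "French", "GE": "German",
    "HI": "Hindi", "HU": "Hungarian", "IN": "Indonesian", "IT": "Italian",
    "JA": "Japanese", "KO": "Korean", "MA": "Mandarin", "MY": "Malay",
    "PO": "Polish", "PP": "Iberian Portuguese", "RU": "Russian",
    "SD": "Swedish", "SP": "Spanish", "SW": "Swahili", "TA": "Tamil",
    "VI": "Vietnamese",
}


class FaeName(NamedTuple):
    """Hypothetical container for the parts of an FAE file name."""
    language_code: str
    language: str
    session: int
    extension: str


# 'F', two letters for the language, five digits, then "wav" or "inf".
_NAME_RE = re.compile(r"^F([A-Z]{2})(\d{5})\.(wav|inf)$")


def parse_fae_name(filename: str) -> FaeName:
    """Split a name such as 'FAR00100.wav' into its documented parts."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not an FAE file name: {filename!r}")
    code, digits, extension = match.groups()
    if code not in LANGUAGE_CODES:
        raise ValueError(f"unknown language code: {code!r}")
    return FaeName(code, LANGUAGE_CODES[code], int(digits), extension)
```

For the example above, parse_fae_name("FAR00100.wav") yields the language code "AR", the language "Arabic", and session number 100.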


Speech File Formats

The speech files in this corpus are stored in the RIFF standard file format, with samples encoded as 16-bit linear values.
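
One way to check the stated format is to read the header with Python's standard wave module. This is only a sketch: it assumes the corpus files carry a standard RIFF/WAVE PCM header that the wave module accepts, which this document does not confirm. The function name wav_summary is hypothetical, and the demonstration uses a synthetic in-memory file rather than corpus data.

```python
import io
import wave


def wav_summary(source):
    """Return basic header fields of a RIFF/WAVE file (path or file object)."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),  # 2 means 16-bit samples
            "sample_rate_hz": wav.getframerate(),
            "frames": wav.getnframes(),
        }


# Synthetic stand-in for a corpus file: one second of 16-bit silence at 8 kHz.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(b"\x00\x00" * 8000)
buffer.seek(0)

summary = wav_summary(buffer)
```

A file matching the description in this section would report a sample width of 2 bytes.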

Verification

Some of the files in this corpus are also included in the CSLU 22 Language Speech corpus. Those files have been verified by a native speaker of the language. A variety of information about the speaker was collected into an "info" file. Info files exist for only 1785 of the calls, because native speakers have not yet screened all of the calls. As an example, these are the contents of AR00145.inf:
		145 general dialect bahrain
		145 general gender male
		145 general age adult
		145 general connection good
		145 general intelligibility good
	
The first field is the call number, the second is the comment category (all are general), the third field names the type of information being presented, and the final field is the value of that item. Thus this file tells us that the speaker is an adult male who speaks the Bahrain dialect of Arabic. We can also see that the connection (line) quality and speaker intelligibility were both good.
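
The four-field layout described above can be read with a few lines of Python. This is a minimal sketch, assuming whitespace-separated fields and that only the last field may contain spaces. The function name parse_info is hypothetical.

```python
from typing import Dict, Tuple


def parse_info(text: str) -> Tuple[int, Dict[str, str]]:
    """Parse info-file text into (call number, {item: value}).

    The category field is read but not stored, since the document
    states that all categories are "general".
    """
    call_number = None
    items: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank or whitespace-only lines
        # Split into at most four fields; raises ValueError if fewer.
        call, _category, item, value = line.split(None, 3)
        if call_number is None:
            call_number = int(call)
        elif int(call) != call_number:
            raise ValueError("mixed call numbers in one info file")
        items[item] = value
    if call_number is None:
        raise ValueError("no records found")
    return call_number, items


# The example contents shown above.
SAMPLE = """
    145 general dialect bahrain
    145 general gender male
    145 general age adult
    145 general connection good
    145 general intelligibility good
"""

call_number, items = parse_info(SAMPLE)
```

Here call_number is 145 and items maps "dialect" to "bahrain", "gender" to "male", and so on.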

Accent Judgements

Three native speakers of American English independently listened to each utterance. They made judgements of the accent on a 4-point scale, according to the following guidelines.
  1. Negligible/No Accent: Not accented at all, or difficult to determine if there is even an accent present.
  2. Mild Accent: Accent can be heard through most of the speech, but does not hinder understanding.
  3. Strong Accent: The accent is strong in all speech, and makes understanding difficult.
  4. Very Strong Accent: Intelligibility is hindered, and multiple listenings were necessary to understand the speaker.


The accent judgements were based solely on the phonetic variation caused by the foreign language influence. They were not based on improper grammar or word choice.

Error Checking

A list of all calls that were judged "1" by one judge and "4" by another was generated, and these conflicts were checked by one of the judges. During this phase, judges could only change their own incorrect judgements. If a judge was not available to check their side of a "1/4" conflict, then the utterance was excluded from the corpus. A total of 29 utterances were excluded from the corpus for this reason. If an utterance has a "-" for one of its accent judgements, then it was not heard by that judge.
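
The "1/4" conflict test can be sketched in Python as follows. This is an illustration of the rule as described, not the script that was actually used. The function name is hypothetical, and a judgement of "-" (not heard) is represented here as None.

```python
from typing import Optional, Sequence


def is_extreme_conflict(judgements: Sequence[Optional[int]]) -> bool:
    """True if one judge gave a 1 and another gave a 4.

    Judgements of None (utterance not heard by that judge) are ignored.
    """
    heard = [score for score in judgements if score is not None]
    return 1 in heard and 4 in heard
```

For example, the judgements (1, 2, 4) form a conflict, while (3, 2, 3) and (1, None, 2) do not.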

The judgement information is located in the file called judge.db in the misc/archives/ directory. The file contains one line for each utterance in the corpus, giving the name of the file followed by the three accent judgements. Each line has the following format:
AR00145        3       2       3
		
This example tells us that judges one and three felt that the speaker had a strong (3) accent, while judge two felt that the accent was mild (2).
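
A line of judge.db in this layout could be parsed as in the Python sketch below. It assumes whitespace-separated fields with the file name first, as in the example line, and maps a "-" judgement to None. The function name parse_judge_line is hypothetical.

```python
from typing import Optional, Tuple

Judgements = Tuple[Optional[int], Optional[int], Optional[int]]


def parse_judge_line(line: str) -> Tuple[str, Judgements]:
    """Parse one judge.db line into (name, (judge1, judge2, judge3))."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected a name and three judgements: {line!r}")
    name, *raw_scores = fields
    scores = []
    for raw in raw_scores:
        if raw == "-":
            scores.append(None)  # utterance not heard by this judge
            continue
        score = int(raw)
        if not 1 <= score <= 4:
            raise ValueError(f"judgement outside the 4-point scale: {raw!r}")
        scores.append(score)
    return name, (scores[0], scores[1], scores[2])


name, scores = parse_judge_line("AR00145        3       2       3")
```

For the example line, name is "AR00145" and scores is (3, 2, 3).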

Confusion Matrices

We generated the following confusion matrices to show the agreement between the three judges based on language.

Development and Evaluation Sets

The training, development, and testing sets for the FAE Corpus are defined based on the call number of each of the files. The training set contains 60% of the data while the other two sets contain 20% each. A simple mechanism of using the call number modulo 5 is used to determine the set that a file belongs to. The following table summarizes this.
Call number mod 5    Set
0                    Development
1, 2, 3              Training
4                    Test
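
The modulo-5 rule in the table can be written as a short Python function. The name fae_split and the lower-case set labels are choices made for this sketch only.

```python
def fae_split(call_number: int) -> str:
    """Return the data set for a call number, using call number modulo 5."""
    remainder = call_number % 5
    if remainder == 0:
        return "development"
    if remainder == 4:
        return "test"
    return "training"  # remainders 1, 2 and 3
```

For example, call number 145 gives a remainder of 0 and so falls in the development set.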