Foreign Accented English v1.2

The Foreign Accented English (FAE) corpus consists of American English utterances by non-native speakers. The corpus contains 4925 telephone quality utterances from native speakers of 23 languages. Three independent judgements of accent were made on each utterance by native American English speakers.

Recording Conditions

The data were collected with the CSLU T1 digital data collection system described in "Digital Data Collection at CSLU". The sampling rate was 8 kHz and the files were stored in 8-bit μ-law format.

File Naming Conventions

Each utterance is stored in an individual file, whose name indicates the language and session number of the caller. For example:
Speech File Formats

The speech files in this corpus are stored in the RIFF standard file format. This file format is 16-bit linearly encoded.

Verification

Some of the files in this corpus are also included in the CSLU 22 Language Speech corpus. Those files have been verified by a native speaker of the language.

A variety of information about the speaker was collected into an "info" file. There are info files for 1785 of the calls, since native speakers have not yet screened all of the calls. As an example, these are the contents of AR00145.inf:

    145 general dialect bahrain
    145 general gender male
    145 general age adult
    145 general connection good
    145 general intelligibility good

The first field is the call number, the second is the comment category (all are general), the third field contains the variety of information being presented, and the final field is the value of that particular item. Thus this file tells us that the speaker is an adult male who speaks the Bahrain dialect of Arabic. We can also see that the level of connection (line) quality and speaker intelligibility were good.

Accent Judgements

Three native speakers of American English independently listened to each utterance. They made judgements of the accent on a 4-point scale, according to the following guidelines.
The accent judgements were based solely on the phonetic variation caused by the foreign language influence. They were not based on improper grammar or word choice.

Error Checking

A list of all calls that were judged "1" by one judge and "4" by another was generated, and these conflicts were checked by one of the judges. During this phase, judges could only change their own incorrect judgements. If a judge was not available to check their side of a "1/4" conflict, then the utterance was excluded from the corpus. A total of 29 utterances were excluded from the corpus for this reason. If an utterance has a "-" for its accent judgement, then it was not heard by that judge.

The judgement information is located in the file called judge.db in the misc/archives/ directory. The file contains one line for each utterance in the corpus, with the name of the file followed by the three accent judgements. The file format is:

    AR00145 3 2 3

Confusion Matrices

We generated the following confusion matrices to show the agreement between the three judges based on language.

Development and Evaluation Sets

The training, development, and testing sets for the FAE Corpus are defined based on the call number of each of the files. The training set contains 60% of the data while the other two sets contain 20% each. A simple mechanism of using the call number modulo 5 is used to determine the set that a file belongs to. The following table summarizes this.
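The info file lines, the judge.db lines, and the modulo-5 rule described above can be read with a few lines of Python. The sketch below is only illustrative and is not part of the corpus distribution. All function names are hypothetical, the file name is assumed to be a language code followed by the call number (as in AR00145), and the mapping from remainder to set in ASSUMED_SPLIT is an assumption chosen only to match the stated 60/20/20 proportions. Check it against the corpus documentation before relying on it.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_info_line(line: str) -> Tuple[int, str, str, str]:
    """Split one info-file line into (call number, category, item, value)."""
    call, category, item, value = line.split(None, 3)
    return int(call), category, item, value.strip()


def parse_judgement_line(line: str) -> Tuple[str, List[Optional[int]]]:
    """Split one judge.db line into (file name, three accent judgements).

    A "-" means the utterance was not heard by that judge and becomes None.
    """
    name, *scores = line.split()
    if len(scores) != 3:
        raise ValueError(f"expected three judgements, got {len(scores)}")
    return name, [None if s == "-" else int(s) for s in scores]


def split_name(name: str) -> Tuple[str, int]:
    """Split a file name such as "AR00145" into ("AR", 145).

    ASSUMPTION: names are letters (language code) followed by digits.
    """
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return match.group(1), int(match.group(2))


# ASSUMPTION: the real remainder-to-set table is not reproduced here.
# Three remainders for training and one each for the other sets gives
# the 60/20/20 proportions stated in the text.
ASSUMED_SPLIT: Dict[int, str] = {
    0: "train", 1: "train", 2: "train", 3: "dev", 4: "test",
}


def assign_set(name: str, mapping: Dict[int, str] = ASSUMED_SPLIT) -> str:
    """Return the data set for a file using call number modulo 5."""
    _, call = split_name(name)
    return mapping[call % 5]


if __name__ == "__main__":
    print(parse_info_line("145 general dialect bahrain"))
    print(parse_judgement_line("AR00145 3 2 3"))
    print(assign_set("AR00145"))
```

Passing the mapping as a parameter keeps the modulo logic separate from the specific remainder assignment, so the assumed table can be swapped for the documented one without touching the rest of the code.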