CORPORA from CSLU: Foreign accented English
OHSU # 0681-E
The Foreign Accented English (FAE) corpus consists of American English utterances by non-native speakers. The corpus contains 4925 telephone quality utterances from native speakers of 23 languages. Three independent judgements of accent were made on each utterance by native American English speakers.
The data were collected with the CSLU T1 digital data collection system described in "Digital Data Collection at CSLU". The sampling rate was 8 kHz and the files were stored in 8-bit μ-law format.
File Naming Conventions:
Each utterance is stored in an individual file, whose name indicates the language and session number of the caller. For example: FAR00100.wav
The leading 'F' specifies that the file is a part of the FAE corpus. The next two letters, "AR" in this case, indicate the native language of the speaker. The final 5 digits represent the session number that was assigned during recording. The "wav" extension indicates that this is a speech file. If the file has a corresponding information file (see the verification section below) the file will be named the same but with an "inf" extension instead of "wav".
BP Brazilian Portuguese
PP Iberian Portuguese
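The naming convention above can be sketched in code. This is a minimal illustration, not a CSLU tool: the function name `parse_fae_name` and the regular expression are assumptions derived only from the description (a leading 'F', a two-letter language code, five digits, and a "wav" or "inf" extension).

```python
import re
from typing import NamedTuple

# Pattern inferred from the description above (an assumption, not an official
# specification): 'F' + two-letter language code + five-digit session number
# + "wav" or "inf" extension.
FAE_NAME = re.compile(r"^F([A-Z]{2})(\d{5})\.(wav|inf)$")


class FaeName(NamedTuple):
    language: str   # two-letter native-language code, e.g. "AR"
    session: int    # session number assigned during recording
    extension: str  # "wav" for speech, "inf" for speaker information


def parse_fae_name(filename: str) -> FaeName:
    """Split an FAE file name into language code, session number, extension."""
    match = FAE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not an FAE file name: {filename!r}")
    return FaeName(match.group(1), int(match.group(2)), match.group(3))


if __name__ == "__main__":
    print(parse_fae_name("FAR00100.wav"))
```

Returning the session number as an integer makes it convenient for arithmetic such as the modulo-based set assignment; keep the original string if the zero padding matters.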
Speech File Formats:
The speech files in this corpus are stored in the RIFF standard file format. This file format is 16-bit linearly encoded.
Verification:
Some of the files in this corpus are also included in the CSLU 22 Language Speech corpus. Those files have been verified by a native speaker of the language. A variety of information about the speaker was collected into an "info" file. There are info files for 1785 of the calls, since native speakers have not yet screened all of the calls. As an example, these are the contents of AR00145.inf:
145 general dialect bahrain
145 general gender male
145 general age adult
145 general connection good
145 general intelligibility good
The first field is the call number, the second is the comment category (all are general), the third field names the type of information being presented, and the final field is the value of that item. Thus this file tells us that the speaker is an adult male who speaks the Bahrain dialect of Arabic. We can also see that the connection (line) quality and the speaker intelligibility were both good.
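As an illustration of the four-field layout, the following sketch reads the contents of an info file into a nested dictionary. The function name `parse_info` is hypothetical, and the code assumes that fields are separated by whitespace and that only the last field may contain spaces; neither point is stated in the corpus documentation.

```python
def parse_info(text: str) -> dict:
    """Parse info-file contents into {category: {item: value}}.

    Assumes whitespace-separated fields: call number, category, item, value.
    """
    info: dict = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split into at most four fields so a multi-word value stays intact.
        # A line with fewer than four fields raises ValueError here.
        _call, category, item, value = line.split(None, 3)
        info.setdefault(category, {})[item] = value
    return info


if __name__ == "__main__":
    sample = (
        "145 general dialect bahrain\n"
        "145 general gender male\n"
        "145 general age adult\n"
        "145 general connection good\n"
        "145 general intelligibility good\n"
    )
    print(parse_info(sample)["general"])
```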
Three native speakers of American English independently listened to each utterance. They made judgements of the accent on a 4-point scale, according to the following guidelines.
1. Negligible/No Accent: Not accented at all, or difficult to determine if there is even an accent present.
2. Mild Accent: Accent can be heard through most of the speech, but does not hinder understanding.
3. Strong Accent: The accent is strong in all speech, and makes understanding difficult.
4. Very Strong Accent: Intelligibility is hindered, and multiple listenings were necessary to understand the speaker.
The accent judgements were based solely on the phonetic variation caused by the foreign language influence. They were not based on improper grammar or word choice.
A list of all calls which were judged "1" by one judge and "4" by another was generated, and these conflicts were checked by one of the judges. During this phase, judges could only change their own incorrect judgements. If a judge was not available to check their side of a "1/4" conflict, then the utterance was excluded from the corpus. A total of 29 utterances were excluded from the corpus for this reason. A "-" in place of an accent judgement means that the utterance was not heard by that judge.
The judgement information is located in the file called judge.db in the misc/archives/ directory. The file contains one line for each utterance in the corpus, with the three accent judgements and the name of the file. The file format is:
AR00145 3 2 3
This example tells us that judges one and three felt that the speaker had a strong (3) accent, while judge two felt that the accent was mild (2).
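A line of judge.db in the format shown above could be read as follows. This is only a sketch: the function names are hypothetical, and the code assumes whitespace-separated fields with the file name first, as in the example line. A "-" is mapped to `None`, and a small helper flags the "1/4" conflicts described earlier.

```python
from typing import List, Optional, Tuple


def parse_judgement_line(line: str) -> Tuple[str, List[Optional[int]]]:
    """Return (file name, [judge 1, judge 2, judge 3]) for one judge.db line.

    A "-" (utterance not heard by that judge) becomes None.
    """
    name, *scores = line.split()
    if len(scores) != 3:
        raise ValueError(f"expected three judgements, got: {line!r}")
    return name, [None if s == "-" else int(s) for s in scores]


def has_extreme_conflict(scores: List[Optional[int]]) -> bool:
    """True when one judge gave a 1 and another gave a 4."""
    heard = [s for s in scores if s is not None]
    return 1 in heard and 4 in heard


if __name__ == "__main__":
    name, scores = parse_judgement_line("AR00145 3 2 3")
    print(name, scores, has_extreme_conflict(scores))
```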
We generated the following confusion matrices to show the agreement between the three judges based on language.
Development and Evaluation Sets:
The training, development, and testing sets for the FAE Corpus are defined based on the call number of each of the files. The training set contains 60% of the data while the other two sets contain 20% each. A simple mechanism of using the call number modulo 5 is used to determine the set that a file belongs to. The following table summarizes this.
mod 5 Set
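The modulo rule can be sketched as a lookup from the remainder to a set name. The function `assign_set` is hypothetical, and the mapping in the example is a placeholder chosen only to give the 60/20/20 proportions; the actual assignment of remainders to sets must be taken from the corpus table.

```python
from typing import Mapping


def assign_set(call_number: int, residue_to_set: Mapping[int, str]) -> str:
    """Return the data set for a call, given a map from (call number mod 5)
    to a set name."""
    return residue_to_set[call_number % 5]


# HYPOTHETICAL mapping for illustration only. Three remainders go to training
# (60%) and one each to development and testing (20% each), but which
# remainder belongs to which set is an assumption, not the corpus definition.
EXAMPLE_MAPPING = {0: "train", 1: "train", 2: "train", 3: "dev", 4: "test"}

if __name__ == "__main__":
    for call in (145, 103, 104):
        print(call, assign_set(call, EXAMPLE_MAPPING))
```

Passing the mapping as an argument keeps the rule itself separate from the specific assignment, so the placeholder can be replaced with the documented table without changing the function.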
The Center for Spoken Language Understanding (CSLU) distributes corpora to commercial entities and academic institutions for a fee. Commercial entities can use these corpora for research and also for creating commercial products, for example by generating acoustic models for speech recognition.
To place your order:
1. Click on the type of license you wish to order: Academic or non-profit entity or Commercial entity.
2. Terms of the license agreement can be viewed by clicking on the word "terms".
3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.
4. If information on the "Order Contents" screen is correct, press "Check out".
5. On the next screen, a brief "Intended Use" is required. For "Recipient Scientist Information", enter your own information or, if you are placing the order for another person, that person's information. We will use this information should we have questions about the order, payment, or shipping address.
6. Once your payment has been received and verified by OHSU, your order will be approved by Technology Transfer & Business Development and then the DVD will be sent out by the Center for Spoken Language Understanding by FedEx within 5-10 business days.