CORPORA from CSLU: Foreign Accented English


OHSU # 0681-E


The Foreign Accented English (FAE) corpus consists of American English utterances by non-native speakers. The corpus contains 4925 telephone quality utterances from native speakers of 23 languages. Three independent judgements of accent were made on each utterance by native American English speakers.

Recording Conditions:
The data were collected with the CSLU T1 digital data collection system described in "Digital Data Collection at CSLU". The sampling rate was 8 kHz and the files were stored in 8-bit μ-law format.

File Naming Conventions:
Each utterance is stored in an individual file, whose name indicates the language and session number of the caller. For example: FAR00100.wav


The leading 'F' specifies that the file is part of the FAE corpus. The next two letters, "AR" in this case, indicate the native language of the speaker. The final five digits are the session number that was assigned during recording. The "wav" extension indicates that this is a speech file. If the file has a corresponding information file (see the verification section below), that file has the same name but with an "inf" extension instead of "wav". The language codes are:


Code   Language
AR     Arabic
BP     Brazilian Portuguese
CA     Cantonese
CZ     Czech
FA     Farsi
FR     French
GE     German
HI     Hindi
HU     Hungarian
IN     Indonesian
IT     Italian
JA     Japanese
KO     Korean
MA     Mandarin
MY     Malay
PO     Polish
PP     Iberian Portuguese
RU     Russian
SD     Swedish
SP     Spanish
SW     Swahili
TA     Tamil
VI     Vietnamese
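
The naming convention can be unpacked mechanically. The following is a minimal Python sketch and is not part of the corpus distribution: the function name parse_fae_name and the LANGUAGES dictionary are illustrative, and the pattern assumes that every name follows the layout described above ('F', a two-letter language code, a five-digit session number).

```python
import re
from pathlib import PurePath

# Two-letter language codes as listed in this document.
LANGUAGES = {
    "AR": "Arabic", "BP": "Brazilian Portuguese", "CA": "Cantonese",
    "CZ": "Czech", "FA": "Farsi", "FR": "French", "GE": "German",
    "HI": "Hindi", "HU": "Hungarian", "IN": "Indonesian", "IT": "Italian",
    "JA": "Japanese", "KO": "Korean", "MA": "Mandarin", "MY": "Malay",
    "PO": "Polish", "PP": "Iberian Portuguese", "RU": "Russian",
    "SD": "Swedish", "SP": "Spanish", "SW": "Swahili", "TA": "Tamil",
    "VI": "Vietnamese",
}

# 'F' + two-letter language code + five-digit session number.
_NAME_RE = re.compile(r"^F([A-Z]{2})(\d{5})$")


def parse_fae_name(filename):
    """Return (language, session_number, extension) for an FAE file name."""
    path = PurePath(filename)
    match = _NAME_RE.match(path.stem.upper())
    if match is None:
        raise ValueError(f"not an FAE file name: {filename!r}")
    code, session = match.groups()
    if code not in LANGUAGES:
        raise ValueError(f"unknown language code {code!r} in {filename!r}")
    extension = path.suffix.lstrip(".").lower()
    return LANGUAGES[code], int(session), extension
```

With this sketch, parse_fae_name("FAR00100.wav") yields ("Arabic", 100, "wav"). Names that lack the leading 'F' are rejected rather than guessed at.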

Speech File Formats:
The speech files in this corpus are stored in the RIFF standard file format. This file format is 16-bit linearly encoded.
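
Since the files are RIFF containers holding 16-bit linear samples, a general-purpose WAV reader should be able to load them. The sketch below uses Python's standard wave module. The function name read_utterance is illustrative, and the code assumes single-channel 16-bit PCM data, which this document implies but does not state outright.

```python
import array
import sys
import wave


def read_utterance(path):
    """Return (sample_rate_hz, samples) for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # RIFF sample data is little-endian
    return rate, samples
```

If a file turns out not to contain plain PCM data, wave.open raises wave.Error instead of returning samples.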

Verification:
Some of the files in this corpus are also included in the CSLU 22 Language Speech corpus. Those files have been verified by a native speaker of the language, and information about the speaker was collected into an information ("inf") file. Information files exist for only 1785 of the calls, because native speakers have not yet screened all of the calls. As an example, these are the contents of AR00145.inf:


    145 general dialect bahrain
    145 general gender male
    145 general age adult
    145 general connection good
    145 general intelligibility good


The first field is the call number, the second is the comment category (all are "general"), the third names the item being reported, and the final field is the value of that item. Thus this file tells us that the speaker is an adult male who speaks the Bahrain dialect of Arabic. We can also see that the connection (line) quality and the speaker's intelligibility were both judged to be good.
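
The four-field layout lends itself to a simple parser. The following Python sketch is illustrative only: parse_info is a hypothetical name, and the code assumes that fields are separated by whitespace and that only the final field may contain spaces.

```python
def parse_info(text):
    """Parse the contents of an information file into {category: {item: value}}."""
    info = {}
    for line in text.splitlines():
        # Split into at most four fields; leading indentation is ignored.
        fields = line.split(None, 3)
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got: {line!r}")
        _call_number, category, item, value = fields
        info.setdefault(category, {})[item] = value.strip()
    return info
```

Applied to the example contents shown above, parse_info(...)["general"]["dialect"] gives "bahrain".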

Accent Judgements:
Three native speakers of American English independently listened to each utterance. They made judgements of the accent on a 4-point scale, according to the following guidelines.

1. Negligible/No Accent: Not accented at all, or difficult to determine if there is even an accent present.
2. Mild Accent: Accent can be heard through most of the speech, but does not hinder understanding.
3. Strong Accent: The accent is strong in all speech, and makes understanding difficult.
4. Very Strong Accent: Intelligibility is hindered, and multiple listenings were necessary to understand the speaker.

The accent judgements were based solely on the phonetic variation caused by the foreign language influence. They were not based on improper grammar or word choice.

Error Checking:
A list was generated of all calls that were judged "1" by one judge and "4" by another, and these conflicts were checked by one of the judges. During this phase, judges could change only their own incorrect judgements. If a judge was not available to check their side of a "1/4" conflict, the utterance was excluded from the corpus. A total of 29 utterances were excluded for this reason. A "-" in place of an accent judgement means that the utterance was not heard by that judge.
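
The "1/4" conflict test is easy to state in code. This is a minimal Python sketch of the check as described here, not the script used to build the corpus. The name is_extreme_conflict is hypothetical, and the sketch assumes a missing judgement ("-") is represented as None.

```python
def is_extreme_conflict(judgements):
    """Return True if one judge gave 1 and another gave 4.

    `judgements` is a sequence of integers 1-4, with None standing in
    for a judge who did not hear the utterance.
    """
    heard = [j for j in judgements if j is not None]
    return 1 in heard and 4 in heard
```

For example, the judgements (1, 3, 4) count as a conflict, while (1, 2, None) and (3, 2, 3) do not.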

The judgement information is located in the file called judge.db in the misc/archives/ directory. The file contains one line for each utterance in the corpus, giving the name of the file followed by the three accent judgements. For example:


AR00145        3       2       3


This example tells us that judges one and three felt that the speaker had a strong (3) accent, while judge two felt that the accent was mild (2).
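
A line of judge.db can be read as follows. The Python sketch below is an illustration under stated assumptions: parse_judgement_line is a hypothetical name, the fields are assumed to be whitespace-separated with the name first, and a "-" is mapped to None.

```python
def parse_judgement_line(line):
    """Return (utterance_name, [judge1, judge2, judge3]) for one judge.db line."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected a name and 3 judgements, got: {line!r}")
    name, *raw_scores = fields
    scores = []
    for token in raw_scores:
        if token == "-":
            scores.append(None)  # this judge did not hear the utterance
            continue
        score = int(token)
        if not 1 <= score <= 4:
            raise ValueError(f"judgement out of range 1-4: {token!r}")
        scores.append(score)
    return name, scores
```

Applied to the example line above, the sketch returns ("AR00145", [3, 2, 3]).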

Confusion Matrices:
We generated the following confusion matrices to show the agreement between the three judges based on language.

Development and Evaluation Sets:
The training, development, and test sets for the FAE corpus are defined by the call number of each file. The training set contains 60% of the data, while the other two sets contain 20% each. The call number modulo 5 determines the set that a file belongs to, as summarized in the following table.


mod 5      Set
0          Development
1, 2, 3    Training
4          Test
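
The table translates directly into code. The following Python sketch shows the assignment rule; the function name assign_set and the lowercase set labels are illustrative choices, not part of the corpus.

```python
def assign_set(call_number):
    """Map a call number to its data set using the call number modulo 5."""
    remainder = call_number % 5
    if remainder == 0:
        return "development"
    if remainder == 4:
        return "test"
    return "training"  # remainders 1, 2, and 3
```

For instance, call number 145 has remainder 0 and therefore falls in the development set. Over any run of consecutive call numbers the rule gives the 60/20/20 split described above, although the actual proportions depend on which call numbers are present in the corpus.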


