CORPORA from CSLU: Foreign accented English.
OHSU # 0681-E
Categories:
Inventors:
- CSLU, SOM CSLU
Technology Overview
The Foreign Accented English (FAE) corpus consists of American English utterances by
non-native speakers. The corpus contains 4925 telephone quality utterances from
native speakers of 23 languages. Three independent judgements of accent were
made on each utterance by native American English speakers.
Recording Conditions
The data were collected with the CSLU T1 digital data collection system described in
"Digital Data Collection at CSLU". The sampling rate was 8 kHz and the files
were stored in 8-bit μ-law format.
File Naming Conventions
Each utterance is stored in an individual file, whose name indicates the language and
session number of the caller. For example:
FAR00100.wav
The leading "F" specifies that the file is a part of the FAE corpus. The next two letters, "AR" in this case, indicate the native language of the speaker. The final 5 digits represent the session number that was assigned during recording. The "wav" extension indicates that this is a speech file. If the file has a corresponding information file (see the verification section below), the information file has the same name but with an "inf" extension instead of "wav". The language codes are:
AR Arabic
BP Brazilian Portuguese
CA Cantonese
CZ Czech
FA Farsi
FR French
GE German
HI Hindi
HU Hungarian
IN Indonesian
IT Italian
JA Japanese
KO Korean
MA Mandarin
MY Malay
PO Polish
PP Iberian Portuguese
RU Russian
SD Swedish
SP Spanish
SW Swahili
TA Tamil
VI Vietnamese
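The naming convention above can be checked mechanically. The following is a minimal Python sketch, not part of the corpus distribution; the function name parse_fae_name and the regular expression are illustrative assumptions built only from the description above (a leading "F", a two-letter language code, five digits, and a "wav" or "inf" extension).

```python
import re

# Two-letter language codes, copied from the table above.
LANGUAGES = {
    "AR": "Arabic", "BP": "Brazilian Portuguese", "CA": "Cantonese",
    "CZ": "Czech", "FA": "Farsi", "FR": "French", "GE": "German",
    "HI": "Hindi", "HU": "Hungarian", "IN": "Indonesian", "IT": "Italian",
    "JA": "Japanese", "KO": "Korean", "MA": "Mandarin", "MY": "Malay",
    "PO": "Polish", "PP": "Iberian Portuguese", "RU": "Russian",
    "SD": "Swedish", "SP": "Spanish", "SW": "Swahili", "TA": "Tamil",
    "VI": "Vietnamese",
}

# Assumed pattern: 'F' + language code + five-digit session number + extension.
_NAME = re.compile(r"^F([A-Z]{2})(\d{5})\.(wav|inf)$")


def parse_fae_name(filename):
    """Return (language, session_number, extension) for an FAE file name."""
    match = _NAME.match(filename)
    if match is None or match.group(1) not in LANGUAGES:
        raise ValueError(f"not an FAE file name: {filename!r}")
    code, session, extension = match.groups()
    return LANGUAGES[code], int(session), extension


print(parse_fae_name("FAR00100.wav"))  # ('Arabic', 100, 'wav')
```

Names that do not follow the convention, or that use a code missing from the table, raise ValueError rather than being guessed at.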
Speech File Formats
The speech files in this corpus are stored in the RIFF standard file format,
with 16-bit linear encoding.
Verification
Some of the files in this corpus are also included in the CSLU 22 Language Speech corpus. Those
files have been verified by a native speaker of the language. A variety of
information about the speaker was collected into an "info" file. There are info
files for 1785 of the calls, since native speakers have not yet screened all of
the calls. As an example, these are the contents of AR00145.inf:
145 general dialect bahrain
145 general gender male
145 general age adult
145 general connection good
145 general intelligibility good
The first field is the call number, the second is the comment category (all are
general), the third field names the kind of information being presented,
and the final field is the value of that particular item. Thus this file tells
us that the speaker is an adult male who speaks the Bahrain dialect of Arabic.
We can also see that the connection (line) quality and the speaker's
intelligibility were both good.
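The four-field layout described above is simple to read programmatically. The following is a minimal Python sketch, assuming the fields are separated by whitespace and that any extra words belong to the value field; both are assumptions, and the function name parse_info is hypothetical.

```python
def parse_info(text):
    """Parse info-file text into (call_number, {category: {item: value}})."""
    call_number = None
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split into at most four fields so a multi-word value stays intact
        # (assumption: only the last field may contain spaces).
        call, category, item, value = line.split(None, 3)
        call_number = int(call)
        fields.setdefault(category, {})[item] = value
    return call_number, fields


# The example contents shown above.
sample = """\
145 general dialect bahrain
145 general gender male
145 general age adult
145 general connection good
145 general intelligibility good
"""

call, info = parse_info(sample)
print(call, info["general"]["gender"], info["general"]["dialect"])
```

A line with fewer than four fields raises ValueError, which makes malformed info files easy to spot.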
Accent Judgements
Three native speakers of American English independently listened to each utterance. They made
judgements of the accent on a 4-point scale, according to the following
guidelines.
Negligible/No Accent (1): Not accented at all, or difficult to determine if there is even an accent present.
Mild Accent (2): Accent can be heard through most of the speech, but does not hinder understanding.
Strong Accent (3): The accent is strong in all speech, and makes understanding difficult.
Very Strong Accent (4): Intelligibility is hindered, and multiple listenings were necessary to understand the speaker.
The accent judgements were based solely on the phonetic variation caused by the
foreign language influence. They were not based on improper grammar or word
choice.
Error Checking
A list of all calls which were judged "1" by one judge and "4" by
another was generated, and these conflicts were checked by one of the judges.
During this phase, judges could only change their own incorrect judgements. If a
judge was not available to check their side of a "1/4" conflict, then the
utterance was excluded from the corpus. A total of 29 utterances were excluded
from the corpus for this reason. A "-" in place of an accent judgement means
that the utterance was not heard by that judge.
The judgement information is located in the file called judge.db in the misc/archives/
directory. The file contains one line for each utterance in the corpus, with the
name of the file followed by the three accent judgements. The file format is:
AR00145 3 2 3
This example tells us that judges one and three felt that the speaker had a
strong (3) accent, while judge two felt that the accent was mild (2).
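A line of judge.db can be read, and the "1/4" conflicts described under Error Checking detected, along the following lines. This is a minimal Python sketch, assuming whitespace-separated fields and a "-" for a judgement that was not made; the function names are hypothetical and this is not the tool the corpus authors used.

```python
def parse_judgement_line(line):
    """Return (utterance_name, [j1, j2, j3]); a '-' entry becomes None."""
    name, *scores = line.split()
    if len(scores) != 3:
        raise ValueError(f"expected three judgements: {line!r}")
    return name, [None if s == "-" else int(s) for s in scores]


def has_1_4_conflict(scores):
    """True if one judge gave a 1 and another gave a 4."""
    heard = [s for s in scores if s is not None]
    return 1 in heard and 4 in heard


name, scores = parse_judgement_line("AR00145 3 2 3")
print(name, scores, has_1_4_conflict(scores))  # AR00145 [3, 2, 3] False
```

Missing judgements are kept as None rather than dropped, so the position of each score still identifies the judge.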
Confusion Matrices
We generated the following confusion matrices to show the agreement between the three judges
based on language.
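As a rough illustration of what such a matrix tabulates, the following Python sketch counts, for each pair of judges, how often the first gave rating a while the second gave rating b. It is an assumption about how agreement could be tallied, not the procedure used by CSLU; the function name is hypothetical and the input triples below are illustrative, not corpus data. To obtain per-language tables, one would first group the utterances by their two-letter language code.

```python
from collections import Counter
from itertools import combinations


def pairwise_confusion(judgements):
    """Tally rating pairs for every pair of judges.

    `judgements` is an iterable of three-element lists, with None where a
    judge did not hear the utterance. Returns {(i, j): Counter}, where the
    counter maps (rating_by_judge_i, rating_by_judge_j) to a count.
    """
    tables = {pair: Counter() for pair in combinations(range(3), 2)}
    for scores in judgements:
        for i, j in tables:
            # Skip pairs where either judge gave no judgement.
            if scores[i] is not None and scores[j] is not None:
                tables[(i, j)][(scores[i], scores[j])] += 1
    return tables


# Illustrative input only.
tables = pairwise_confusion([[3, 2, 3], [2, 2, None], [1, 1, 1]])
for pair, counts in tables.items():
    print(pair, dict(counts))
```

Cells on the diagonal (a equal to b) count agreements; everything off the diagonal is a disagreement between that pair of judges.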
Development and Evaluation Sets
The training, development, and testing sets for the FAE Corpus are
defined based on the call number of each of the files. The training set contains
60% of the data while the other two sets contain 20% each. A simple mechanism of
using the call number modulo 5 is used to determine the set that a file belongs
to. The following table summarizes this.
mod 5 Set
0 Development
1,2,3 Training
4 Test
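The table above translates directly into code. The following is a minimal Python sketch; the function name fae_set is hypothetical, and it assumes the call number is the integer given by the five digits in the file name.

```python
def fae_set(call_number):
    """Map a call number to its set using the modulo-5 rule."""
    remainder = call_number % 5
    if remainder == 0:
        return "Development"
    if remainder == 4:
        return "Test"
    return "Training"  # remainders 1, 2 and 3


for call in (145, 146, 149):
    print(call, fae_set(call))
```

Because three of the five remainders map to Training, the rule gives roughly the 60/20/20 proportions stated above when call numbers are spread evenly.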
The Center for Spoken Language Understanding (CSLU) distributes corpora to commercial entities and academic institutions for a fee. Commercial entities can use these corpora for research and also for creating commercial products, for example by generating acoustic models for speech recognition.
To place your order:
1. Click on the type of license you wish to order: Academic or non-profit entity or Commercial entity.
2. Terms of the license agreement can be viewed by clicking on the word "terms".
3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.
4. If information on the "Order Contents" screen is correct, press "Check out".
5. On the next screen, a brief "Intended Use" is required. For "Recipient Scientist Information", enter your own information or, if you are placing the order for another person, that person's information. We will use this information should we have questions about the order, payment or shipping address.
6. Once your payment has been received and verified by OHSU, your order will be approved by Technology Transfer & Business Development and then the DVD will be sent out by the Center for Spoken Language Understanding by FedEx within 5-10 business days.
For demos and more information, visit the CSLU Corpora website at:
http://www.cslu.ogi.edu/corpora/corpCurrent.html
For more information, contact:
Michele Gunness
Senior Technology Development Manager
503-494-4184
