CORPORA from CSLU: 22 Language.
OHSU # 0681-A
Categories:
Inventors:
- CSLU, SOM CSLU
Technology Overview
The 22 Language corpus consists of telephone speech from 22 languages: Eastern Arabic, Cantonese, Czech, Farsi, German, Hindi, Hungarian, Japanese, Korean, Malay, Mandarin, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Swahili, Tamil, Vietnamese, and English. Unfortunately, French is not available. The corpus contains fixed-vocabulary utterances (e.g., days of the week) as well as fluent continuous speech. We were expecting at least 300 callers in each language. Each utterance is verified by a native speaker to determine whether the caller followed instructions when answering the prompts. Some of the calls in each language are transcribed orthographically.
Recording Details
All of the data in this corpus were collected over digital telephone lines. The digital data were recorded with the CSLU T1 digital data collection system. These files were sampled at 8 kHz with 8 bits per sample and stored as ulaw (μ-law) files.
All of the wave files were converted to RIFF format with 16-bit linear coding.
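The step from 8-bit ulaw to 16-bit linear samples can be illustrated with the standard G.711 μ-law expansion. The sketch below is only an illustration of that expansion and is not the tool CSLU used; the function names `ulaw_byte_to_linear` and `ulaw_to_pcm16` are hypothetical. The files in the distribution are already 16-bit linear, so this decoding is not needed to use them.

```python
import struct

BIAS = 0x84  # bias (132) added by the G.711 mu-law encoder


def ulaw_byte_to_linear(u: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample."""
    u = ~u & 0xFF                # mu-law codes are stored bit-inverted
    sign = u & 0x80              # top bit: sign
    exponent = (u >> 4) & 0x07   # next three bits: segment number
    mantissa = u & 0x0F          # low four bits: step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def ulaw_to_pcm16(data: bytes) -> bytes:
    """Decode a mu-law byte string to little-endian 16-bit linear PCM."""
    samples = [ulaw_byte_to_linear(b) for b in data]
    return struct.pack("<%dh" % len(samples), *samples)
```

Each input byte becomes two output bytes, which is why the converted files are twice the size of the raw ulaw recordings, plus the RIFF header.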
Directory Structure
There are several top-level directories in this distribution: docs, labels, misc, speech, trans.
The speech directory contains the speech data files. Each speech filename has the following structure:
For example:
EN-105.nlang.wav
This utterance is from English speaker 105 and contains the answer to the question "What is your native language?".
As a participant proceeds through the data collection protocol, they are asked a series of questions. Each response is stored as a separate speech file. The utterance type code relates the recorded utterance to the protocol questions. The description of the protocol shows all of the utterance codes.
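A filename such as EN-105.nlang.wav can be split into its parts programmatically. The sketch below assumes the pattern inferred from that single example: a language code, a hyphen, a numeric call number, a dot, the utterance type code, and the .wav extension. The names `SpeechFile` and `parse_speech_filename` are hypothetical, and the pattern may need adjusting against the actual files in the distribution.

```python
import re
from typing import NamedTuple


class SpeechFile(NamedTuple):
    language: str        # e.g. "EN"
    call_number: int     # e.g. 105
    utterance_code: str  # e.g. "nlang"


# Assumed layout, inferred from the example EN-105.nlang.wav.
_PATTERN = re.compile(r"^([A-Za-z]+)-(\d+)\.([^.]+)\.wav$")


def parse_speech_filename(name: str) -> SpeechFile:
    """Split a speech filename into language, call number and utterance code."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized speech filename: {name!r}")
    language, call_number, utterance_code = match.groups()
    return SpeechFile(language, int(call_number), utterance_code)
```

For the example above, `parse_speech_filename("EN-105.nlang.wav")` yields the language code `EN`, the call number `105`, and the utterance code `nlang`.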
These audio and text files are subdivided into directories based on their call number mod 10. So, these files would be found in /speech/10.
Verification
Each utterance included in the 22 Language Corpus has gone through a process of verification, carried out by native speakers of each language. The verifiers were asked to listen to each utterance and decide if the speaker responded appropriately to the prompt. In addition, the verifiers made judgements about the age, gender, and dialect of each speaker.
Two native speakers verified the utterances in each language independently. Subsequently, they reexamined each utterance for which there was disagreement and produced an info file containing the 'resolved' judgements. Note: we resolved differences in Spanish, Vietnamese and Swahili by choosing the judgements of the verifier whose responses were overwhelmingly correct. For the other languages in the corpus we resolved every disagreement by hand.
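The first step of this resolution process, finding the utterances on which the two verifiers disagreed, can be sketched as follows. This is a minimal illustration only: the dictionary layout, the judgement labels "ok" and "bad", the utterance code "days", and the function name `find_disagreements` are all hypothetical and do not reflect the format of the info files in the release.

```python
from typing import Dict, List


def find_disagreements(first: Dict[str, str], second: Dict[str, str]) -> List[str]:
    """Return IDs of utterances that both verifiers judged but labelled differently."""
    shared = first.keys() & second.keys()
    return sorted(utt for utt in shared if first[utt] != second[utt])


# Hypothetical judgements from two independent verifiers.
verifier_a = {"EN-105.nlang": "ok", "EN-105.days": "ok"}
verifier_b = {"EN-105.nlang": "ok", "EN-105.days": "bad"}

to_reexamine = find_disagreements(verifier_a, verifier_b)
```

Only the utterances returned by such a comparison would need to be reexamined; those on which both verifiers agree can be carried over unchanged.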
Initially we asked the verifiers to make two judgements that are not now included in the release:
The Center for Spoken Language Understanding (CSLU) distributes corpora to commercial entities and academic institutions for a fee. Commercial entities can use these corpora for research and also for creating commercial products, for example by generating acoustic models for speech recognition.
To place your order:
1. Click on the type of license you wish to order: Academic or non-profit entity or Commercial entity.
2. Terms of the license agreement can be viewed by clicking on the word "terms".
3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.
4. If information on the "Order Contents" screen is correct, press "Check out".
5. On the next screen, a brief "Intended Use" is required. For "Recipient Scientist Information", enter the appropriate information for yourself or, if you are placing the order for another person, enter that information. We will use this information should we have questions about the order, payment or shipping address.
6. Once your payment has been received and verified by OHSU, your order will be approved by Technology Transfer & Business Development and then the DVD will be sent out by the Center for Spoken Language Understanding by FedEx within 5-10 business days.
For demos and more information, visit the CSLU website at:
http://www.cslu.ogi.edu/corpora/corpCurrent.html
Related Technologies:
- OHSU # 1195 — Clear-Speech Corpus, Speaker JPH
For more information, contact:
Michele Gunness
Senior Technology Development Manager
503-494-4184
