CORPORA from CSLU: Multilanguage telephone speech v1.2.
OHSU # 0681-I
Inventors:
- CSLU, SOM CSLU
Technology Overview
The Multi-language Telephone Speech Corpus consists of telephone speech in 11 languages: English,
Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and
Vietnamese. The corpus contains fixed-vocabulary utterances
(e.g., days of the week) as well as fluent continuous speech. The current release
includes recorded utterances from about 2052 speakers, for a total of about 38.5
hours of speech.
Recording conditions
Each subject called the CSLU data collection system by dialing a toll-free number. An analog
telephone line was connected to a Gradient Technologies box, which recorded the
data from incoming calls. The sampling rate was 8 kHz, and the files were
stored in 16-bit linear format on a UNIX file
system. Each utterance was recorded as a separate file.
This corpus was collected and developed in 1992.
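The recording format above (8 kHz sampling, 16-bit linear samples) fixes the relationship between file size and duration. The following is a minimal sketch of that arithmetic, not part of the corpus distribution. It assumes single-channel audio, which the description of a single telephone line suggests but does not state, and the `header_bytes` parameter is hypothetical, since the listing does not say whether the files carry a header.

```python
SAMPLE_RATE_HZ = 8000    # 8 kHz, as stated in the recording conditions
BYTES_PER_SAMPLE = 2     # 16-bit linear samples
# Assumption: one channel (mono); the listing does not state this explicitly.
BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE


def bytes_for(seconds):
    """Return the number of sample bytes needed for a recording of this length."""
    return int(seconds * BYTES_PER_SECOND)


def duration_seconds(num_bytes, header_bytes=0):
    """Estimate the duration of a recording from its size in bytes.

    header_bytes is a hypothetical parameter: set it if the files
    turn out to carry a fixed-size header before the samples.
    """
    payload = num_bytes - header_bytes
    if payload < 0:
        raise ValueError("file is smaller than the stated header size")
    return payload / BYTES_PER_SECOND


if __name__ == "__main__":
    print(bytes_for(60))                # size of a one-minute recording
    print(bytes_for(38.5 * 3600))       # sample data for about 38.5 hours
    print(duration_seconds(960_000))    # duration of a 960,000-byte recording
```

Under these assumptions a one-minute recording occupies 960,000 bytes of sample data, and 38.5 hours of speech comes to roughly 2.2 GB.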
Subject Population
Most subjects responded to postings on
Usenet newsgroups. They were asked to contribute their voices
to science to help with the research.
Annotation
As per the protocol (see below), each caller was asked to
speak for one minute about any topic. In six of the languages, some of
these recordings, referred to as "stories", were selected for hand-generated
fine-phonetic transcription. The languages were English (208),
German (101), Hindi (68), Japanese (64), Mandarin (70), and Spanish (108). The
numbers in parentheses indicate the number of "stories" transcribed for
each language.
References
Y. K. Muthusamy, "A Segmental Approach to Automatic Language
Identification," Ph.D. Thesis, OGI Technical
Report No. CSLU 93-002, Nov. 24, 1993.
Y. K. Muthusamy, R. A. Cole and B. T. Oshika, "The OGI Multi-language
Telephone Speech Corpus," Proceedings of the International
Conference on Spoken Language Processing, Banff, Alberta, Canada, October
1992.
To place your order:
1. Click on the type of license you wish to order. The fee is $50 for an
academic or non-profit entity and $3,000 for a commercial entity.
2. Terms of the license agreement can be viewed by clicking on the word
"terms".
3. You agree to the terms of the license agreement when you click on "Add to
Order" and proceed to the next screen.
4. If the information on the "Order Contents" screen is correct, press
"Check out".
5. On the next screen, a brief "Intended Use" is required. For
"Recipient Scientist Information", enter your own information or, if you
are placing the order for another person, that person's
information. We will use this information should we have questions about
the order, payment, or shipping address.
6. Once your payment has been received and verified by OHSU, your order will
be approved by Technology Transfer & Business Development, and the
DVD will then be sent out by the Center for Spoken Language Understanding by
FedEx within 5-10 business days.
For more information and demos, visit the CSLU Corpora website at:
http://www.cslu.ogi.edu/corpora/corpCurrent.html
For more information, contact:
Michele Gunness
Senior Technology Development Manager
503-494-4184
