CORPORA from CSLU: Multilanguage telephone speech v1.2.
OHSU # 0681-I
- CSLU, SOM CSLU
The Multi-language Telephone Speech Corpus consists of telephone speech from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese. The corpus contains fixed vocabulary utterances (e.g. days of the week) as well as fluent continuous speech. The current release includes recorded utterances from about 2052 speakers, for a total of about 38.5 hours of speech.
Each subject called the CSLU data collection system by dialing a toll-free number. An analog telephone line was connected to a Gradient Technologies box, which recorded the data from incoming calls. The sampling rate was 8 kHz, and the files were stored in 16-bit linear format on a UNIX file system. Each utterance was recorded as a separate file.
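As a rough illustration of the recording format described above, the following Python sketch decodes 16-bit linear samples and estimates duration and storage at 8 kHz. It assumes headerless (raw) PCM and little-endian byte order, neither of which is stated in this description; the function names are hypothetical, and files in the actual release may carry a header that must be skipped first.

```python
import struct

SAMPLE_RATE_HZ = 8000   # sampling rate given in the corpus description
BYTES_PER_SAMPLE = 2    # 16-bit linear PCM


def decode_pcm16(raw: bytes, little_endian: bool = True) -> list:
    """Decode headerless 16-bit linear PCM bytes into signed integers.

    Byte order is an assumption; pass little_endian=False for big-endian data.
    """
    if len(raw) % BYTES_PER_SAMPLE:
        raise ValueError("byte count is not a multiple of the sample size")
    count = len(raw) // BYTES_PER_SAMPLE
    fmt = ("<" if little_endian else ">") + f"{count}h"
    return list(struct.unpack(fmt, raw))


def duration_seconds(num_samples: int) -> float:
    """Length of a recording in seconds at the 8 kHz sampling rate."""
    return num_samples / SAMPLE_RATE_HZ


def storage_bytes(hours: float) -> int:
    """Uncompressed size of `hours` of single-channel audio in this format."""
    return round(hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
```

Under these assumptions, one second of speech occupies 16,000 bytes, so `storage_bytes(38.5)` gives roughly 2.2 GB of raw audio for the whole release, not counting any headers or transcription files.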
This corpus was collected and developed in 1992.
Most subjects were respondents to postings on Usenet newsgroups. Subjects were asked to contribute their voice to science to help with the research.
As per the protocol (see below), each caller was asked to speak for one minute about any topic. In six of the languages, some of these files, referred to as "stories", were selected for hand-generated fine phonetic transcription. The languages were: English (208), German (101), Hindi (68), Japanese (64), Mandarin (70), Spanish (108). The numbers in parentheses indicate the number of "stories" transcribed for that language.
Y. K. Muthusamy, "A Segmental Approach to Automatic Language Identification," Ph.D. Thesis, OGI Technical Report No. CSLU 93-002, Nov. 24, 1993.
Y. K. Muthusamy, R. A. Cole and B. T. Oshika, "The OGI Multi-language Telephone Speech Corpus," Proceedings of the International Conference on Spoken Language Processing, Banff, Alberta, Canada, October 1992.
To place your order:
1. Click on the type of license you wish to order. The academic or non-profit entity fee is $50; the commercial entity fee is $3,000.
2. Terms of the license agreement can be viewed by clicking on the word "terms".
3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.
4. If information on the "Order Contents" screen is correct, press "Check out".
5. On the next screen, a brief "Intended Use" is required. For "Recipient Scientist Information", enter your own information or, if you are placing the order for another person, that person's information. We will use this information should we have questions about the order, payment or shipping address.
6. Once OHSU has received and verified your payment, Technology Transfer & Business Development will approve your order, and the Center for Spoken Language Understanding will send the DVD by FedEx within 5-10 business days.
For more information and demos, visit the CSLU Corpora website at:
For more information, contact:
Senior Technology Development Manager