CORPORA from CSLU: Names v1.3
OHSU # 0681-J
The Names Corpus is a collection of 24,245 first and last name utterances from 20,184 speakers. The utterances were taken from many other telephone speech data collections completed at the CSLU, in which callers were asked to say their first and last names, or to leave their name and address to receive an award coupon (addresses are not included in the corpus). Each file in the Names corpus has an orthographic transcription following the CSLU Labeling Guide. In addition, to take advantage of the phonemic variability, 24,245 of the utterances have been phonetically transcribed. Files were chosen for phonetic transcription by a process that selected those suspected to contain phonetic contexts that had not yet been transcribed.
Each subject called the CSLU data collection system by dialing a toll-free number. Depending on the data collection the caller was taking part in, the call was recorded over either an analog line or a digital line.
The analog telephone line was connected to a Gradient Technologies box, which recorded the data from incoming calls. The sampling rate was 8 kHz and the files were stored in 16-bit linear format on a UNIX file system. Each utterance was recorded as a separate file.
The digital data were collected with the CSLU T1 digital data collection system described in "Digital Data Collection at CSLU". The sampling rate was 8 kHz and the files were stored in 8-bit mu-law format on a UNIX file system.
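Because the recordings come in two encodings (16-bit linear and 8-bit mu-law, both at 8 kHz), it can help to bring them to a common representation before processing. The following is a minimal Python sketch of the standard G.711 mu-law expansion and of unpacking 16-bit linear samples. It assumes the raw sample bytes have already been extracted from each file; the function names, the little-endian default, and the absence of any file header handling are illustrative assumptions, not details taken from the corpus documentation.

```python
import struct

SAMPLE_RATE_HZ = 8000  # sampling rate stated for both analog and digital recordings


def mulaw_byte_to_linear(code: int) -> int:
    """Expand one 8-bit mu-law code (G.711) to a signed linear sample."""
    code = ~code & 0xFF            # mu-law bytes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_mulaw(data: bytes) -> list[int]:
    """Decode a buffer of mu-law bytes into linear samples."""
    return [mulaw_byte_to_linear(b) for b in data]


def decode_linear16(data: bytes, byte_order: str = "<") -> list[int]:
    """Unpack 16-bit linear samples.

    byte_order is "<" (little-endian) or ">" (big-endian); the correct
    value for the corpus files is an assumption to verify against the data.
    """
    count = len(data) // 2
    return list(struct.unpack(f"{byte_order}{count}h", data[: count * 2]))


def duration_seconds(num_samples: int) -> float:
    """Length of an utterance in seconds at the 8 kHz sampling rate."""
    return num_samples / SAMPLE_RATE_HZ
```

After decoding, utterances from both collection paths are plain lists of signed integers at the same sampling rate, so a downstream tool can treat them uniformly.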
Subjects whose utterances are included in this corpus responded to Usenet postings, radio advertisements, newspaper advertisements, and interoffice memos.
All of the files included in this corpus have corresponding non-time-aligned word-level transcriptions that comply with the conventions in the CSLU Labeling Guide. In addition, 24,245 files have phonetic transcriptions, also following the conventions in the Labeling Guide.
This corpus is described in "Corpus development activities at the Center for Spoken Language Understanding".
The Center for Spoken Language Understanding (CSLU) distributes corpora to commercial entities and academic institutions for a fee. Commercial entities can use these corpora for research and also to create commercial products, for example by generating acoustic models for speech recognition.
To place your order:
1. Click on the type of license you wish to order. The fee is $50 for an academic or non-profit entity and $3,000 for a commercial entity.
2. Terms of the license agreement can be viewed by clicking on the word "terms".
3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.
4. If information on the "Order Contents" screen is correct, press "Check out".
5. On the next screen, a brief "Intended Use" is required. For "Recipient Scientist Information", enter your own information or, if you are placing the order for another person, that person's information. We will use this information if we have questions about the order, payment, or shipping address.
6. Once your payment has been received and verified by OHSU, your order will be approved by Technology Transfer & Business Development. The Center for Spoken Language Understanding will then send the DVD by FedEx within 5-10 business days.
For more information and demos, visit the CSLU Corpora website at:
- CSLU, SOM CSLU
For more information, contact:
Technology Development Manager