CORPORA from CSLU: Alphadigit.
OHSU # 0681-B
The Alphadigit Corpus is a collection of 78,044 examples from 3,025 speakers saying six-digit strings of letters and digits over the telephone. A total of about 82 hours of speech is included in Release 1.3. Each file has an orthographic transcription as well as a time-aligned transcription.
Each subject called the CSLU data collection system by dialing a toll-free number. The data were recorded directly from a digital phone line, without digital-to-analog or analog-to-digital conversion at the recording end.
The digital data were collected with the CSLU T1 digital data collection system. The sampling rate was 8 kHz and the files were stored in 8-bit mu-law format on a UNIX file system. The speech files have been converted to the standard 16-bit linearly encoded RIFF file format.
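The conversion from 8-bit mu-law to 16-bit linear samples can be illustrated with a minimal sketch. It assumes the standard ITU-T G.711 mu-law encoding used on North American digital telephone lines; the corpus description does not say which tool CSLU actually used, and the function names `mulaw_to_linear` and `decode_mulaw` are hypothetical.

```python
import struct

BIAS = 0x84  # 132, the offset used by G.711 mu-law companding


def mulaw_to_linear(code: int) -> int:
    """Decode one 8-bit G.711 mu-law code word to a signed 16-bit sample."""
    code = ~code & 0xFF            # code words are transmitted bit-inverted
    sign = code & 0x80             # top bit: sign
    exponent = (code >> 4) & 0x07  # next three bits: segment number
    mantissa = code & 0x0F         # low four bits: step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_mulaw(data: bytes) -> bytes:
    """Convert raw mu-law bytes to little-endian 16-bit linear PCM bytes."""
    samples = [mulaw_to_linear(b) for b in data]
    return struct.pack(f"<{len(samples)}h", *samples)
```

Each input byte becomes two output bytes, so a decoded file is twice the size of its mu-law original. Since the distributed files are already 16-bit linear, this step is only needed for material that is still in raw mu-law form.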
Subjects whose utterances are included in this corpus are respondents to Usenet postings. Respondents were required to fill out a form on the World Wide Web and register for the data collection. In response to their registration, a list of letters and digits was emailed to them along with instructions on how to participate.
All of the files included in this corpus have corresponding non-time-aligned word-level transcriptions that comply with the conventions in the CSLU Labeling Guide (available on CSLU's publications page).
A total of 78,044 utterances were recorded. Every letter and digit, including zero and oh, is in the corpus (table of all letters and digits). Sometimes the callers said extra words or partial words. Instead of removing these utterances from the corpus, we have left them in for researchers who may be interested in dealing with not-so-perfect alpha-digit strings.
The filename of each speech file in the Alphadigit Corpus encodes the call number, utterance type, and file type. Here is a typical filename:
AD-2.p17.wav
AD -- The "AD" prefix indicates the corpus name, i.e. Alphadigit.
2 -- The number between the hyphen and the first dot is the call number. Call numbers are described below.
p17 -- The string between the first and second dot is the utterance type. The utterance types are of the form "p##". The number after the p is the utterance number and reflects the order in which each utterance was recorded during the call. That is, p1 came first, p2 second, etc. Some of the callers were asked to say 29 utterances and some were asked to say 19 utterances.
wav -- The final three-letter extension indicates the file type.
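The naming scheme described above can be parsed mechanically. The following is a minimal sketch, assuming filenames have exactly the form implied by the field descriptions (the "AD" prefix, a hyphen, the call number, a dot, "p" plus the utterance number, a dot, and the extension), as in `AD-2.p17.wav`. The names `FILENAME_RE`, `AlphadigitName`, and `parse_filename` are hypothetical and not part of the corpus distribution.

```python
import re
from typing import NamedTuple

# Assumed pattern: AD-<call number>.p<utterance number>.<extension>
FILENAME_RE = re.compile(
    r"^AD-(?P<call>\d+)\.p(?P<utt>\d+)\.(?P<ext>[A-Za-z0-9]+)$"
)


class AlphadigitName(NamedTuple):
    call_number: int       # number between the hyphen and the first dot
    utterance_number: int  # digits after "p"; order of recording in the call
    file_type: str         # extension, e.g. "wav"


def parse_filename(name: str) -> AlphadigitName:
    """Split an Alphadigit filename into its three encoded fields."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an Alphadigit filename: {name!r}")
    return AlphadigitName(int(match["call"]), int(match["utt"]), match["ext"])
```

For example, `parse_filename("AD-2.p17.wav")` yields call number 2, utterance number 17, and file type "wav". Sorting a call's files by `utterance_number` recovers the order in which the utterances were recorded.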
The "wav" files contain speech data and use the RIFF standard wav file format. This file format is 16-bit linearly encoded.
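A quick way to confirm that a speech file matches the stated format is to read its header with Python's standard `wave` module. This is a minimal sketch: the helper names `describe_wav` and `is_expected_format` are hypothetical, and the 8 kHz check assumes the converted files keep the original sampling rate, which the description implies but does not state outright.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read the basic format fields from a RIFF wav file header."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": 8 * w.getsampwidth(),
            "duration_s": w.getnframes() / rate,
        }


def is_expected_format(info: dict) -> bool:
    """Check for 16-bit samples at the assumed 8 kHz sampling rate."""
    return info["sample_rate_hz"] == 8000 and info["bits_per_sample"] == 16
```

Calling `describe_wav` on each file and summing `duration_s` gives the total amount of speech, which can be compared against the roughly 82 hours quoted for Release 1.3.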
The transcriptions are contained in a standard text file.
The Center for Spoken Language Understanding (CSLU) distributes corpora to commercial entities and academic institutions for a fee. Commercial entities can use these corpora for research and also for creating commercial products, for example by generating acoustic models for speech recognition.
To place your order:
1. Click on the type of license you wish to order: Academic or non-profit entity or Commercial entity.
2. Terms of the license agreement can be viewed by clicking on the word "terms".
3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.
4. If information on the "Order Contents" screen is correct, press "Check out".
5. On the next screen, a brief "Intended Use" is required. For "Recipient Scientist Information", enter the appropriate information for yourself or, if you are placing the order for another person, enter that person's information. We will use this information if we have questions about the order, payment, or shipping address.
6. Once your payment has been received and verified by OHSU, your order will be approved by Technology Transfer & Business Development and then the DVD will be sent out by the Center for Spoken Language Understanding by FedEx within 5-10 business days.
For demos and more information, visit the CSLU website at:
- CSLU, SOM CSLU
For more information, contact:
Technology Development Manager