CORPORA from CSLU: National cellular v2.3
OHSU # 0681-K
The Cellular Corpus consists of cellular telephone speech from 2336 callers from locations throughout the United States. The data collection protocol contains requests for fixed vocabulary and continuous speech utterances. A total of about one minute of speech from each caller is collected.
The data were collected with the CSLU T1 digital data collection system. The sampling rate was 8 kHz and the files were stored in 8-bit mu-law format on a UNIX file system.
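For readers unfamiliar with the original storage format, the following is a minimal sketch of standard G.711 mu-law expansion, which maps each 8-bit mu-law byte to a 16-bit linear sample. It is not the CSLU conversion tool, and the function names are hypothetical. The files in the distribution are already stored as 16-bit linear, so this is only illustrative.

```python
# Sketch of G.711 mu-law expansion (8-bit code -> 16-bit linear sample).
# Function names are hypothetical; this is not the CSLU conversion tool.

BIAS = 0x84  # bias used by the G.711 mu-law companding scheme


def mulaw_to_linear(code):
    """Expand one mu-law byte (0-255) to a signed 16-bit linear value."""
    if not 0 <= code <= 0xFF:
        raise ValueError("mu-law code must be in the range 0-255")
    code = ~code & 0xFF            # mu-law bytes are stored bit-inverted
    sign = code & 0x80             # top bit: sign
    exponent = (code >> 4) & 0x07  # next three bits: segment number
    mantissa = code & 0x0F         # low four bits: step within the segment
    sample = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -sample if sign else sample


def decode_bytes(data):
    """Expand a bytes object of mu-law codes to a list of linear samples."""
    return [mulaw_to_linear(b) for b in data]
```

Each decoded value fits in a signed 16-bit integer, which matches the 16-bit linear coding used for the distributed files.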
File Name Conventions
A call is composed of the series of files recorded during each recording session. Every call is identified by a unique call number, and each file in the call is further identified by an utterance type.
The filename identifies the call number and the utterance type.
The first two capitalized letters, "NC", indicate the corpus, National Cellular.
The next 5 digits are the call number. The last character (a letter or a digit) indicates the utterance type. The utterance types are shown in this table:
A background noise
D date of birth
E digital or analog
F familiar license plate number
G familiar phone number
H where did you grow up
I handset or microphone (not in vehicle)
J last name
L male or female
M native language
O spell last name
1 yes or no
2 describe your environment
3 describe the traffic
4 how fast are you going
5 handset or microphone
The word "WAV" indicates that this is a speech file.
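The naming convention above can be sketched as a small parser. This is an illustration only: it assumes the name is laid out as "NC", five digits, one type character, then a dot and "WAV" with no other separators, which the text does not spell out. The example filename and the function name are hypothetical.

```python
import re

# Utterance type codes, copied from the table above.
UTTERANCE_TYPES = {
    "A": "background noise",
    "D": "date of birth",
    "E": "digital or analog",
    "F": "familiar license plate number",
    "G": "familiar phone number",
    "H": "where did you grow up",
    "I": "handset or microphone (not in vehicle)",
    "J": "last name",
    "L": "male or female",
    "M": "native language",
    "O": "spell last name",
    "1": "yes or no",
    "2": "describe your environment",
    "3": "describe the traffic",
    "4": "how fast are you going",
    "5": "handset or microphone",
}

# ASSUMPTION: "NC" + 5-digit call number + 1 type character + ".WAV",
# matched case-insensitively. The exact layout is not given in the text.
_NAME_RE = re.compile(r"^NC([0-9]{5})([A-Z0-9])\.WAV$", re.IGNORECASE)


def parse_filename(name):
    """Return (call_number, type_code, type_description) for a filename."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognized National Cellular filename: {name!r}")
    call_number = int(match.group(1))
    code = match.group(2).upper()
    if code not in UTTERANCE_TYPES:
        raise ValueError(f"unknown utterance type {code!r} in {name!r}")
    return call_number, code, UTTERANCE_TYPES[code]
```

Under these assumptions, a hypothetical name such as NC00123A.WAV would be read as call 123, utterance type A (background noise).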
Speech File Formats
The speech files in this distribution are stored as RIFF WAV files with 8 kHz sampling and 16-bit linear coding.
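A quick way to confirm that a file matches the format described above is to read its header with Python's standard wave module. The sketch below is a suggestion, not part of the distribution, and the function names are hypothetical. The channel count is reported but not checked, since the text does not state it.

```python
import wave


def read_format(path):
    """Return (sample_rate_hz, bits_per_sample, channels, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = 8 * wav.getsampwidth()   # sample width is reported in bytes
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, bits, channels, duration


def matches_documented_format(path):
    """True if the file is 8 kHz, 16-bit, as the description states."""
    rate, bits, _channels, _duration = read_format(path)
    return rate == 8000 and bits == 16
```

A file that fails this check may have been converted or resampled after it left the distribution.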
Distribution directory structure
At the top level of the distribution there are two directories: speech and trans. Immediately below the top level of each directory there are several numbered subdirectories (0, 1, 2, etc.). These numbered directories hold the files, split by call number div 10. That is, subdirectory 0 holds the files for calls 0-9, subdirectory 1 holds the files for calls 10-19, and so on.
Each utterance in the National Cellular corpus has an orthographic transcription. The transcriptions are in the trans directory.
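The subdirectory rule (call number div 10) can be sketched as follows. The helper names and the root path in the usage note are hypothetical. Only the directory is computed here, because the text does not describe how the transcription files themselves are named.

```python
from pathlib import PurePosixPath


def call_subdirectory(call_number):
    """Return the numbered subdirectory for a call: call number div 10."""
    if call_number < 0:
        raise ValueError("call number must be non-negative")
    return str(call_number // 10)


def speech_dir(root, call_number):
    """Directory under <root>/speech that holds the files of this call."""
    return PurePosixPath(root) / "speech" / call_subdirectory(call_number)


def trans_dir(root, call_number):
    """Directory under <root>/trans that holds the call's transcriptions."""
    return PurePosixPath(root) / "trans" / call_subdirectory(call_number)
```

For example, with a hypothetical root of /data/nc, the speech files for call 123 would be looked up under /data/nc/speech/12.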
The Center for Spoken Language Understanding (CSLU) distributes corpora to commercial entities and academic institutions for a fee. Commercial entities can use these corpora for research and also for creating commercial products, for example by generating acoustic models for speech recognition.
To place your order:
1. Click on the type of license you wish to order. The Academic or non-profit entity fee is $50; Commercial entity fee is $5,500.
2. Terms of the license agreement can be viewed by clicking on the word "terms".
3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.
4. If information on the "Order Contents" screen is correct, press "Check out".
5. On the next screen, a brief "Intended Use" is required. For "Recipient Scientist Information", enter your own information or, if you are placing the order for another person, that person's information. We will use this information if we have questions about the order, payment, or shipping address.
6. Once your payment has been received and verified by OHSU, your order will be approved by Technology Transfer & Business Development. The Center for Spoken Language Understanding will then send the DVD by FedEx within 5-10 business days.
For more information and demos, visit the CSLU Corpora website at:
- CSLU, SOM CSLU
For more information, contact:
Technology Development Manager