CORPORA from CSLU: Stories v1.2
OHSU # 0681-Q
Inventors:
- CSLU, SOM CSLU
General Description
The Stories Corpus is made up of extemporaneous speech collected
from English speakers in the CSLU Multi-language Telephone Speech data
collection. Each speaker was asked to speak on a topic of their choice for one
minute. These utterances make up the Stories Corpus.
Recording Details
The data were recorded from an analog line using a Gradient
Technologies analog-to-digital conversion box. The file format is 8 kHz,
16-bit linear, with a 1024-byte NIST Sphere header.
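As a rough illustration of what this format implies for file sizes, the sketch below estimates the duration of a recording from its size in bytes. It assumes a single channel (not stated above, though usual for telephone speech) and that nothing follows the samples in the file. The function and constant names are illustrative only.

```python
HEADER_BYTES = 1024      # NIST Sphere header size stated above
SAMPLE_RATE_HZ = 8000    # 8 kHz sampling rate
BYTES_PER_SAMPLE = 2     # 16-bit linear samples
CHANNELS = 1             # assumption: mono telephone audio


def estimated_duration_seconds(file_size_bytes: int) -> float:
    """Estimate audio duration from the total file size, header included."""
    if file_size_bytes < HEADER_BYTES:
        raise ValueError("file is smaller than the 1024-byte header")
    payload = file_size_bytes - HEADER_BYTES
    return payload / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
```

Under these assumptions, a one-minute story would occupy 1024 + 60 x 16000 = 961024 bytes.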
File Naming Convention
File names follow this convention:
ENcall-1003-G.story-bt.txt
The first field ("ENcall") is the prefix indicating the corpus to
which the data belong, and the second field ("100") is a unique ID
number for the speaker. The remaining fields can be ignored.
The audio and text files are grouped into subdirectories named for
the call number divided by 10, with the remainder dropped. For example, the
files for call 103 are in the /10 subdirectory.
The /trans and /labels directories exactly parallel the structure
of the /speech directory.
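The directory rule can be sketched as follows. This is a minimal sketch that assumes "divided by 10" means integer division, which is what the call 103 example suggests. The root path and function names are hypothetical and not part of the distribution.

```python
from pathlib import PurePosixPath


def call_subdirectory(call_number: int) -> str:
    """Name of the subdirectory for a call (assumed: integer division by 10)."""
    return str(call_number // 10)


def parallel_directories(root: str, call_number: int) -> dict:
    """Directories holding one call's files in the three parallel trees."""
    sub = call_subdirectory(call_number)
    return {
        name: str(PurePosixPath(root) / name / sub)
        for name in ("speech", "trans", "labels")
    }
```

For instance, `parallel_directories("/stories", 103)` points at `/stories/speech/10`, `/stories/trans/10` and `/stories/labels/10`, where `/stories` stands in for wherever the corpus is mounted.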
File Formats
The .wav files use the RIFF standard file format with 16-bit linear
encoding.
Transcriptions
The text transcriptions follow the non-time-aligned word-level
conventions described in the CSLU Labeling Guide.
Phonetic transcriptions are plain text files that carry
time-aligned phonetic labels. The first two lines of the file are a header that
defines the length of a "frame" in milliseconds. The rest of the file consists
of lines, each with two numbers that define a frame range and a label that
applies to that range. For example:
MillisecondsPerFrame: 1.000000
END OF HEADER
2 113 .pau
113 191 w
191 267 ^
267 395 n
So, we can see here that a frame corresponds to 1 millisecond (ms)
of time, and that from 2 to 113 ms into the file, there is a pause (.pau), with
the first phoneme (w) starting at 113 ms and stretching to 191 ms.
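A file in this layout could be read along the following lines. This is a minimal sketch that assumes whitespace-separated fields and exactly the two header lines shown in the example. The function name and return shape are illustrative choices, not part of any CSLU tool.

```python
def parse_label_file(text: str):
    """Parse a time-aligned label file.

    Returns (ms_per_frame, segments), where each segment is a tuple
    (start_ms, end_ms, label).
    """
    lines = text.splitlines()
    ms_per_frame = None
    body_start = None
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("MillisecondsPerFrame:"):
            ms_per_frame = float(stripped.split(":", 1)[1])
        elif stripped == "END OF HEADER":
            body_start = i + 1
            break
    if ms_per_frame is None or body_start is None:
        raise ValueError("missing MillisecondsPerFrame or END OF HEADER line")

    segments = []
    for line in lines[body_start:]:
        parts = line.split(maxsplit=2)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        start, end, label = int(parts[0]), int(parts[1]), parts[2].strip()
        # Convert frame indices to milliseconds using the header value.
        segments.append((start * ms_per_frame, end * ms_per_frame, label))
    return ms_per_frame, segments
```

Keeping the frame length as a parsed value rather than a constant lets the same function handle files whose header declares a different frame length.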
The word-level transcription files follow the same format, with word
labels in place of the phonetic labels. The .com files that are found with the
.wrd files contain information about breathing during the speech. They are in a
similar time-aligned format.
Labels
The lola files are ASCII "location
and label" files. They are similar to the ".phn" files of the TIMIT database
except:
- the locations are given in a unit of time other than the sample.
- there is a short header saying what this unit is
Each file in this distribution has the header:
MillisecondsPerFrame: 3.0
END OF HEADER
After that is a series of lines, one per segment, of the form:
[begin frame] [end frame + 1] label
For example:
200 237 ah
237 289 m
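Because the second number on each line is one past the last frame of the segment, the frame count is simply the difference of the two numbers. The sketch below works this out for one line, using the 3.0 ms frame length from the header above. It assumes that frame n covers the interval from n to n + 1 frame lengths, which the format description does not state explicitly. The function name is illustrative.

```python
def lola_segment(begin_frame: int, end_plus_one: int, ms_per_frame: float = 3.0):
    """Interpret one lola line whose second field is the end frame + 1."""
    return {
        "first_frame": begin_frame,
        "last_frame": end_plus_one - 1,            # inclusive last frame
        "n_frames": end_plus_one - begin_frame,    # segment length in frames
        "start_ms": begin_frame * ms_per_frame,
        "end_ms": end_plus_one * ms_per_frame,     # assumed exclusive end time
    }
```

Called as `lola_segment(200, 237)`, this reports a last frame of 236 and a length of 37 frames. The exclusive end also means each segment's end value equals the next segment's begin value, as in the two example lines.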
The [ah] segment extends from frame 200 to frame 236 inclusive. The
end value is written as 237 for historical reasons.
The Center for Spoken Language Understanding (CSLU) distributes
corpora to commercial entities and academic institutions for a fee. Commercial
entities can use these corpora for research and also to create commercial
products, for example by generating acoustic models for speech recognition.
To place your order:
1. Click on the type of license you wish to order: Academic or non-profit entity or Commercial entity.
2. Terms of the license agreement can be viewed by clicking on the word "terms".
3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.
4. If information on the "Order Contents" screen is correct, press "Check out".
5. On the next screen, a brief "Intended Use" is required. For "Recipient Scientist Information", enter the appropriate information for yourself or, if you are placing the order for another person, enter that person's information. We will use this information if we have questions about the order, payment or shipping address.
6. Once your payment has been received and verified by OHSU, your order will be approved by Technology Transfer & Business Development. The Center for Spoken Language Understanding will then send the DVD by FedEx within 5-10 business days.
For demos and more information, visit the CSLU Corpora website at:
http://www.cslu.ogi.edu/corpora/corpCurrent.html
For more information, contact:
Michele Gunness
Senior Technology Development Manager
503-494-4184
