CORPORA from CSLU: The Spoltech Brazilian Portuguese v1.0
OHSU # 0681-R
The Spoltech Brazilian Portuguese corpus contains microphone speech from a variety of regions in Brazil with phonetic and orthographic transcriptions. The utterances consist of both read speech (for phonetic coverage) and responses to questions (for spontaneous speech). The corpus contains 480 speakers and 8119 separate utterances. A total of 2579 utterances have been transcribed at the word level (without time alignments), and 5505 utterances have been transcribed at the phoneme level (with time alignments). Protocol design, recording and transcription were performed by the Universidade Federal do Rio Grande do Sul and the Universidade de Caxias do Sul.
The data have been recorded at 44.1 kHz (mono, 16 bit) and stored in RIFF format. The recording was conducted with a direct connection from the microphone to the sound card. The sound card was SoundBlaster-compatible. For the prompted sentences, the sentence was hidden from view when recording began, so that the speaker might utter the sentence more naturally. Verification of the recording quality was performed immediately after each utterance recording; the data-collection software allowed the speaker to re-record utterances in case the recording was not of sufficient quality. The acoustic environment was not controlled, in order to allow for background conditions that would occur in application environments.
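The stated recording format can be checked file by file. The following is a minimal sketch, assuming the RIFF files are standard PCM WAVE files that Python's standard `wave` module can read; the function name `check_waveform` and the constants are illustrative and not part of the corpus distribution.

```python
import wave

# Expected recording format, taken from the description above.
EXPECTED_RATE_HZ = 44100
EXPECTED_CHANNELS = 1            # mono
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit


def check_waveform(path):
    """Return a list of mismatches between a WAVE file and the stated format.

    An empty list means the header agrees with 44.1 kHz, mono, 16 bit.
    Assumes the file is PCM WAVE; other RIFF payloads raise wave.Error.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channel count is {wav.getnchannels()}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {wav.getsampwidth()} bytes")
    return problems
```

Only the header is inspected, so the check is cheap enough to run over every file in the speech directory before further processing.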
The top level contains five directories: speech, trans, labels, misc, and docs. Each speaker has its own directory under these directories and is assigned an identification sequence consisting of the two letters "BR", a dash, and five digits. The two letters identify the corpus as the Spoltech Brazilian Portuguese corpus, and the five digits uniquely identify the speaker within this corpus. The labels directory contains the time-aligned phonetic transcriptions for each speaker. The trans directory contains the orthographic, non-time-aligned transcriptions for each speaker. The speech directory contains the 44.1 kHz waveforms for each speaker.
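The speaker identification scheme and directory layout can be expressed in a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, only the three directories described per speaker (speech, trans, labels) are covered, and the corpus root is assumed to be the current directory unless another is given.

```python
import re
from pathlib import PurePosixPath

# Directories that the description says hold per-speaker data.
PER_SPEAKER_DIRS = ("speech", "trans", "labels")

# "BR", a dash, and five digits, as described above.
SPEAKER_ID_PATTERN = re.compile(r"BR-(\d{5})")


def format_speaker_id(number):
    """Build the identification sequence for a speaker number, e.g. 130."""
    if not 0 <= number <= 99999:
        raise ValueError("speaker number must fit in five digits")
    return f"BR-{number:05d}"


def parse_speaker_id(speaker_id):
    """Return the speaker number from an ID such as 'BR-00130'."""
    match = SPEAKER_ID_PATTERN.fullmatch(speaker_id)
    if match is None:
        raise ValueError(f"not a Spoltech speaker ID: {speaker_id!r}")
    return int(match.group(1))


def speaker_dirs(number, root="."):
    """Map each per-speaker directory name to that speaker's subdirectory."""
    speaker_id = format_speaker_id(number)
    return {name: PurePosixPath(root) / name / speaker_id
            for name in PER_SPEAKER_DIRS}
```

For example, `speaker_dirs(130)["speech"]` yields the path `speech/BR-00130`.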
Each filename has the following structure:
CC = two-letter corpus name, always "BR"
NNNNN = unique speaker ID, ranging from 00001 to 00480
TTTTTT = utterance identification
EXT = filename extension
This file contains the waveform for speaker 130, who was answering the question "where were you born?" This file is located in the directory speech/BR-00130/.
Table 1. Balanced sentences for Spoltech corpus.
Sentence | Filename Identification
O presidente da República faz advertência ao ministro da justiça. | balsen1
O jovem atacante convenceu fácil na partida contra o México. | balsen2
Zorro é outro dos filmes muito procurados nas locadoras atualmente. | balsen3
A primeira maior guerra de todas foi entre o bem e o mal, o céu e a terra. | balsen4
É melhor nunca engomar os lençóis azuis debaixo de sol. | balsen5
Eu prefiro ser essa metamorfose ambulante. | balsen6
Do que ter aquela velha opinião formada sobre tudo. | balsen7
Andar a pé é mais barato. | balsen8
No meio do caminho tinha uma pedra. | balsen9
Dizem que alho é bom pra gripe e que paulada de amor não dói. | balsen10
A gente quer mais uma chance. | balsen11
Gosto de ovo frito para comer com arroz. | balsen12
Sol forte queima a pele fácil, fácil. | balsen13
Eu queria biscoito de mel. | balsen14
Os jogadores vitoriosos festejam o resultado. | balsen15
Os trapezistas dão piruetas arriscadas no ar. | balsen16
É necessário paciência de vez em quando. | balsen17
A postura antiética de alguns políticos é inaceitável. | balsen18
Os cantores devem cuidar bem da goela. | balsen19
Quase ninguém foi à reunião, faltou quórum. | balsen20
O orvalho da manhã às vezes é confundido com a chuva. | balsen21
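The identifiers in Table 1 follow a regular pattern, which makes it easy to separate read speech from prompted responses. The sketch below assumes the utterance identification appears in filenames exactly as listed in the table (balsen1 through balsen21); the function name is illustrative.

```python
import re

# Table 1 lists balsen1 through balsen21.
BALANCED_SENTENCE_COUNT = 21
_BALSEN_PATTERN = re.compile(r"balsen([1-9]\d*)")


def balanced_sentence_number(utterance_id):
    """Return the Table 1 sentence number for a 'balsenN' ID, else None."""
    match = _BALSEN_PATTERN.fullmatch(utterance_id)
    if match is None:
        return None
    number = int(match.group(1))
    if number > BALANCED_SENTENCE_COUNT:
        return None
    return number
```

Any identifier for which this returns None, such as a prompt response, is treated as not belonging to the balanced-sentence set.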
Table 2. Prompts for Spoltech corpus.
Prompt | Filename Identification
Conte de 1 até 10. | 1to10
Quantos anos você tem ? | age
Diga todos os meses do ano. | Allmonth
For more information, contact:
Technology Development Manager