OHSU

CORPORA from CSLU: The Spoltech Brazilian Portuguese v1.0

 

OHSU # 0681-R

General Description
The Spoltech Brazilian Portuguese corpus contains microphone speech from a variety of regions in Brazil with phonetic and orthographic transcriptions. The utterances consist of both read speech (for phonetic coverage) and responses to questions (for spontaneous speech). The corpus contains 480 speakers and 8119 separate utterances. A total of 2579 utterances have been transcribed at the word level (without time alignments), and 5505 utterances have been transcribed at the phoneme level (with time alignments). Protocol design, recording and transcription were performed by the Universidade Federal do Rio Grande do Sul and the Universidade de Caxias do Sul.
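As a rough illustration of the transcription coverage these figures imply, the arithmetic can be sketched as follows. The constants are the counts quoted above; the function name `coverage` is hypothetical and introduced only for this example.

```python
# Counts quoted in the General Description.
SPEAKERS = 480
UTTERANCES = 8119
WORD_LEVEL = 2579      # word-level transcriptions, no time alignments
PHONEME_LEVEL = 5505   # phoneme-level transcriptions, with time alignments


def coverage(part: int, total: int = UTTERANCES) -> float:
    """Return `part` as a percentage of `total`."""
    return 100.0 * part / total


print(f"word level:    {coverage(WORD_LEVEL):.1f}% of utterances")
print(f"phoneme level: {coverage(PHONEME_LEVEL):.1f}% of utterances")
print(f"mean utterances per speaker: {UTTERANCES / SPEAKERS:.1f}")
```

Under these counts, roughly a third of the utterances carry word-level transcriptions and about two thirds carry time-aligned phoneme-level transcriptions.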

Recording Details
The data were recorded at 44.1 kHz (mono, 16 bit) and stored in RIFF format. Recording used a direct connection from the microphone to a SoundBlaster-compatible sound card. For the prompted sentences, the sentence was hidden from view when recording began, so that the speaker would utter it more naturally. Recording quality was verified immediately after each utterance; the data-collection software allowed the speaker to re-record an utterance whose quality was insufficient. The acoustic environment was not controlled, in order to allow for background conditions that would occur in application environments.
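A minimal sketch of how a user might check that a waveform matches the recording format described above, using Python's standard `wave` module. It assumes the RIFF files are ordinary PCM WAVE files that `wave` can read, which this page does not state explicitly; the function name `matches_documented_format` is hypothetical.

```python
import wave

# Format stated in the recording details: 44.1 kHz, mono, 16 bit.
EXPECTED_RATE_HZ = 44100
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bit = 2 bytes per sample


def matches_documented_format(path: str) -> bool:
    """Return True if the WAVE file at `path` is 44.1 kHz, mono, 16 bit.

    Assumes the file is plain PCM WAVE readable by the `wave` module.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )
```

Calling `matches_documented_format` on a file from the speech directory would then flag any waveform that deviates from the stated format, for example after a resampling step.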

Directory Structure
The top level contains five directories: speech, trans, labels, misc, and docs. Each speaker has a directory of their own under these directories and is assigned an identification sequence consisting of the two letters "BR", a dash, and five digits. The two letters identify the corpus as the Spoltech Brazilian Portuguese corpus, and the five digits uniquely identify the speaker within this corpus. For each speaker, the labels directory contains the time-aligned phonetic transcriptions, the trans directory contains the orthographic, non-time-aligned transcriptions, and the speech directory contains the 44.1 kHz waveforms.
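The layout described above can be sketched as a small path helper. The `root` parameter and the function names are hypothetical; the 1 to 480 range is taken from the speaker ID range given in the filename description.

```python
from pathlib import PurePosixPath

# Per-speaker directories named in the directory description.
SPEAKER_DIR_KINDS = ("speech", "trans", "labels")


def speaker_id(number: int) -> str:
    """Format a speaker number as 'BR-' followed by five digits."""
    if not 1 <= number <= 480:
        raise ValueError(f"speaker number out of range: {number}")
    return f"BR-{number:05d}"


def speaker_dirs(number: int, root: str = ".") -> dict:
    """Map each directory kind to that speaker's directory under `root`."""
    sid = speaker_id(number)
    return {kind: PurePosixPath(root) / kind / sid for kind in SPEAKER_DIR_KINDS}
```

For speaker 130, `speaker_dirs(130)["speech"]` gives `speech/BR-00130`, the directory that holds that speaker's waveforms.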

Each filename has the following structure:

CC-NNNNN.TTTTTT.EXT

CC = two-letter corpus name, always "BR"

NNNNN = unique speaker ID, ranging from 00001 to 00480

TTTTTT = utterance identification

EXT = filename extension

For example:

BR-00130.birtplac.wav

This file contains the waveform for speaker 130, who was answering the question "where were you born?" This file is located in the directory speech/BR-00130/.
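The naming scheme above can be parsed with a regular expression, as in the following sketch. It assumes the utterance identification and the extension contain no dots; it does not enforce a fixed length for the utterance identification, because the identifiers listed in Tables 1 and 2 vary in length. The names `FILENAME_RE`, `UtteranceFile`, and `parse_filename` are hypothetical.

```python
import re
from typing import NamedTuple

# CC-NNNNN.TTTTTT.EXT, with CC always "BR" and NNNNN a five-digit speaker ID.
FILENAME_RE = re.compile(
    r"^(?P<corpus>BR)-(?P<speaker>\d{5})\.(?P<utterance>[^.]+)\.(?P<ext>[^.]+)$"
)


class UtteranceFile(NamedTuple):
    corpus: str
    speaker: int
    utterance: str
    ext: str


def parse_filename(name: str) -> UtteranceFile:
    """Split a corpus filename into corpus, speaker, utterance, and extension."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a Spoltech-style filename: {name!r}")
    return UtteranceFile(
        match["corpus"], int(match["speaker"]), match["utterance"], match["ext"]
    )
```

Applied to the example, `parse_filename("BR-00130.birtplac.wav")` yields speaker 130, utterance `birtplac`, and extension `wav`.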

 

Table 1. Balanced sentences for Spoltech corpus.

Sentence                                                                    Filename Identification
O presidente da República faz advertência ao ministro da justiça.           balsen1
O jovem atacante convenceu fácil na partida contra o México.                balsen2
Zorro é outro dos filmes muito procurados nas locadoras atualmente.         balsen3
A primeira maior guerra de todas foi entre o bem e o mal, o céu e a terra.  balsen4
É melhor nunca engomar os lençóis azuis debaixo de sol.                     balsen5
Eu prefiro ser essa metamorfose ambulante.                                  balsen6
Do que ter aquela velha opinião formada sobre tudo.                         balsen7
Andar a pé é mais barato.                                                   balsen8
No meio do caminho tinha uma pedra.                                         balsen9
Dizem que alho é bom pra gripe e que paulada de amor não dói.               balsen10
A gente quer mais uma chance.                                               balsen11
Gosto de ovo frito para comer com arroz.                                    balsen12
Sol forte queima a pele fácil, fácil.                                       balsen13
Eu queria biscoito de mel.                                                  balsen14
Os jogadores vitoriosos festejam o resultado.                               balsen15
Os trapezistas dão piruetas arriscadas no ar.                               balsen16
É necessário paciência de vez em quando.                                    balsen17
A postura antiética de alguns políticos é inaceitável.                      balsen18
Os cantores devem cuidar bem da goela.                                      balsen19
Quase ninguém foi à reunião, faltou quórum.                                 balsen20
O orvalho da manhã às vezes é confundido com a chuva.                       balsen21

Table 2. Prompts for Spoltech corpus.

Prompt                                              Filename Identification
Conte de 1 até 10.                                  1to10
Quantos anos você tem?                              age
Diga todos os meses do ano.                         Allmonth
Diga todos os dias da semana.                       allweek
Quando você nasceu?                                 birtdate / birthdate
Onde você nasceu?                                   Birtplac
Qual estado brasileiro você gostaria de conhecer?   brstate
Qual o estado em que você nasceu?
Qual a cidade em que você nasceu?                   City
Qual a sua cor preferida?                           Color
Qual país você gostaria de conhecer?                Country
Que dia é hoje?                                     Date
Você gostou de responder a esse questionário?       Didulike
Você estuda?                                        Doustudy
Você trabalha?                                      Douwork
Qual o nome do seu pai?                             Fathname
Qual o seu prato preferido?                         Food
Qual seu interesse neste evento?                    free1
Fale livremente.                                    free2
Quantos irmãos você tem?                            Howmany
Em que mês você nasceu?                             Month
Que dia do mês é hoje?                              Monthday
Qual o nome da sua mãe?                             Mothname
Diga três dezenas.                                  numb10
Diga três centenas.                                 numb100
Em que estação do ano você nasceu?                  Season
Qual é o seu sexo?                                  Sex
Que horas são?                                      Time
Que horas são agora?
Para onde você tem vontade de viajar?               Travel
Para onde você gostaria de viajar?
Que dia da semana é hoje?                           weekday
Qual é a sua profissão?                             Work
Você achou difícil este questionário?               Yesno
Você tem irmãos? (Sim/Não)                          yesno1
Sim/Não                                             yesno2
Sim/Não                                             yesno3
Sim/Não                                             yesno4
Qual o seu CEP? (dígito a dígito)                   zipcode
Qual o seu CEP?
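Table 2 capitalizes some identifiers (for example Birtplac), while the example filename BR-00130.birtplac.wav is lowercase. A lookup from utterance identification to prompt might therefore normalize case, as in the sketch below. Whether the files on disk are consistently lowercase is an assumption, the dictionary holds only a few rows of Table 2 for illustration, and the names `PROMPTS` and `prompt_for` are hypothetical.

```python
from typing import Optional

# A few rows of Table 2, keyed by lowercased filename identification.
PROMPTS = {
    "birtplac": "Onde você nasceu?",
    "city": "Qual a cidade em que você nasceu?",
    "color": "Qual a sua cor preferida?",
    "weekday": "Que dia da semana é hoje?",
}


def prompt_for(utterance_id: str) -> Optional[str]:
    """Return the prompt for an utterance identification, ignoring case.

    Returns None for identifiers not present in this partial table.
    """
    return PROMPTS.get(utterance_id.lower())
```

With this normalization, `prompt_for("Birtplac")` and `prompt_for("birtplac")` return the same prompt.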

 

The Center for Spoken Language Understanding (CSLU) distributes corpora to commercial entities and academic institutions for a fee. Commercial entities can use these corpora for research and also for creating commercial products, for example by generating acoustic models for speech recognition.

 

To place your order:

1. Click on the type of license you wish to order: "Academic or non-profit entity" or "Commercial entity".

2. Terms of the license agreement can be viewed by clicking on the word "terms".

3. You agree to the terms of the license agreement when you click on "Add to Order" and proceed to the next screen.

4. If information on the "Order Contents" screen is correct, press "Check out".

5. On the next screen, a brief "Intended Use" is required. For "Recipient Scientist Information", enter your own details or, if you are placing the order for another person, that person's details. We will use this information if we have questions about the order, payment, or shipping address.

6. Once OHSU has received and verified your payment, Technology Transfer & Business Development will approve your order, and the Center for Spoken Language Understanding will send out the DVD by FedEx within 5-10 business days.

 

For demos and more information, visit the CSLU Corpora website at:

http://www.cslu.ogi.edu/corpora/corpCurrent.html

 


For more information, contact:

Trina Voss
Technology Development Manager
503-494-9839

Option   Price
(terms)  $50.00
(terms)  $3000.00