Small Footprint Speech Synthesis (Small Footprint)

This NSF Small Business Technology Transfer Phase I project is led by Alexander Kain at Biospeech Inc., a CSLU startup, and Jan van Santen The project aims to develop and implement a new algorithm in the area of text-to-speech synthesis (TTS) that will lead to (i) dramatic decreases in disk and memory requirements at a given speech quality level and (ii) minimization of the amount of voice recordings needed to create a new synthetic voice.

Most current TTS systems operate by concatenating segments of recorded speech ([acoustic] units). A challenge for TTS is coarticulation: The dependency of the acoustic manifestations of a phoneme on its neighbors. Current TTS systems use multi-phone acoustic units such as diphones, which preserve coarticulatory patterns naturally present in speech. However, this approach requires a large amount of recordings and generates systems with large footprints.

Biospeech proposes a uniphone approach that addresses coarticulation processes with an explicit model. The method uses complex spectral vectors (basis vectors) representing brief segments of speech inside single phonemes, and decomposes these into two components: A formant vector and a spectral balance vector.

To generate speech, the formant and spectral balance vectors derived from the basis vectors corresponding to successive phonemes are subjected to separate--and hence generally asynchronous--interpolation operations using time varying weights; the formant and spectral balance vector trajectories thus created are re-combined to create a trajectory in complex spectral space; finally, this trajectory is converted into output speech with the inverse Fourier transform. Asynchronicity is necessitated by the quasi-independence of articulators underlying different spectral features (e.g., frication, formant frequencies).

The proposed work has implications for other speech technologies, including Automatic Speech Recognition (ASR). Current ASR technologies address coarticulation by using multi-phone units, typical triphones. The number of triphones in English is over 70,000, and thus requires a large amount of training recordings. The proposed model could dramatically impact on the amount of recordings required for system training.

Second, TTS has generally recognized societal benefits for universal access, education, and information access by voice. For example, TTS-based augmentative devices are available for individuals who have lost their voice; and reading machines for the blind have been available for several decades.

Third, the approach will make higher-quality TTS more available for smaller devices. For example, voice based caller ID on low-end mobile telephones is currently not possible due to memory limitations.

Fourth, it enables voice adaptation with a minimum of recordings. This will enable building personalized TTS systems for individuals with speech disorders who can only intermittently produce normal speech sounds or for individuals who are about to undergo surgery that will irreversibly alter their speech. The method proffered by Biospeech only requires recordings of valid samples of each of (less than 50) phonemes instead of each of (2000 or more) diphones.

Funding Source


Principal Investigators

Alexander Kain

Jan van Santen