High-Quality Compression, Enhancement, and Personalization of Text-to-Speech Voices

The vast variability of the human speech signal remains a central challenge for Text-to-Speech (TTS) systems. The objective of this research is to develop TTS technologies that focus on elimination of concatenation errors, and accurate speech modifications in the areas of coarticulation, degree of articulation, prosodic effects, and speaker characteristics. The investigators are exploring an asynchronous interpolation model (AIM), which promises to provide for high-quality and flexible TTS. The core idea of AIM is to represent a short region of speech as a composition of several types of features called streams. Each stream is computed by asynchronous interpolation of basis vectors.

Each basis vector is associated with a particular phoneme, allophone, or more specialized unit. Thus, the speech region is described by the varying degrees of influence of several types of preceding and following acoustic features. Using AIM, the investigators are also developing methods to optimally compress the acoustic inventories of TTS systems, given a size or a quality constraint, and to adapt the system to a new voice, given a few training samples. The system being researched forms a hybrid between traditional concatenative and formant-based synthesis, having advantages of both, resulting in a high-quality, optimized TTS system with voice adaptation capabilities. TTS has generally recognized societal benefits for universal access, education, and information access by voice. Our research will make it possible, for example, to build personalized TTS systems for individuals with speech disorders who can only intermittently produce normal speech sounds.

Funding source

NSF IIS

Principal Investigators

Alexander Kain

Todd Leen