Synthesis and Perception of Speaker Identity

The capability of a Text-to-Speech (TTS) synthesis system to generate speech that sounds like that of a specific individual (Speaker Identity Synthesis, or SIS) has numerous applications, including (1) the creation of personalized voices for individuals with neurodegenerative disorders who anticipate becoming users of Speech Generating Devices (SGDs) in the future, (2) the entertainment industry, and (3) vanity voices for speech-enabled consumer products such as navigation systems and mobile telephones. Yet SIS is one of the lesser-studied topics in speech generation. This may be in part because it seems like a solved problem: Why not simply record the target speaker (i.e., the speaker we want to mimic) and use these recordings as the database for a standard unit-selection-based TTS system? Unit selection synthesis can mimic a speaker extremely well because both the vocal and prosodic characteristics of the speaker are captured completely by what is essentially a raw speech splicing method. However, none of the above applications would be practical with unit selection, because it requires a large quantity of recordings from a highly consistent, even talented, speaker. Other synthesis techniques have limitations as well; for example, conventional diphone synthesis and parametric synthesis use artificial prosody that does not resemble the prosody of any particular speaker. Thus, SIS is not a solved problem in any practical sense.

The core goal of this proposal is to develop methods for synthesis of speaker identity from a relatively small, realistic quantity of training recordings. This constraint not only forces us to create solutions relevant to real-world applications, but also leads us to develop and apply tools for exploring a key perceptual question: What speech features are critical for speaker identification by humans?

Funding source

NSF IIS

Principal Investigators

Alexander Kain

Esther Klabbers-Judd

Jan van Santen