User Adaptation of AAC Device Voices - Phase II

Augmentative and alternative communication devices with voice output (also known as Speech Generating Devices, or SGDs) enable individuals to speak by electronic means. Typical users of SGDs are individuals who have suffered a stroke or traumatic brain injury, or who have neurodegenerative or neurodevelopmental disorders. In most cases, the user was able to speak previously, or had or still has intermittent speech. In these cases, the user's relatives and friends are familiar with the user's voice. An often-expressed desire is for the SGD to sound like the user. However, typical SGDs do not mimic any characteristics of the user's speech; in fact, they typically offer an extremely limited array of synthetic voices, and the prosodic patterns of these voices are not customizable. As a result, the synthetic voice is impersonal, which may discourage impaired speakers and their communication partners from using the SGD. To address this impersonal character of current SGDs, we propose a system with a wide range of personal customization options, making use of (1) a substantial number of synthetic voices to choose from; (2) customizable prosody; and, most importantly, (3) Speaker Mimicry (SM) technology to mimic the user.

Phase I of this project established the feasibility of using SM technology to adapt an existing Text-to-Speech (TTS) synthesis system to mimic a specific target speaker, requiring only a small set of "training" recordings to be made of the target speaker. Mimicry of the spectral aspects of the target speaker was achieved with two Voice Transformation (VT) technologies, one that required extremely few recordings that, moreover, did not need to be of high acoustic quality (hence, pre-morbid home videos could in principle be used); and a second one that required more and better-quality recordings, but also provided better results. Mimicry of the prosodic aspects of the target speaker (Prosody Mimicry, or PM) was achieved by estimating static and dynamic parameters of the target speaker's intonational and durational patterns, which were then incorporated into the TTS system.
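To illustrate the Prosody Mimicry idea described above, the sketch below estimates static and dynamic pitch parameters (level and range, in the log-F0 domain) plus a simple durational statistic from target-speaker recordings, and maps a synthetic F0 contour onto the target's parameters via z-score normalization. This is a minimal, hypothetical illustration of global prosody-parameter transfer; the function names and the specific statistics are our assumptions, and the actual PM models in the project are richer than this.

```python
import numpy as np

def estimate_prosody_params(f0_frames, phone_durations):
    """Estimate static (mean) and dynamic (spread) log-F0 statistics
    and a speaking-rate proxy from a target speaker's recordings.
    Illustrative only; a value of 0 marks an unvoiced frame."""
    voiced = f0_frames[f0_frames > 0]          # keep voiced frames only
    log_f0 = np.log(voiced)
    return {
        "f0_mean": log_f0.mean(),              # static pitch level
        "f0_std": log_f0.std(),                # pitch range (dynamics)
        "dur_mean": np.mean(phone_durations),  # mean phone duration
    }

def transfer_pitch(source_f0, src_params, tgt_params):
    """Map a synthetic F0 contour to the target speaker's pitch level
    and range by z-scoring in the log domain, leaving unvoiced
    frames (F0 == 0) untouched."""
    out = source_f0.astype(float).copy()
    voiced = out > 0
    z = (np.log(out[voiced]) - src_params["f0_mean"]) / src_params["f0_std"]
    out[voiced] = np.exp(z * tgt_params["f0_std"] + tgt_params["f0_mean"])
    return out
```

In this scheme the transformed contour keeps the synthetic voice's local pitch movements but adopts the target speaker's overall level and range, which is one simple way static and dynamic intonational parameters can be incorporated into a TTS system.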

The deliverables of this Phase II STTR project consist of: (1) A complete SM-capable SGD, comprising an SM-capable TTS system and a built-in touch-screen Graphical User Interface (GUI) for user input, installed on a low-cost dedicated touch-screen "netbook" (alternative keyboards and special input devices will also be supported); (2) Efficient software tools and processes to be used by BioSpeech technicians to compute the individual user data needed for the SM capability. The SM capability will have multiple options, depending on the availability, quantity, and quality of user recordings.

The goals of this Phase II proposal are to develop a complete prototype of this product concept, and to co-develop and field-test the system with a group of SGD users. Moreover, we will show that, even with these unique features, the system can be made commercially available at a far lower cost than most current SGDs, thanks to minimal ROI pressures and to the availability of low-cost "netbooks".

This is a joint project of BioSpeech and OHSU, and involves Esther Klabbers (PI), Alexander Kain, Jan van Santen, and Melanie Fried-Oken.

Funding Source

NIH NIDCD

Principal Investigators

Alexander Kain

Esther Klabbers-Judd

Jan van Santen

Research Team Member

Melanie Fried-Oken