Modeling Coarticulation for Automatic Speech Recognition

This project focuses on applying a model used in text-to-speech synthesis (TTS) to the task of automatic speech recognition (ASR). The standard method in ASR for addressing variability due to phonemic context, or coarticulation, requires a large amount of training data and is sensitive to differences between training and testing conditions. Despite the effective use of stochastic models, current ASR systems are often unable to sufficiently account for the large degree of variability observed in speech. In many cases, this variability is not due to random factors, but is due to predictable changes in the speech signal. These factors are currently modeled in order to generate speech via TTS, but they are not yet modeled in order to recognize speech, largely because of non-local dependencies. We apply the Asynchronous Interpolation Model (AIM) used in TTS to the task of speech recognition, by decomposing the speech signal into target vectors and weight trajectories, and then searching weight-trajectory and stochastic target-vector models for the highest-probability match to the input signal.

The goal of this research is improve the robustness of ASR to variability that is due to phonemic and lexical context. This improvement will increase the use of ASR technology in automated information access by telephone, educational software, and universal access for individuals with visual, auditory, or speech-production challenges. More effective models of coarticulation may increase our understanding of both human speech perception and speech production.

Funding Source

NSF IIS