Objective Methods for Predicting and Optimizing Synthetic Speech Quality

The project focuses on how humans perceive acoustic discontinuities in speech. Current text-to-speech synthesis ("TTS") technology operates by retrieving intervals of stored digitized speech ("units") from a database and splicing ("concatenating") them to form the output utterance. Unavoidably, there are acoustic discontinuities at the time points where successive speech intervals meet. An unsolved problem is how to predict, from the quantitative acoustic properties of two to-be-concatenated units, whether humans will hear a discontinuity. This is of immediate relevance for TTS systems that select units at run time from a large speech corpus. During selection, such a system searches the space of all possible unit sequences for the utterance and selects the sequence with the lowest overall objective cost, measured, for example, as the Euclidean distance between the final frame of one unit and the initial frame of the next. However, research has shown that this and related measures do not predict well whether humans will hear a discontinuity. By focusing explicitly on perceptually optimized objective cost measures, the current research will directly improve the perceptual accuracy of cost measures and hence synthesis quality.
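
The cost-based selection described above can be sketched as follows. This is a minimal illustrative example, not the project's actual method: units are represented as lists of acoustic feature frames (e.g., cepstral vectors), the join cost is the Euclidean distance between adjoining frames, and the search is a simplified dynamic-programming (Viterbi-style) pass; all names and the data layout are assumptions for illustration.

```python
import math

def join_cost(unit_a, unit_b):
    """Euclidean distance between the final frame of unit_a and the
    initial frame of unit_b -- the kind of objective cost measure
    whose perceptual accuracy the project examines."""
    last, first = unit_a[-1], unit_b[0]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(last, first)))

def best_sequence(candidates):
    """Given candidate units for each target position, return the indices
    of the sequence with the lowest total join cost, found by dynamic
    programming.  (Real systems also add per-unit 'target' costs.)"""
    cost = [[0.0] * len(candidates[0])]   # cost[i][j]: best total cost ending at candidate j
    back = []                             # backpointers for traceback
    for i in range(1, len(candidates)):
        row, brow = [], []
        for unit in candidates[i]:
            prev = range(len(candidates[i - 1]))
            k = min(prev, key=lambda k: cost[i - 1][k]
                    + join_cost(candidates[i - 1][k], unit))
            row.append(cost[i - 1][k] + join_cost(candidates[i - 1][k], unit))
            brow.append(k)
        cost.append(row)
        back.append(brow)
    # Trace back the cheapest path from the best final candidate.
    j = min(range(len(candidates[-1])), key=lambda k: cost[-1][k])
    path = [j]
    for brow in reversed(back):
        j = brow[j]
        path.append(j)
    return list(reversed(path))
```

For example, with two candidate units per position, the search picks the pair whose adjoining frames match most closely, even though (as the project's premise holds) a low Euclidean join cost need not mean the splice is perceptually seamless.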

Funding source

NSF IIS

Principal Investigator

Jan van Santen