Discriminative Syntactic Language Modeling: Automatic Feature Selection and Efficient Annotation

This proposal focuses on the effective use of parser-derived and tagger-derived features within discriminative approaches to language modeling for automatic speech recognition. Discriminative language modeling approaches provide great flexibility in defining features, but the size of the potential parser-derived feature space requires efficient feature annotation and selection algorithms. The project has four specific aims. The first aim is to develop a set of efficient, general, and scalable syntactic feature selection algorithms for use with various kinds of annotation and several parameter estimation techniques. The second aim is to develop general tree and grammar transformation algorithms that preserve selected feature annotations while enabling faster parsing, or even tagging approximations to parsing. The third aim is to evaluate a broad range of feature selection and grammar transformation approaches on a large vocabulary continuous speech recognition (LVCSR) task, namely Switchboard. The final aim is to design and package the algorithms so that they straightforwardly support future research into other applications, such as machine translation (MT), and into other languages, such as Chinese and Arabic. The algorithms developed as part of this project are expected to contribute to improvements in LVCSR accuracy and in applications that rely upon this technology. The algorithms are being packaged into a publicly available software library, enabling researchers working in many application areas, including LVCSR and MT, and in various languages to investigate best practices in syntactic language modeling for their specific task, without having to hand-select and evaluate feature sets.

Funding source

NSF IIS