Efficient Hidden Structure Annotation Via Structural Multiple-Sequence Alignments

This project develops finite-state syntactic processing models for natural language that use features encoding global structural constraints, derived through multiple sequence alignment (MSA) techniques, to significantly improve accuracy without expensive context-free inference. MSAs are widely used in computational biology to build finite-state models that capture long-distance dependencies in sequences (e.g., in RNA secondary structure). Given a large set of functionally aligned sequences in MSA format, finite-state models can be constructed that efficiently align new sequences with that MSA. In natural language processing (NLP), MSA techniques have been used only rarely, and then to characterize phonetic or semantic similarity. This project explores a purely syntactic functional alignment between semantically unrelated strings from the same language, which is used to define a structural MSA from which finite-state syntactic models are constructed. The project has two specific aims. The first is to develop natural language sequence processing algorithms and models that can: a) define sequence alignments with respect to syntactic function; b) build structural MSAs from the defined functional alignments; c) derive finite-state models that efficiently align new sequences with the built MSA; and d) extract features from an alignment with the MSA for improved sequence modeling. The second is to empirically validate this approach in a number of large-scale text processing applications across multiple domains and languages. The resulting algorithms are expected to yield improved finite-state natural language models that contribute to the state of the art in critical text processing applications.
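
To make steps (c) and (d) of the first aim more concrete, the following is a minimal illustrative sketch, not the project's actual method. It assumes a toy MSA over part-of-speech tags, summarizes each column as a smoothed tag distribution (a simple position-specific profile), and aligns a new tag sequence to those columns with standard global dynamic programming and a linear gap penalty. The names `build_profile` and `align_to_profile`, the tag set, the pseudocount, and the gap penalty are all hypothetical choices made for illustration; the column index assigned to each token stands in for the kind of alignment-derived feature the abstract describes.

```python
import math

GAP = "-"

def build_profile(msa, alphabet, pseudocount=1.0):
    """Return one dict per MSA column: symbol -> smoothed log-probability.

    Gap cells are ignored when counting (an assumption of this sketch).
    """
    profile = []
    for j in range(len(msa[0])):
        counts = {a: pseudocount for a in alphabet}
        for row in msa:
            if row[j] != GAP:
                counts[row[j]] += 1
        total = sum(counts.values())
        profile.append({a: math.log(c / total) for a, c in counts.items()})
    return profile

def align_to_profile(seq, profile, gap_penalty=-2.0):
    """Globally align seq to the profile columns.

    Returns, for each token, the index of the column it is matched to,
    or None if the token is treated as an insertion. Every symbol in seq
    must belong to the alphabet used to build the profile.
    """
    n, m = len(seq), len(profile)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap_penalty, "ins"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap_penalty, "del"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (score[i - 1][j - 1] + profile[j - 1][seq[i - 1]], "match"),
                (score[i - 1][j] + gap_penalty, "ins"),  # token outside any column
                (score[i][j - 1] + gap_penalty, "del"),  # column left unfilled
            ]
            score[i][j], back[i][j] = max(options)
    # Trace back from the bottom-right cell to recover the column assignments.
    columns = [None] * n
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            columns[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif move == "ins":
            i -= 1
        else:
            j -= 1
    return columns

# Toy structural MSA over POS tags (hypothetical data, for illustration only).
alphabet = ["DT", "JJ", "NN", "VBZ", "VBD"]
msa = [
    ["DT", "JJ", "NN", "VBZ", "-",  "NN"],
    ["DT", "-",  "NN", "VBZ", "DT", "NN"],
    ["DT", "JJ", "NN", "VBD", "DT", "NN"],
]
profile = build_profile(msa, alphabet)
features = align_to_profile(["DT", "NN", "VBD", "NN"], profile)
print(features)
```

In this sketch the per-token column indices could be passed as extra features to a finite-state tagger, so that tokens in different sentences that fill the same structural slot share a feature value. A model of the kind the abstract describes would presumably use a richer finite-state formulation than this quadratic-time table, so the code should be read only as an illustration of the alignment idea.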

Funding source