Hadoop Cluster Acquisition, Deployment, and Training for Speech and Language Processing

This project aims to educate, train, and equip graduate students in what is becoming a critical paradigm in speech and natural language processing (NLP): distributed algorithms, a paradigm pursued by Google with great success. Distributed processing is more than a computational convenience; it is an approach to designing algorithms that exploit such an environment from the outset, yielding large improvements over standard algorithms simply run in parallel.
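
As an illustration of the paradigm (a sketch, not part of the proposed research): in Hadoop's MapReduce model an algorithm is expressed as a map step that emits key-value pairs and a reduce step that aggregates them, with the framework handling partitioning, scheduling, and data movement across the cluster. The canonical word-count job, written against Hadoop's Java API, looks roughly as follows:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map: emit (token, 1) for every token in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts emitted for each token; also usable as a combiner.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The point of the example is that nothing in the mapper or reducer refers to a particular machine or data partition; the algorithm is designed for the shared-nothing environment from the start rather than parallelized after the fact.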

The objectives of this institutional infrastructure proposal are to (1) acquire a 384-core processor cluster for use as a Hadoop cluster at the Center for Spoken Language Understanding (CSLU) at Oregon Health & Science University (OHSU); (2) integrate the cluster into the existing computing infrastructure; and (3) develop educational resources (tutorials, lab sessions, course modules, and seminars) covering both "how-to" information for using the Hadoop cluster and more general topics in algorithms for distributed computing. At CSLU -- part of the Department of Biomedical Engineering at OHSU -- nearly all research problems fall within the scope of basic or applied NLP or speech processing.

Beyond training graduate students through coursework and research projects, the infrastructure created in this project will contribute to advances in speech processing and NLP, as well as in applications that build on these technologies, including national defense applications in text and speech mining and biomedical applications. It will enable at least eight funded NSF projects and at least five projects funded by other agencies to pursue novel directions in their data analysis.

Funding source

NSF CNS