BioMedical Evidence Graph (BMEG)

The BioMedical Evidence Graph (BMEG) integrates different types of biomedical data into a unified graph for efficient application of machine learning and discovery algorithms across heterogeneous data types. BMEG will leverage the petabytes of genomics data available for tumor samples from repositories like the National Cancer Institute’s Genomic Data Commons to predict drug sensitivity, patient outcomes, and other clinically relevant phenotypes.


The BMEG data model is instantiated in a scalable graph database optimized for storing and querying graphs containing terabytes of vertices and edges distributed across a multi-machine cluster. This graph is the store of record for the BMEG. It maintains the connections between projects, donors, samples, molecular data, and treatment evidence, and ensures that these entities are associated correctly.
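As a rough illustration of the kind of connections described above, the following minimal sketch builds a tiny in-memory property graph and walks from a project to its molecular data. It is not the BMEG schema or its query API: the class `EvidenceGraph`, the vertex labels, the edge names, and the identifiers are all hypothetical, chosen only to mirror the entities named in the text.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EvidenceGraph:
    """Tiny in-memory property graph; all labels and edge names are hypothetical."""
    # vertex id -> (label, properties)
    vertices: dict = field(default_factory=dict)
    # source vertex id -> list of (edge label, target vertex id)
    out_edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = (label, props)

    def add_edge(self, src, edge_label, dst):
        # Refuse dangling edges so that entities stay correctly associated.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError(f"unknown vertex in edge {src} -> {dst}")
        self.out_edges[src].append((edge_label, dst))

    def traverse(self, start, *edge_labels):
        """Follow a fixed sequence of edge labels, starting from one vertex."""
        frontier = [start]
        for wanted in edge_labels:
            frontier = [dst for vid in frontier
                        for label, dst in self.out_edges[vid] if label == wanted]
        return frontier


g = EvidenceGraph()
g.add_vertex("project:P1", "Project")
g.add_vertex("donor:D1", "Donor")
g.add_vertex("sample:S1", "Sample")
g.add_vertex("expr:E1", "GeneExpression")
g.add_edge("project:P1", "donors", "donor:D1")
g.add_edge("donor:D1", "samples", "sample:S1")
g.add_edge("sample:S1", "expressions", "expr:E1")

# Walk project -> donor -> sample -> molecular data.
result = g.traverse("project:P1", "donors", "samples", "expressions")
print(result)  # ['expr:E1']
```

A production graph database distributes vertices and edges across machines and indexes them; the sketch only shows why a single connected store makes cross-entity questions a matter of traversal.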



Galaxy

Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research that is:


* Accessible: Users without informatics experience can easily specify parameters and run tools and workflows using a web interface.

* Reproducible: Galaxy captures information so that any user can repeat and understand a complete computational analysis.

* Transparent: Users share and publish analyses via the web and create Pages: interactive, web-based documents that describe a complete analysis.

Galaxy is a joint project with the Nekrutenko Lab and Taylor Lab and is funded by NIH and NSF.




Rail-RNA

Rail-RNA is software for analysis of RNA sequencing (RNA-seq) data. Its distinguishing features are:

* **Scalability**. Built on MapReduce, the software scales to analyze hundreds of RNA-seq samples at the same time.

* **Reduced redundancy**. The software identifies and eliminates redundant alignment work, making the end-to-end analysis time per sample *decrease* for fixed computer cluster size as the number of samples increases.

* **Integrative analysis**. The software borrows strength across replicates to achieve more accurate splice junction detection, especially in genomic regions with low coverage.

* **Mode agnosticism**. The software integrates its own parallel abstraction layer that allows it to be run in various distributed computing environments, including the Amazon Web Services (AWS) Elastic MapReduce (EMR) service, or any distributed environment supported by Python, including clusters using batch schedulers like PBS or SGE, Message Passing Interface (MPI), or any cluster with a shared filesystem and mutual SSH access. Alternatively, Rail-RNA can be run on a single multi-core computer, without the aid of a batch system or MapReduce implementation.

* **Inexpensive cloud implementation**. An EMR run on more than roughly 100 samples costs about $1 per sample with spot instances.

* **Secure analysis of dbGaP-protected data on EMR**. See this guide for information on setup.
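The reduced-redundancy idea in the list above can be illustrated with a toy sketch: if the same read sequence occurs in several samples, it is aligned once and the result is shared. This is not Rail-RNA's actual algorithm or code. The function `align_unique_reads`, the stand-in aligner `toy_align` (exact substring search), and the example sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def align_unique_reads(samples, align):
    """Align each distinct read sequence once and share the hit across samples.

    samples: dict mapping sample name -> list of read sequences
    align:   callable mapping a sequence to an alignment result (placeholder)
    Returns (per-sample alignments, number of aligner calls made).
    """
    # Record which samples each distinct sequence came from (one entry per read).
    owners = defaultdict(list)
    for name, reads in samples.items():
        for seq in reads:
            owners[seq].append(name)

    per_sample = defaultdict(list)
    calls = 0
    for seq, names in owners.items():
        hit = align(seq)  # the expensive step runs once per distinct sequence
        calls += 1
        for name in names:
            per_sample[name].append((seq, hit))
    return dict(per_sample), calls


reference = "ACGTACGTTTGACCA"


def toy_align(seq):
    # Hypothetical aligner: leftmost exact match position, or -1 if absent.
    return reference.find(seq)


samples = {"s1": ["ACGT", "TTGA", "ACGT"], "s2": ["ACGT", "GACC"]}
per_sample, calls = align_unique_reads(samples, toy_align)
total_reads = sum(len(reads) for reads in samples.values())
print(calls, total_reads)  # 3 5
```

Because reads shared between samples are aligned only once, the number of aligner calls grows more slowly than the total number of reads, which is one way per-sample analysis time can fall as more samples are added.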

Together with collaborators at Johns Hopkins University, we have used Rail-RNA to reanalyze over 70,000 human RNA-seq samples so far, including publicly available samples on the Sequence Read Archive (SRA) as well as controlled-access samples from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project. Expression information across these samples at the gene, exon, and exon-exon junction levels is collected into the resource recount2, which has an accompanying R/Bioconductor package.




Predictors of Cellular Phenotypes to guide Therapeutic Strategies (PRECEPTS)

Predictors of Cellular Phenotypes to guide Therapeutic Strategies (PRECEPTS) is a set of related analytical packages that can be used in tandem to identify transcriptional programs downstream of cancer-driving events from coordinated genomic and expression profiles. The six types of driving events we seek to identify and predict are:

* **Recurrent mutations in cancer-driving genes**

* **Mutually exclusive modules of cancer-driving genes**

* **Transcription factor activity**

* **Mutually inhibiting transcription factors**

* **Network enrichment and decoupling**

* **Drug sensitivity**

PRECEPTS is currently in early development.


Pathway Commons

Pathway Commons is a collection of publicly available pathway information from multiple organisms. It provides researchers with convenient access to a comprehensive collection of biological pathways from multiple sources, represented in a common language for gene and metabolic pathway analysis. Access is via a web portal for query and download. Database providers can share their pathway data via a common repository, avoiding duplication of effort and reducing software development costs. Bioinformatics software developers can increase efficiency by sharing pathway analysis software components. Pathways can include biochemical reactions; complex assembly; transport and catalysis events; physical interactions involving proteins, DNA, RNA, small molecules, and complexes; gene regulation events; and genetic interactions involving genes.


Quantitative Image Analysis for multiplex IHC (and cyclic IF)

We developed a multiplexed immunohistochemistry (IHC) technology that allows evaluation of multiple protein biomarkers in a single formalin-fixed, paraffin-embedded (FFPE) tissue section, and demonstrated that immune complexity stratifies response to vaccination therapy in pancreatic ductal adenocarcinoma (PDAC). However, interpreting the serial images output by the multiplex IHC method entails several challenges. This project aims to refine and rigorously validate our technologies and to develop enhanced analytical capabilities that address current limitations.