BioMedical Evidence Graph (BMEG)
The BioMedical Evidence Graph (BMEG) integrates different types of biomedical data into a unified graph for efficient application of machine learning and discovery algorithms across heterogeneous data types. BMEG will leverage the petabytes of genomics data available for tumor samples from repositories like the National Cancer Institute’s Genomic Data Commons to predict drug sensitivity, patient outcomes, and other clinically relevant phenotypes.
The BMEG data model is instantiated in a scalable graph database optimized for storing and querying graphs containing terabytes of vertices and edges distributed across a multi-machine cluster. This graph is the store of record for the BMEG. It maintains the connections between projects, donors, samples, molecular data, and treatment evidence, and ensures that these entities are associated correctly.
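To make the idea of linked projects, donors, samples, molecular data, and treatment evidence concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions: the `EvidenceGraph` class, the vertex labels, the edge labels, and the identifiers are all hypothetical and are not the BMEG schema or its query API, and a real deployment would use a distributed graph database rather than Python dictionaries.

```python
from collections import defaultdict, deque


class EvidenceGraph:
    """Toy labeled graph (hypothetical; not the BMEG API or schema)."""

    def __init__(self):
        self.labels = {}                     # vertex id -> vertex label
        self.out_edges = defaultdict(list)   # vertex id -> [(edge label, target id)]

    def add_vertex(self, vid, label):
        self.labels[vid] = label

    def add_edge(self, src, edge_label, dst):
        # Refuse dangling edges so entities stay correctly associated.
        if src not in self.labels or dst not in self.labels:
            raise KeyError("both endpoints must exist before they are linked")
        self.out_edges[src].append((edge_label, dst))

    def reachable(self, start, target_label):
        """Breadth-first search for vertices with target_label reachable from start."""
        seen = {start}
        queue = deque([start])
        found = []
        while queue:
            current = queue.popleft()
            for _, dst in self.out_edges[current]:
                if dst not in seen:
                    seen.add(dst)
                    queue.append(dst)
                    if self.labels[dst] == target_label:
                        found.append(dst)
        return found


# Hypothetical identifiers, chosen only for illustration.
graph = EvidenceGraph()
graph.add_vertex("project:P1", "Project")
graph.add_vertex("donor:D1", "Donor")
graph.add_vertex("sample:S1", "Sample")
graph.add_vertex("variant:V1", "MolecularData")
graph.add_vertex("evidence:E1", "TreatmentEvidence")

graph.add_edge("project:P1", "has_donor", "donor:D1")
graph.add_edge("donor:D1", "has_sample", "sample:S1")
graph.add_edge("sample:S1", "has_variant", "variant:V1")
graph.add_edge("variant:V1", "has_evidence", "evidence:E1")

evidence_for_project = graph.reachable("project:P1", "TreatmentEvidence")
```

Walking from a project vertex to treatment evidence in a single traversal is the kind of cross-entity query that a graph representation makes straightforward. The `add_edge` check shows one simple way a store could refuse associations between entities that do not exist.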
Galaxy is a scientific analysis workbench used by thousands of scientists worldwide to analyze genomic, proteomic, imaging, and other large biomedical datasets. Galaxy’s user-friendly, web-based interface makes it possible for anyone, regardless of their informatics expertise, to create, run, and share large-scale, robust, and reproducible analyses. Galaxy accelerates biomedical research by bringing together tool developers and end users such as bench scientists and physician-researchers. There are more than 5,000 analysis tools available in Galaxy’s ToolShed, and users run more than 200,000 analyses each month on Galaxy’s main public server. OHSU’s precision cancer medicine programs use Galaxy to run clinical and research genomics analyses as well as machine learning workflows. Galaxy is funded by both NIH and NSF.
Rail-RNA is software for analysis of RNA sequencing (RNA-seq) data. Its distinguishing features are:
* **Scalability**. Built on MapReduce, the software scales to analyze hundreds of RNA-seq samples at the same time.
* **Reduced redundancy**. The software identifies and eliminates redundant alignment work, making the end-to-end analysis time per sample *decrease* for fixed computer cluster size as the number of samples increases.
* **Integrative analysis**. The software borrows strength across replicates to achieve more accurate splice junction detection, especially in genomic regions with low coverage.
* **Mode agnosticism**. The software integrates its own parallel abstraction layer that allows it to be run in various distributed computing environments, including the Amazon Web Services (AWS) Elastic MapReduce (EMR) service, or any distributed environment supported by Python, including clusters using batch schedulers like PBS or SGE, Message Passing Interface (MPI), or any cluster with a shared filesystem and mutual SSH access. Alternately, Rail-RNA can be run on a single multi-core computer, without the aid of a batch system or MapReduce implementation.
* **Inexpensive cloud implementation**. An EMR run on more than about 100 samples costs roughly $1 per sample with spot instances.
* **Secure analysis of dbGaP-protected data on EMR**. See this guide for information on setup.
Together with collaborators at Johns Hopkins University, we have used Rail-RNA to reanalyze over 70,000 human RNA-seq samples so far, including publicly available samples on the Sequence Read Archive (SRA) as well as controlled-access samples from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project. Expression information across these samples at the gene, exon, and exon-exon junction levels is collected into the resource recount2, which has an accompanying R/Bioconductor package.
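The reduced-redundancy feature listed above can be illustrated with a small Python sketch. It is a simplified illustration, not Rail-RNA's actual implementation: the function `align_unique`, the toy aligner `toy_align`, the reference string, and the reads are all hypothetical. The idea shown is to pool reads across samples, align each distinct sequence only once, and then copy the result back to every read that carries that sequence.

```python
from collections import defaultdict


def align_unique(reads_by_sample, align):
    """Align each distinct read sequence once and share the result.

    reads_by_sample maps a sample name to a list of (read id, sequence) pairs.
    Returns (results, calls), where results maps (sample, read id) to the
    alignment and calls counts how many times the aligner was invoked.
    """
    # Group reads by sequence so identical sequences are aligned together.
    owners = defaultdict(list)
    for sample, reads in reads_by_sample.items():
        for read_id, seq in reads:
            owners[seq].append((sample, read_id))

    results = {}
    calls = 0
    for seq, carriers in owners.items():
        hit = align(seq)   # the expensive step, done once per distinct sequence
        calls += 1
        for sample, read_id in carriers:
            results[(sample, read_id)] = hit
    return results, calls


# Toy stand-in for a real aligner: exact substring search in a tiny reference.
REFERENCE = "ACGTACGTTAGC"


def toy_align(seq):
    return REFERENCE.find(seq)


reads = {
    "sample1": [("r1", "ACGT"), ("r2", "TAGC")],
    "sample2": [("r1", "ACGT"), ("r2", "GTTA")],
}
results, calls = align_unique(reads, toy_align)
total_reads = sum(len(sample_reads) for sample_reads in reads.values())
```

In this toy input, four reads need only three aligner calls because both samples share the sequence `ACGT`. As more samples are pooled, the share of sequences that have already been seen can grow, which is one way to understand how the alignment work per sample can shrink as the number of samples increases.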
Predictors of Cellular Phenotypes to guide Therapeutic Strategies (PRECEPTS) is a set of related analytical packages that can be used in tandem to identify transcriptional programs downstream of cancer-driving events from coordinated genomic and expression profiles. The six driving events and types we seek to identify and predict are:
* **Recurrent mutations in cancer driving genes**
* **Mutually exclusive modules of cancer driving genes**
* **Transcription factor activity**
* **Mutually inhibiting transcription factors**
* **Network enrichment and decoupling**
* **Drug sensitivity**
PRECEPTS is currently in early development.
Pathway Commons is a collection of publicly available pathway information from multiple organisms. It provides researchers with convenient access to a comprehensive collection of biological pathways from multiple sources, represented in a common language for gene and metabolic pathway analysis. Access is via a web portal for query and download. Database providers can share their pathway data via a common repository, avoiding duplication of effort and reducing software development costs. Bioinformatics software developers can increase efficiency by sharing pathway analysis software components. Pathways can include biochemical reactions; complex assembly; transport and catalysis events; physical interactions involving proteins, DNA, RNA, small molecules, and complexes; gene regulation events; and genetic interactions involving genes.
Quantitative Image Analysis for multiplex IHC (and cyclic IF)
We successfully developed multiplexed immunohistochemistry
(IHC) technology which allows evaluation of multiple protein biomarkers in a
single FFPE tissue section and demonstrated that immune complexity stratifies
response to vaccination therapy in PDAC. However, interpretation of the serial
images output by the multiplex IHC method entails several challenges. This
project aims to refine and rigorously validate our technologies as well as
develop enhanced analytical capabilities addressing current limitations.
Precision Cancer Medicine Informatics
We are developing data analysis methods and data management
software to store, analyze, and integrate clinical, imaging, and molecular data
for (1) treating cancer using precision therapies adapted over time; and (2)
discovering and understanding mechanisms of resistance in cancer. This initiative
brings together and advances many areas, including (a) development of
computational analysis workflows to identify key biomarkers such as somatic
mutations, gene expression, pathway activity, and tumor composition; (b) using
public datasets in genomics, transcriptomics, and biological pathways together
with patient data to correlate biomarkers with prognosis and predict
therapeutic response; and (c) producing patient reports and interactive
visualizations that provide precision therapy recommendations based on
consensus amongst methods and enable differential analysis across timepoints.
Key software used in this work includes LabKey for
data management and visualization, G2P for finding key biological and
clinically actionable biomarkers, and Galaxy for
analysis workflow creation and execution.
Genotype-to-Phenotype Database (G2P)
G2P is an aggregate public clinical cancer knowledge base
for storing and searching connections between genomic biomarkers (“genotypes”)
and patient diagnosis, prognosis, and response to treatment (“phenotypes”). Key
uses of G2P include (a) searching by somatic variant to find drugs known to
lead to response or resistance in tumors with the variant; (b) searching by
drug to identify different mutations in which it can lead to response; (c)
searching clinical trials to find those associated with particular biomarkers
or drugs. G2P combines biomarker-phenotype associations from 9 trusted and
curated knowledge bases, including CIViC, OncoKB, PMKB, JAX CKB, and the Cancer Genome Interpreter. Clinical trials data from several sources is also
included. Users can perform full-text search on
G2P and filter results using a web portal with intuitive visualizations. Code