I specialize in building up production bioinformatics teams and managing all aspects of pipeline development, testing, deployment (cloud or on-prem), data lifecycle management and strategy, and building and integrating systems for transparency, provenance, and coordination with other teams. I consider reproducibility in science to be a moral imperative, I view a healthy and supportive team culture as essential, and I am committed to assiduous advocacy for my reports.
I lead our production bioinformatics efforts at 54gene, including deployment and administration of an on-demand virtual HPC, data lifecycle management, testing and development of production bioinformatics workflows, and benchmarking both for computational performance and for biological accuracy and precision. I coordinate data cleaning and reporting for patient consent forms, questionnaire results, and genomic data, and work closely with other stakeholders within the company, from our lab group to our epidemiologists to senior executives. I set team policy and best practices, write SOPs, and use ticket tracking for transparency and provenance.
I managed a team of six bioinformaticians at the Cancer Genomics Research Laboratory in support of the ~60 investigators in the Division of Cancer Epidemiology and Genetics, NCI, NIH. Our team conducted all pipeline development, including short- and long-read WES/WGS/targeted DNA-sequencing analysis, germline and somatic variant detection, structural variation analysis, microbiome (16S/ITS and metagenomics), GWAS (including association testing and meta-analysis), CNV analysis, and detection of germline mosaicism. We also supported ad-hoc downstream analysis needs. We emphasized personal accountability, clear and frequent communication, and reproducibility in our work.
I analyzed next-generation sequencing data to identify medically actionable somatic mutations in tumors, and generated reports for physicians, academic institutions, and pharmaceutical clients.
I explored the molecular basis of cancer susceptibility through whole exome sequencing and SNP array genotyping of individuals with genomic instability syndromes, in both family- and population-based studies.
I completed my thesis work in Dr. Vicki Lundblad's laboratory, where I explored mechanisms by which telomeres are protected from being misinterpreted by the cell as DNA breaks.
R library that facilitates highly configurable, flexible, reproducible cleaning of arbitrary input phenotype data (e.g. health questionnaires and lab test results), including application of consent and age thresholds, and generates an output report describing every input variable.
Reproducible, scalable, and robust Snakemake pipeline for germline variant calling from WGS data. Handles dependency management, exposes critical parameters in user-configurable YAML, and can be deployed to any infrastructure that supports Snakemake. Multiple run modes permit different use cases, e.g. a quick turnaround of read quality metrics to the lab, a more thorough per-flowcell calling and subject-level QC, and joint calling from gVCFs across multiple flowcells.
Introduction to exploratory data analysis with the Python package pandas, presented in an interactive Jupyter notebook accessible via either Binder or Google Colab.
Feedback from this workshop: "Your workshop ... received the highest scores of all our events to date! You are a Cancer Data Science Star Instructor and we cannot thank you enough for your thoughtful, tailored, engaging presentation!"
Introduction to building genomics pipelines with Snakemake. Slides available as a PDF; interactive Jupyter notebook available via Binder.
A modular Snakemake-based workflow to coordinate germline, de novo, and somatic SV calling with multiple callers. Caller output is harmonized and annotated for genomic context (e.g. segmental duplications, proximity to telomere/centromere sequence, genes and transcripts) and compared to SVs in public databases including DGV, ClinVar, ClinGen, and 1000 Genomes.
The production pipeline used at CGR (Cancer Genomics Research Lab, NCI) for analysis of 16S microbiome sequencing data.