Responsibilities
- Design and build data architecture that transforms raw and processed omics data into harmonized, AI-consumable layers
- Build and optimize ETL/ELT pipelines that produce denormalized views, pre-computed aggregations, embedding-ready text representations, and feature stores optimized for AI consumption
- Implement data quality monitoring, automated profiling, and validation checks across harmonization layers
- Create versioned, reproducible data snapshots that support model training, evaluation, and audit requirements in a regulated environment
- Partner with teams to extend harmonization patterns as modalities expand beyond genomics and proteomics into spatial transcriptomics, Perturb-Seq, single-cell, and digital pathology
- Design and maintain a semantic layer over multi-omics databases that enables AI systems to query them accurately
- Create schema documentation: table descriptions, column-level annotations, relationship mappings, business logic rules, and domain-specific constraints (a sketch of such annotations follows this list)
- Develop gold-standard question/SQL pairs for major databases, in partnership with computational biologists and Generative AI Engineers, for use in training, few-shot prompting, and evaluation benchmarks (see the example pair after this list)
- Build and maintain a data dictionary and ontology mapping layer that translates scientific terms (gene names, pathways, assay types) to their physical storage locations (see the mapping sketch after this list)
- Build and manage vector embedding pipelines for scientific documents, study metadata, and structured data descriptions to power RAG retrieval (a pipeline sketch follows this list)
- Build integration pipelines connecting heterogeneous sources (omics DBs, internal publications, ELNs, assay results, clinical annotations) into a unified queryable layer
- Develop and enforce metadata standards so new sources are AI-accessible from ingestion
- Design data products for multiple consumption patterns: direct SQL, ML training feeds, and semantic interfaces for LLM tools
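As an illustration of the semantic-layer and schema-documentation responsibilities above, here is a minimal sketch of a column-level annotation record. The table, column, units, and rules are invented for illustration and do not describe any actual schema.

```python
# Hypothetical column-level annotation record for a semantic layer.
# All table, column, and rule names are illustrative, not a real schema.
column_annotation = {
    "table": "gene_expression",
    "column": "tpm",
    "description": "Transcripts per million, normalized within each sample",
    "unit": "TPM",
    "business_rules": [
        "Values are non-negative",
        "NULL means the gene was not assayed, not zero expression",
    ],
    "joins_to": [{"table": "samples", "on": "sample_id"}],
}
```

Records like this can be compiled into prompts or retrieval context so an LLM sees the same conventions a human analyst would.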
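A gold-standard question/SQL pair might look like the sketch below. The schema (gene_expression, samples, hgnc_symbol, tpm) is hypothetical, carried over from the annotation example above; only the shape of the record matters.

```python
# Hypothetical evaluation record pairing a natural-language question with
# its reference SQL. Table and column names are invented for illustration.
eval_pair = {
    "question": "What is the mean TPM of TP53 across liver samples?",
    "sql": """
        SELECT AVG(e.tpm) AS mean_tpm
        FROM gene_expression AS e
        JOIN samples AS s ON s.sample_id = e.sample_id
        WHERE e.hgnc_symbol = 'TP53'
          AND s.tissue = 'liver'
    """,
    "tags": ["aggregation", "single-join", "genomics"],
}
```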
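The ontology mapping layer could store records along these lines; HGNC:11998 is the real HGNC identifier for TP53, but the physical storage coordinates are hypothetical.

```python
# Sketch of an ontology-mapping record resolving a scientific term to its
# physical storage location. Catalog/schema/table names are hypothetical.
ontology_mapping = {
    "term": "TP53",
    "term_type": "gene",
    "ontology": "HGNC",
    "ontology_id": "HGNC:11998",
    "synonyms": ["p53", "tumor protein p53"],
    "physical_location": {
        "catalog": "omics",
        "schema": "genomics",
        "table": "gene_expression",
        "column": "hgnc_symbol",
    },
}
```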
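Finally, one plausible shape for the embedding pipeline feeding RAG retrieval, sketched with the sentence-transformers library; the model choice and the document shape are assumptions, and the actual stack would depend on the team's tooling.

```python
# Minimal embedding-pipeline sketch: attach a vector to each document record
# so the rows can be upserted into a vector database. Model choice and the
# 'id'/'text'/'source' document shape are assumptions for illustration.
from sentence_transformers import SentenceTransformer

def embed_documents(docs: list[dict]) -> list[dict]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
    vectors = model.encode(
        [d["text"] for d in docs],
        normalize_embeddings=True,  # unit-length vectors for cosine search
    )
    return [
        {"id": d["id"], "embedding": v.tolist(), "source": d.get("source")}
        for d, v in zip(docs, vectors)
    ]
```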
Qualifications
- BS in Computer Science, Data Engineering, Bioinformatics, or a related field plus 8 years of data engineering experience, OR MS plus 5 years of data engineering experience
- Demonstrated expertise building data pipelines, ETL/ELT workflows, and data products serving downstream AI/ML systems
Additional Skills/Preferences
- PhD in data science or a related field
- Strong SQL and experience with complex relational schemas (hundreds of tables, multi-level joins, domain conventions)
- Experience with lakehouse platforms (Databricks, Snowflake, or equivalent)
- Experience with dbt, Spark, Airflow, or similar orchestration/transformation frameworks
- Proficiency in Python for data processing and pipeline development
- Cloud data platform experience (AWS preferred: Redshift, Athena, Glue, S3, etc.)
- Familiarity with vector databases, embedding pipelines, or semantic layer tooling
- Strong communication skills across engineering and scientific audiences
- Biomedical/scientific data experience (omics: RNA-seq, proteomics, GWAS; clinical data; LIMS)
- Experience in pharma/biotech/life sciences
- Familiarity with biomedical ontologies/controlled vocabularies (Gene Ontology, MeSH, ChEBI, HGNC)
- Experience building AI/ML-serving data products (feature stores, training datasets, evaluation benchmarks, semantic annotations for text-to-SQL)
- Knowledge of data governance in regulated industries (lineage, access controls, versioning, auditability)
- Experience with knowledge graph technologies (Neo4j, Amazon Neptune, RDF/SPARQL) or graph data modeling
- Deep Databricks ecosystem experience (Unity Catalog, Delta Lake, MLflow, Databricks SQL)
- Experience designing architectures bridging Nextflow/R/Bioconductor workflows with lakehouse consumption patterns
Benefits (as stated)
- Company bonus (for eligible full-time equivalent employees)
- Comprehensive benefit program, with eligibility for:
  - Company-sponsored 401(k) and pension
  - Vacation and time off/leave of absence benefits
  - Medical, dental, vision, and prescription coverage
  - Flexible benefits (e.g., healthcare and/or dependent day care FSA)
  - Life insurance/death benefits
  - Well-being benefits (e.g., employee assistance program, fitness benefits, employee clubs/activities)