Role Summary
Senior Compute Platform Engineer responsible for building a first-in-class, metadata-forward compute platform spanning on-prem and cloud environments. You will design, build, and operate toolchains and workflows that accelerate application development, scale computational experiments, and integrate computation with project metadata, logs, experiment configuration, and performance tracking across interactive development, large-scale batch processing, and production deployments. You will mentor junior engineers, uphold standards for coding, documentation, and DevOps practices, and contribute to open-source community efforts as appropriate. This role serves sites including Cambridge, Seattle, South San Francisco, and Upper Providence.
Responsibilities
- Designs, builds, and operates tools, services, and workflows that deliver high value to key business problems.
- Develops components of a hybrid on-prem/cloud compute platform for interactive and scalable batch computing; creates processes to transition existing HPC users to this platform.
- Manages code-driven environments, applications, and container/image builds, and CI/CD-driven deployments.
- Consults with scientific users on scaling applications to petabytes of data, applying software engineering knowledge and hardware considerations to performance.
- Optimizes design and execution of complex solutions within large-scale distributed computing environments.
- Produces well-engineered software, with automated tests, technical documentation, and operational strategy.
- Ensures consistent platform abstractions for logging and data lineage.
- Participates in code reviews, adheres to coding best practices, and helps raise team standards.
- Adheres to the QMS framework and CI/CD best practices, and guides improvements to these processes.
- Provides technical leadership and guidance to team members to ensure work is completed correctly and to a high standard.
Qualifications
- Required: Bachelor's degree in Data Engineering, Computer Science, Software Engineering, or a related discipline
- Required: 6+ years of professional experience
- Required: Experience with Python
- Required: Experience with cloud computing (e.g., AWS, Google Cloud, Azure)
- Required: Experience with High-Performance Computing (HPC)
- Preferred: Deep knowledge of at least one common programming language (e.g., Python, C++, Java) and related toolchains for documentation, testing, and operations/observability
- Preferred: Expertise with modern software development tools and practices (git/GitHub, DevOps tools, metrics/monitoring)
- Preferred: Deep cloud expertise (AWS, Google Cloud, Azure) including infrastructure-as-code and scalable compute technologies
- Preferred: Experience with CI/CD implementations using git and a common CI/CD stack (Azure DevOps, CloudBuild, Jenkins, CircleCI, GitLab)
- Preferred: Expertise with Docker, Kubernetes, CNCF ecosystem, and deployment tools like Helm
- Preferred: Experience with low-level build tools (make, CMake) and automated build systems (Spack, EasyBuild)
- Preferred: Experience with workflow orchestration (Argo, Airflow, Nextflow, Snakemake, VisTrails, Cromwell)
- Preferred: Experience with performance tuning in parallel and distributed computing (MPI, OpenMP, Gloo) and understanding of hardware, networks, storage
- Preferred: Agile software development experience using Jira and Confluence
- Preferred: Engagement with open-source community and ability to contribute to tools
Skills
- Python
- Cloud computing (AWS, Google Cloud, Azure)
- High-Performance Computing (HPC)
- Docker, Kubernetes, CNCF ecosystem; Helm
- CI/CD tools and workflows (Git, GitHub, Azure DevOps, CloudBuild, Jenkins, CircleCI, GitLab)
- Workflow orchestration (Argo, Airflow, Nextflow, Snakemake, VisTrails, Cromwell)
- Build tools (make, CMake) and automated build systems (Spack, EasyBuild)
- Application performance tuning for parallel/distributed computing (MPI, OpenMP, Gloo); hardware, networks, storage understanding
- Agile tools (Jira, Confluence)
- Open-source community involvement