Role Summary
The Senior Compute Platform Engineer is a senior technical contributor responsible for designing and operating a hybrid on-prem/cloud compute platform that enables interactive development, large-scale batch processing, and observability across the application and analysis lifecycle. The role emphasizes building toolchains and workflows, CI/CD-driven deployments, and robust platform abstractions, while mentoring junior team members and ensuring code quality and documentation. Location options include Cambridge, MA; Waltham, MA; Rockville, MD; and San Francisco, CA.
Responsibilities
- Designs, builds, and operates tools, services, and workflows that deliver high value by solving key business problems
- Develops key components of a hybrid on-prem/cloud compute platform for both interactive and scalable batch computing, and establishes processes and workflows to transition existing HPC users and teams to the platform
- Owns code-driven environment, application, and container/image builds, as well as CI/CD-driven application deployments
- Consults scientific users on scaling applications to petabytes of data, drawing on a deep understanding of software engineering, algorithms, and the underlying hardware infrastructure and its impact on performance
- Confidently optimizes design and execution of complex solutions within large-scale distributed computing environments
- Produces well-engineered software, including appropriate automated test suites, technical documentation, and operational strategy
- Ensures consistent application of platform abstractions to maintain quality and uniformity in logging and lineage
- Is fully versed in coding best practices and ways of working, and participates in code reviews, partnering with colleagues to improve the team's standards
- Adheres to the QMS framework and CI/CD best practices, and helps guide improvements that strengthen ways of working
- Provides leadership to team members, helping others get the job done right
Qualifications
- Required: Bachelor's degree in Data Engineering, Computer Science, Software Engineering, or a related discipline
- Required: 6+ years of professional experience
- Required: Experience with Python
- Required: Experience with cloud computing platforms
- Required: Experience with High-Performance Computing (HPC)
- Preferred: Deep knowledge and use of at least one common programming language (e.g., Python, C++, Java) including toolchains for documentation, testing, and operations/observability
- Preferred: Deep expertise in modern software development tools/ways of working (Git, GitHub, DevOps tools, metrics/monitoring)
- Preferred: Deep cloud expertise (AWS, Google Cloud, Azure) including infrastructure-as-code tools and scalable compute technologies (e.g., Google Batch and Vertex)
- Preferred: Experience with CI/CD implementations using Git and a common CI/CD stack (Azure DevOps, CloudBuild, Jenkins, CircleCI, GitLab)
- Preferred: Deep expertise with Docker, Kubernetes, and the CNCF ecosystem including deployment tools like Helm
- Preferred: Experience with low-level build tools (Make, CMake) and automated build systems such as Spack or EasyBuild
- Preferred: Experience with workflow orchestration tools (Argo, Airflow, Nextflow, Snakemake, VisTrails, Cromwell)
- Preferred: Experience with application performance tuning in parallel and distributed computing, including MPI/OpenMP/Gloo and understanding of hardware, networks, and storage
- Preferred: Demonstrated excellence with agile software development using Jira/Confluence
- Preferred: Engagement with open-source community and potential contributions to tools