Role Summary
Pfizer's commitment to applying computational science in drug discovery and development includes a large-scale migration of our computational infrastructure to the cloud. This role requires extensive cloud engineering and DevOps experience, along with hands-on design and delivery of robust High Performance Computing (HPC) solutions that support computational workloads across the organization. Location: Hybrid. Must be able to work from the assigned Pfizer office 2-3 days per week, or as needed by the business.
Responsibilities
- Design, implement, operate, and own robust and dependable infrastructure for HPC and ML/AI workloads in a cloud environment (AWS/GCP).
- Lead containerization, deployment, and operation of user- and admin-facing HPC platforms (Slurm, Open OnDemand, Prometheus/Grafana, batch and distributed computing platforms) across cloud environments.
- Translate stakeholder input into robust, high-performance, scalable, cost-effective computing platforms.
- Partner with HPC specialists to capture institutional knowledge and manual processes in IaC workflows, transforming ad-hoc deployment practices into reproducible, version-controlled, automated procedures.
- Develop and maintain infrastructure automation using IaC tools like Terraform and CloudFormation to ensure repeatable environment provisioning and scaling.
- Create reusable Terraform modules, develop and enforce standards, and drive the implementation and maintenance of all cloud infrastructure using IaC tools.
- Operationalize containerized solutions using Docker and Kubernetes.
- Own the full lifecycle of infrastructure management, from provisioning through operations, support, updates, and teardown of production computing platforms.
- Perform troubleshooting, system analysis, and benchmarking to resolve issues and maintain a high-performance environment.
- Develop and maintain monitoring, logging, and alerting for the infrastructure (e.g., CloudWatch, Prometheus/Grafana).
- Design dashboards, workflows, and utilities to improve observability, cost monitoring, workload efficiency, and the user and administrator experience.
- Document architecture, deployment processes, and operational procedures.
- Collaborate with team members to support delivery of scientific computing services including user support, Linux administration, operations, job scheduling, application management, and resource optimization.
Qualifications
- Required: B.S. in computer science, life science, data science, or a similar field.
- Required: 6+ years of experience in cloud infrastructure engineering with a proven track record of developing and supporting robust IaC deployments.
- Required: Experience managing scientific computing workloads in an enterprise environment.
- Required: Advanced experience with at least one of AWS or GCP, including knowledge of core compute and storage services relevant to HPC.
- Required: Solid understanding of cloud networking, identity, and security controls.
- Preferred: Prior experience with HPC deployment utilities, including AWS ParallelCluster, AWS Parallel Computing Service, and Google Cloud Cluster Toolkit.
- Preferred: Proficiency with distributed computing environments, especially EKS/GKE/Kubernetes.
- Preferred: Familiarity with HPC environments, job schedulers (Slurm), HPC application containers (Docker, Singularity, Apptainer) and NVIDIA GPU computing.
Additional Requirements
- Occasional international travel for team meetings and conferences.