Associate Director, Sr Principal Systems Engineer

Bristol Myers Squibb
Full-time
Remote friendly (Cambridge Crossing, FL)
United States
IT

Role Summary

Bristol Myers Squibb is looking for an experienced Sr Principal Systems Engineer in HPC/AI infrastructure to work with our technology teams and stakeholders to design, manage, and support cutting-edge HPC/AI infrastructure platforms. These platforms serve our community of researchers and scientists, who use Machine Learning, Deep Learning, and High-Performance Computing every day to make groundbreaking discoveries. Collaborating with cross-functional teams within BMS, the systems engineer will work with our teams to define and execute our HPC/AI roadmap for both on-premises datacenters and the cloud, provide guidance and technical expertise to senior research leaders and scientists, and establish standards and best-practice design principles to guide BMS' future roadmap.

Responsibilities

  • Software/Hardware Optimization, such as performance tuning for bespoke hardware, code refactoring, accelerated ML toolkits and libraries such as CUDA, and continuous integration of code and ML models.
  • Development Tools and Environment, such as Git, Linux and Python package management, PyTorch Lightning, containers, and Kubernetes.
  • Job/Scheduler Orchestration and Integration, including automating and integrating machine learning jobs with major resource schedulers such as Slurm, Grid Engine, AWS Batch, and AWS ParallelCluster to maximize throughput, performance, utilization, efficiency, and cost effectiveness for ML/AI training and prediction.
  • Datacenter/Colocation Operations, such as physical installation, networking or bespoke network fabrics, and an understanding of power/cooling (strongly preferred).
  • Vendor Outreach, including partnering with leading vendors to explore, experiment with, and pilot proof-of-concept studies that help bring in or deliver leading-edge, differentiating capabilities for BMS Research.

Qualifications

  • Strong experience working with and supporting HPC users, including scientists, data scientists, and/or developers
  • Strong working experience with container runtimes and container orchestration platforms, including Kubernetes, Docker, and/or Singularity
  • Strong operational, architecture, and troubleshooting experience with cluster managers and schedulers, ideally Slurm, but experience with other HPC schedulers is acceptable
  • Linux systems management and configuration management in an HPC environment
  • Expert troubleshooting skills with open source frameworks and libraries
  • Experience working with the NVIDIA software ecosystem and GPU-powered systems for Machine Learning and Deep Learning workloads (preferred)
  • Experience working with Deep Learning frameworks, libraries, and pipelines, either directly as a user or supporting researchers and/or data science users (preferred)
  • Experience working with parallel file systems as part of data storage strategies for large clusters (preferred)
  • Working knowledge of GPU profiling techniques (preferred)

Education

Additional Requirements
