Associate Director, Sr Principal Systems Engineer

Full-time

Remote friendly (Cambridge Crossing, FL)

United States

Role Summary

Bristol Myers Squibb is looking for an experienced Sr Principal Systems Engineer in HPC/AI infrastructure to work with our technology teams and various stakeholders to design, manage, and support cutting-edge HPC/AI infrastructure platforms. These platforms serve our community of researchers and scientists, who use Machine Learning, Deep Learning, and High-Performance Computing every day to make groundbreaking discoveries. Collaborating with cross-functional teams within BMS, the systems engineer will work with our teams to define and execute our HPC/AI roadmap for both on-premises datacenters and the cloud, provide guidance and technical expertise to senior research leaders and scientists, and build out standards and best-practice design principles to guide BMS' future roadmap.

Responsibilities

Software/Hardware Optimization, such as performance tuning for bespoke hardware, code refactoring, accelerated ML toolkits and libraries such as CUDA, and continuous integration of code and ML models.
Development Tools and Environment, such as Git, Linux and Python package management, PyTorch Lightning, containers, and Kubernetes.
Job/Scheduler Orchestration and Integration, including automating and integrating machine learning jobs with major resource schedulers such as Slurm, Grid Engine, AWS Batch, and AWS ParallelCluster to maximize throughput, performance, utilization, efficiency, and cost effectiveness for ML/AI training and prediction.
Datacenter/Colocation Operations, such as physical installation, networking or bespoke network fabrics, and an understanding of power/cooling (strongly preferred).
Vendor Outreach, including the ability to partner with leading vendors or partners to explore, experiment with, and pilot proof-of-concept studies that help bring in or deliver leading-edge, differentiating capabilities for BMS Research.

Qualifications

Strong experience working with and supporting HPC users, including scientists, data scientists, and/or developers
Strong working experience with container runtimes and container orchestration platforms, including Kubernetes, Docker, and/or Singularity
Strong operational, architecture, and troubleshooting experience with cluster managers and schedulers, ideally Slurm, though experience with other HPC schedulers is acceptable
Linux systems management and configuration management in an HPC environment
Expert troubleshooting skills with open-source frameworks and libraries
Experience working with the NVIDIA software ecosystem and GPU-powered systems for Machine Learning and Deep Learning workloads (preferred)
Experience working with Deep Learning frameworks, libraries, and pipelines, either directly as a user or supporting researchers and/or data science users (preferred)
Experience working with parallel file systems for data storage strategies for large clusters (preferred)
Working knowledge of GPU profiling techniques (preferred)
