AI GPU Platform Engineering

Eli Lilly and Company
Full-time
Remote friendly (Indianapolis, IN)
United States
$135,000 - $213,400 USD yearly
IT

Role Summary

AI GPU Platform Engineer responsible for the engineering and operations of advanced Linux platforms that support AI/ML and HPC workloads, with expertise in Nvidia DGX systems, Spectrum X networking, and WEKA storage. The role leads the strategy and development of advanced Linux computing capabilities for AI/ML and advises on global Linux strategy for on-premises private cloud and public IaaS Linux services.

Responsibilities

  • Drive the engineering and operations of advanced Linux platforms supporting AI and HPC workloads.
  • Manage Nvidia DGX systems using Mission Control, Base Command, and Run:AI.
  • Optimize Spectrum X networking and WEKA storage for AI/ML applications.
  • Boost productivity for Advanced Intelligence and Data Science teams through AI/HPC infrastructure tooling and operational excellence.
  • Lead the strategy, engineering, and development of advanced Linux computing capabilities for AI/ML.
  • Advise, together with the senior Linux platform engineer, on the direction of the global Linux strategy for on-premises private cloud and public IaaS Linux services.

Qualifications

  • Required: Expertise in Linux system administration, HPC environments, and Nvidia DGX server management; experience with Spectrum X networking and parallel file systems.
  • Required: 6+ years of demonstrated experience in AI/ML and HPC workloads and infrastructure.
  • Required: Hands-on experience with HPC-grade infrastructure; knowledge of accelerated computing (GPU), storage (WEKA), scheduling/orchestration (Slurm, Kubernetes, LSF), high-speed networking (Ultra-Ethernet, RoCE), and container technologies (Docker).
  • Required: Proficiency in at least one scripting language (Bash, Python, etc.).
  • Preferred: Experience running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX; understanding of AI/ML workflows from data processing to inference.
  • Required: Bachelor’s degree in computer science, IT, or related technical field.
  • Required: 10+ years’ experience as a Linux OS/Platform Engineer.

Education

  • Bachelor’s degree in computer science, information technology, or related technical field.

Skills

  • Linux system administration
  • HPC environments
  • Nvidia DGX server management
  • Spectrum X networking
  • WEKA storage
  • Containerization and automation tools
  • Python/Bash scripting
  • PyTorch, NeMo, or JAX for large-scale distributed training
  • Slurm/Kubernetes/LSF, GPU acceleration
  • AI/ML infrastructure optimization

Additional Requirements

  • Hybrid role located in Indianapolis, IN (relocation required)
  • Less than 5% travel