
AI GPU Platform Engineering

Eli Lilly and Company
Full-time
Remote friendly (Indianapolis, IN)
United States
$135,000 - $213,400 USD yearly
IT


Role Summary

This AI GPU Platform Engineering role focuses on AI/HPC infrastructure, Nvidia DGX server management, Spectrum X networking, and WEKA storage integration to support AI/ML workloads. The position is based in Indianapolis, IN, is hybrid, and requires relocation.

Responsibilities

  • Drive the engineering and operations of advanced Linux platforms supporting AI and HPC workloads.
  • Manage Nvidia DGX systems using Mission Control, Base Command, and Run:AI.
  • Optimize Spectrum X networking and WEKA storage for AI/ML applications.
  • Improve productivity for Advanced Intelligence and Data Science teams through AI/HPC infrastructure tooling and operational excellence.
  • Lead strategy, engineering, and development of Advanced Linux computing capabilities for AI/ML within the Infrastructure Hosting Platform.
  • Collaborate with the senior Linux platform engineer on global Linux strategy for on-premises private cloud and public IaaS Linux services.

Qualifications

  • Required: 10+ years of experience as a Linux OS/Platform Engineer; Bachelor's degree in computer science, IT, or related field.
  • Preferred: 6+ years of demonstrated experience in AI/ML and HPC workloads; experience leading global large-scale infrastructure projects.
  • Expertise in Linux system administration, HPC environments, and Nvidia DGX server management; Spectrum X networking and parallel file systems.
  • Strong scripting skills; familiarity with containerization and automation tools.
  • Hands-on experience with HPC infrastructure, accelerated computing (GPU), storage (WEKA), scheduling/orchestration (Slurm, Kubernetes, LSF), high-speed networking (Ultra Ethernet, RoCE), and containers (Docker).
  • Experience with distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX; understanding of AI/ML workflows from data processing to inference.
  • Proficiency in at least one scripting language (e.g., Bash, Python).

Skills

  • Linux system administration
  • HPC infrastructure management
  • Nvidia DGX server management
  • Spectrum X networking
  • WEKA storage integration
  • AI/ML workload optimization
  • Infrastructure as Code, AI OPS automation
  • PyTorch, NeMo, or JAX distributed training
  • Scripting (Bash, Python)
  • Container technologies (Docker)
  • Slurm, Kubernetes, LSF

Education

  • Bachelor's degree in computer science, Information Technology, or related technical field.

Additional Requirements

  • Hybrid role located in Indianapolis, IN (relocation required)
  • Less than 5% travel