AI GPU Platform Engineering

Eli Lilly and Company
Full-time
Remote friendly (Indianapolis, IN)
United States
$135,000 - $213,400 USD yearly
IT

Role Summary

The AI GPU Platform Engineer drives the engineering and operations of advanced Linux platforms supporting AI/ML and HPC workloads, with expertise in Nvidia DGX systems, Spectrum X networking, and WEKA storage. The role leads the strategy and development of advanced Linux computing capabilities for AI/ML and advises on global Linux strategy for on-premises private cloud and public IaaS Linux services.

Responsibilities

  • Drive the engineering and operations of advanced Linux platforms supporting AI and HPC workloads.
  • Manage Nvidia DGX systems using Mission Control, Base Command, and Run:AI.
  • Optimize Spectrum X networking and WEKA storage for AI/ML applications.
  • Boost productivity for Advanced Intelligence and Data Science teams through AI/HPC infrastructure tooling and operational excellence.
  • Lead the strategy, engineering, and development of advanced Linux computing capabilities for AI/ML.
  • Partner with the senior Linux platform engineer to advise on the global Linux strategy for on-premises private cloud and public IaaS Linux services.

Qualifications

  • Required: Expertise in Linux system administration, HPC environments, and Nvidia DGX server management; experience with Spectrum X networking and parallel file systems.
  • Required: 6+ years of demonstrated experience in AI/ML and HPC workloads and infrastructure.
  • Required: Hands-on experience with HPC-grade infrastructure; knowledge of accelerated computing (GPU), storage (WEKA), scheduling/orchestration (Slurm, Kubernetes, LSF), high-speed networking (Ultra Ethernet, RoCE), and container technologies (Docker).
  • Required: Proficiency in at least one scripting language (Bash, Python, etc.).
  • Preferred: Experience running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX; understanding of AI/ML workflows from data processing to inference.
  • Required: Bachelor’s degree in Computer Science, IT, or a related technical field.
  • Required: 10+ years’ experience as a Linux OS/Platform Engineer.

Education

  • Bachelor’s degree in Computer Science, Information Technology, or a related technical field.

Skills

  • Linux system administration
  • HPC environments
  • Nvidia DGX server management
  • Spectrum X networking
  • WEKA storage
  • Containerization and automation tools
  • Python/Bash scripting
  • PyTorch, NeMo, or JAX for large-scale distributed training
  • Slurm/Kubernetes/LSF, GPU acceleration
  • AI/ML infrastructure optimization

Additional Requirements

  • Hybrid role located in Indianapolis, IN (relocation required)
  • Less than 5% travel