Role Summary
AI GPU Platform Engineer responsible for the engineering and operations of advanced Linux platforms for AI/ML and HPC workloads, with expertise in Nvidia DGX systems, Spectrum X networking, and WEKA storage. The role leads the strategy and development of advanced Linux computing capabilities for AI/ML and advises on global Linux strategy for on-premises private cloud and public IaaS Linux services.
Responsibilities
- Drive the engineering and operations of advanced Linux platforms supporting AI and HPC workloads.
- Manage Nvidia DGX systems using Mission Control, Base Command, and Run:AI.
- Optimize Spectrum X networking and WEKA storage for AI/ML applications.
- Boost productivity for Advanced Intelligence and Data Science teams through AI/HPC infrastructure tooling and operational excellence.
- Lead the strategy, engineering, and development of advanced Linux computing capabilities for AI/ML.
- Work with the senior Linux platform engineer to advise on and direct the global Linux strategy for on-premises private cloud and public IaaS Linux services.
Qualifications
- Required: Expertise in Linux system administration, HPC environments, and Nvidia DGX server management; experience with Spectrum X networking and parallel file systems.
- Required: 6+ years of demonstrated experience with AI/ML and HPC workloads and infrastructure.
- Required: Hands-on experience with HPC-grade infrastructure; knowledge of accelerated computing (GPU), storage (WEKA), scheduling/orchestration (Slurm, Kubernetes, LSF), high-speed networking (Ultra-Ethernet, RoCE), and container technologies (Docker).
- Required: Proficiency in at least one scripting language (Bash, Python, etc.).
- Preferred: Experience running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX; understanding of AI/ML workflows from data processing to inference.
- Required: Bachelor’s degree in computer science, IT, or related technical field.
- Required: 10+ years’ experience as a Linux OS/Platform Engineer.
Education
- Bachelor’s degree in computer science, information technology, or related technical field.
Skills
- Linux system administration
- HPC environments
- Nvidia DGX server management
- Spectrum X networking
- WEKA storage
- Containerization and automation tools
- Python/Bash scripting
- PyTorch, NeMo, or JAX for large-scale distributed training
- Slurm/Kubernetes/LSF, GPU acceleration
- AI/ML infrastructure optimization
Additional Requirements
- Hybrid role located in Indianapolis, IN (relocation required)
- Less than 5% travel