
AI GPU Platform Engineering

Eli Lilly and Company
Full-time
Remote friendly (Indianapolis, IN)
United States
$135,000 - $213,400 USD yearly
IT


Role Summary

This AI GPU Platform Engineering role focuses on AI/HPC infrastructure, Nvidia DGX server management, Spectrum X networking, and WEKA storage integration to support AI/ML workloads. The position is based in Indianapolis, IN, is hybrid, and requires relocation.

Responsibilities

  • Drive the engineering and operations of advanced Linux platforms supporting AI and HPC workloads.
  • Manage Nvidia DGX systems using Mission Control, Base Command, and Run:AI.
  • Optimize Spectrum X networking and WEKA storage for AI/ML applications.
  • Improve productivity for Advanced Intelligence and Data Science teams through AI/HPC infrastructure tooling and operational excellence.
  • Lead strategy, engineering, and development of Advanced Linux computing capabilities for AI/ML within the Infrastructure Hosting Platform.
  • Collaborate with the senior Linux platform engineer on global Linux strategy for on-premises private cloud and public IaaS Linux services.

Qualifications

  • Required: 10+ years of experience as a Linux OS/Platform Engineer; Bachelor's degree in computer science, IT, or related field.
  • Preferred: 6+ years of demonstrated experience in AI/ML and HPC workloads; experience leading global large-scale infrastructure projects.
  • Expertise in Linux system administration, HPC environments, and Nvidia DGX server management; Spectrum X networking and parallel file systems.
  • Strong scripting skills; familiarity with containerization and automation tools.
  • Hands-on experience with HPC infrastructure, accelerated computing (GPU), storage (WEKA), scheduling/orchestration (Slurm, Kubernetes, LSF), high-speed networking (Ultra Ethernet, RoCE), and containers (Docker).
  • Experience with distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX; understanding of AI/ML workflows from data processing to inference.
  • Proficiency in at least one scripting language (e.g., Bash, Python).

Skills

  • Linux system administration
  • HPC infrastructure management
  • Nvidia DGX server management
  • Spectrum X networking
  • WEKA storage integration
  • AI/ML workload optimization
  • Infrastructure as Code, AI OPS automation
  • PyTorch, NeMo, or JAX distributed training
  • Scripting (Bash, Python)
  • Container technologies (Docker)
  • Slurm, Kubernetes, LSF

Education

  • Bachelor's degree in computer science, Information Technology, or related technical field.

Additional Requirements

  • Hybrid role located in Indianapolis, IN (relocation required)
  • Less than 5% travel