Role Summary
We are seeking a highly skilled Senior AIML Optimization Engineer to help advance an ambitious AI-driven platform for computing and AIML workloads. The role focuses on optimizing first-in-class Compute and AIML platforms to accelerate application development, scale computational experiments, and integrate computation with project metadata, logs, and performance tracking across Cloud and High-Performance Computing environments. The position requires ownership, collaboration, and the ability to design strategies that improve the end-user environment and scale across the organization.
Responsibilities
- Serve as a key engineer for the optimization team and contribute technical expertise to teams in closely aligned technical areas such as DevOps, Cloud and Infrastructure
- Lead design of major optimization software components of the Compute and AIML Platforms, contribute to development of production code and participate in both design reviews and PR reviews
- Accountable for delivery of scalable solutions to the Compute and AIML Platforms that support the entire application lifecycle (interactive development and exploration/analysis, scalable batch processing, application deployment), with particular focus on performance at scale
- Partner with both AIML and Compute platform teams as well as scientific users to help optimize and scale scientific workflows, drawing on a deep understanding of both software and underlying infrastructure (networking, storage, GPU architectures, …)
- Participate in or lead a scrum team and contribute technical expertise to teams in closely aligned technical areas
- Design innovative strategies and ways of working that create a better environment for end users, and construct a coordinated, stepwise plan to bring others along the change curve
- Act as a standard bearer for proper ways of working and engineering discipline, including CI/CD best practices, and proactively spearhead improvements within the engineering area
Qualifications
- Required: Bachelor’s, Master’s or PhD degree in Computer Science, Software Engineering, or related discipline
- Required: 6+ years of experience with a Bachelor’s, 4+ years with a Master’s, or 2+ years with a PhD, in cloud computing, scalable parallel computing paradigms, software engineering, and CI/CD
- Required: 2+ years of experience in AIML engineering, including large-scale model training and production deployment
- Preferred: Deep experience with at least one interpreted and one compiled language (e.g., Python, C/C++, Scala, Java) and toolchains for documentation, testing, and operations/observability
- Preferred: Deep experience with application performance tuning and optimization in parallel and distributed computing paradigms and knowledge of MPI, OpenMP, Gloo, and underlying systems
- Preferred: Cloud expertise (AWS, Google Cloud, Azure) with infrastructure-as-code tools and scalable cloud compute technologies
- Preferred: Expertise in AIML training optimization, distributed multi-node training, and acceleration of training jobs
- Preferred: Understanding of ML model deployment strategies, including scalable inference in multi-GPU, multi-node environments
- Preferred: Experience with CI/CD implementations using common stacks (e.g., Azure DevOps, CloudBuild, Jenkins, CircleCI, GitLab)
- Preferred: Experience with Docker, Kubernetes, CNCF ecosystem, and deployment tools like Helm
- Preferred: Experience with build tools (make, CMake) and optimization at build/compile level
- Preferred: Agile software development experience using Jira and Confluence
Skills
- Strong software engineering and optimization skills across cloud and HPC environments
- Proficiency in Python, C/C++, Scala, or Java, with emphasis on performance and observability
- Hands-on experience with Docker, Kubernetes, and CI/CD pipelines
- Experience with distributed computing, parallelism, and performance tuning
- Familiarity with cloud platforms and infrastructure-as-code tools
- Ability to collaborate across cross-functional teams and guide end-user adoption
Education
- Bachelor’s, Master’s or PhD degree in Computer Science, Software Engineering, or related discipline