Role Summary
Senior Machine Learning Engineer responsible for building a platform that accelerates drug discovery with machine learning across diverse use cases, data modalities, model architectures, and model sizes. Leverage experience in machine learning, software engineering, infrastructure, and data to enable the full model lifecycle from development through deployment.
Responsibilities
- Build the machine learning platform from a mix of off-the-shelf and custom tooling to enable the full model lifecycle from development through deployment
- Design and implement scalable LLM serving infrastructure for high-throughput inference and real-time model serving
- Architect infrastructure for agentic systems, including multi-agent coordination, workflow orchestration, and autonomous decision-making pipelines
- Partner with ML teams to onboard models to platform capabilities
- Continuously optimize the platform to support emerging ML research capabilities and evolving use-case needs
- Build and maintain orchestration frameworks for complex ML workflows, including agent-based systems and multi-modal model pipelines
- Inspire the team and stakeholders to find the best outcome by facilitating constructive dialogue and reconciling differing perspectives
- Create tools and experiences that support model development; this role is not responsible for building ML models
Qualifications
- Experience designing and building large distributed systems with scalable interfaces
- Proven experience with LLM serving infrastructure, including optimization techniques for large-scale model inference, batching, and resource management
- Expertise in infrastructure for agentic systems, including multi-agent architectures, coordination protocols, and autonomous workflow management
- Strong background in orchestration frameworks and workflow management for complex ML pipelines and agent-based workflows
- Proficiency with the full lifecycle of ML and software development in production, including data preparation, model training, evaluation, deployment, and monitoring; releasing and maintaining mature ML products; and authoring well-tested, scalable, documented code with CI/CD
- Experience with model optimization, quantization, and serving frameworks for efficient LLM deployment
- Knowledge of distributed agent coordination, task scheduling, and inter-agent communication patterns
- Ability to own projects in a collaborative, open environment
- Ability to mentor and sponsor peers and colleagues
- Nice to have: PyTorch, GCP, CUDA, Docker, Kubernetes, BigQuery, vLLM, Ray, Prefect, LangChain, model quantization techniques, distributed inference frameworks, agent orchestration platforms, workflow management systems
Skills
- Large distributed systems design
- LLM serving and optimization
- Agentic systems and multi-agent coordination
- Orchestration and workflow management
- Production ML lifecycle management
- Model optimization and quantization
- Inter-agent communication and scheduling
- Mentorship and collaboration
Additional Requirements
- Hybrid work location: Salt Lake City or New York City, with a 50% in-office requirement