Operational Reliability & Service Ownership
- Lead a centralized Service Reliability & Operations function, providing standardized intake, triage, coordination, and governance across incidents, events, requests, and problems.
- Deploy AI-assisted triage models that classify, prioritize, and route incidents based on historical patterns, service risk profiles, and real-time signals.
- Establish and govern an automated remediation capability for known failure patterns, with human-in-the-loop escalation for high-risk scenarios.
- Own the operational runbook strategy so runbooks are active, machine-readable automation inputs (not static documentation) driving consistent, auditable response execution with or without human initiation.
- Own service stability and operational readiness outcomes for a heterogeneous, global application estate.
- Ensure Major Incident Management discipline (command, escalation, executive communications) is consistently executed for critical services.
Demand Management and Intake Governance
- Own operational demand management for centralized production support; ensure requests, enhancements, onboarding, and change-driven demand are visible, prioritized, and aligned to reliability and capacity constraints.
- Define and govern standardized demand intake, categorization, and prioritization models balancing business urgency, service risk, and operational sustainability.
- Use AI-driven demand pattern analysis to identify predictable, automatable demand and protect human capacity for high-judgment activity.
- Use demand trends to influence service design, onboarding decisions, and support models.
Capacity Planning & Workforce Sustainability
- Own capacity planning using demand, incident, and service risk data to forecast workload and skill needs.
- Translate demand signals into staffing strategies, automation investments, and capacity plans.
- Ensure the operating model scales sustainably by balancing workloads.
Reliability Engineering & Continuous Improvement
- Strengthen Problem Management to reduce reliability risk and demand (drive systemic risk reduction vs. reactive firefighting).
- Own and maintain a living automation roadmap sequencing opportunities by ROI, operational risk reduction, and technical feasibility to reduce MTTR, operational toil, and human demand.
- Partner with engineering/platform/SRE teams to create feedback loops between automated remediation outcomes and the knowledge base.
Data-Driven Operations & Tool Enablement
- Use incident, change, and risk data to prioritize staffing, automation, and reliability improvement investments.
- Define and standardize operational KPIs/health indicators (demand volume, capacity utilization, MTTR, automation effectiveness).
- Ensure ITSM and observability tooling enables consolidated intake, standardized workflows, measurable outcomes, real-time visibility, predictive insights, and actionable reporting.
- Partner on AI-enabled/automated capabilities (predictive insights, automated remediation, agent-driven coordination).
Leadership & Stakeholder Engagement
- Lead/develop operations, reliability, and service management leaders; set expectations for technical literacy, automation-first thinking, demand accountability, and outcome ownership.
- Partner with business service owners and risk/security/technology leaders; translate operational data and AI insights into business-relevant decisions.
- Drive organizational change to protect stability, transparency, and capacity during transitions.
How You'll Succeed
- Operate at both strategic design and day-to-day execution levels in high-pressure production environments while modernizing operations.
- Demonstrate success implementing centralized, tiered operating models that use demand management, capacity planning, automation, and AI to scale globally.
- Bring a deep understanding of Incident, Event, Change, and Problem Management, enhanced by automation, analytics, and AI, in a mature ITSM environment.
- Use operational and demand data to drive capacity decisions, reliability improvements, and executive confidence.
Your Basic Qualifications
- Bachelor's degree in Business, Information Technology, STEM, or related field.
- 12+ years of IT experience in production operations, reliability, or service management (or SRE-adjacent).
- 7+ years leading vendor/supplier-supported operating models (capacity planning, demand forecasting, automation/innovation).
- 5+ years in people leadership roles in complex, global environments.
- Demonstrated experience leading high-severity incident response and operational risk mitigation.
- Hands-on experience with AI in production operations, reliability, or service management (or SRE-adjacent).
- Authorization to work in the United States full-time (Lilly will not sponsor visas).
What You Should Bring
- Strong executive communication skills translating operational and AI signals into business-relevant insights.
Compensation/Benefits (as stated)
- Anticipated wage: $132,000 - $193,600.
- Eligible for company bonus (depending on performance).
- Comprehensive benefits for eligible employees (401(k), pension, vacation, medical/dental/vision/prescription, flexible benefits, life insurance, time off/leave, and well-being benefits).