We’re building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger: helping to simplify health care one person, one family and one community at a time.
CVS Health is seeking an Executive Director, AI Ops Engineering to build and lead a team of professionals responsible for the continuous operation, monitoring, and optimization of CVS's Enterprise AI environment. This is first and foremost an engineering leadership role: your core accountability is ensuring the platform is always on, always performing, and always improving. CVS Health's AI platform is a critical enterprise asset powering clinical, operational, and consumer capabilities at scale across one of the nation's largest healthcare organizations. Keeping it reliable, observable, and continuously improving is the mission.
Reporting to the Global Head of Infrastructure/AI Operations and Service Delivery, you will establish and maintain operational baselines across the full infrastructure stack, ensure all changes are continuously monitored, observed, and adjusted, and drive the highest levels of availability, reliability, and scalability across every layer of the environment. This is a greenfield organizational build: the person in this role will define the operating model, shape the team culture, and establish the engineering standards that will govern CVS's AI infrastructure for years ahead. If you thrive on building from the ground up, this role was designed for you.
Job Responsibilities:
Own the SRE vision, strategy, and long-range roadmap with availability (>99.99%), reliability, and scalability as the primary measures of success
Lead, develop, and integrate all functional teams into a cohesive, always-on operations organization — setting clear ownership, accountability, and performance expectations for each team and each engineer
Establish and enforce operational baselines across all platform components; ensure deviations are detected, escalated, and resolved within defined SLAs
Drive end-to-end observability with continuous feedback loops connecting monitoring data to incident response, change decisions, and improvement cycles
Oversee change management ensuring every modification is risk-assessed, monitored during rollout, and baseline-validated post-deployment
Ensure configuration consistency and drift detection across all platform components to prevent baseline degradation over time
Build and sustain a high-performing 24/7 operations model — zero mandatory overtime, zero burnout attrition, and measurable team health and retention
Empower the Security SRE Lead to implement and maintain a world-class security posture, minimizing risk and ensuring robust compliance with frameworks like HIPAA and NIST AI RMF
Direct Innovation POD strategy to develop self-healing and autonomous capabilities that proactively prevent degradation before it impacts availability
Lead GPU FinOps governance — utilization optimization, tenant quota enforcement, and cost reduction — in partnership with the Finance organization
Manage vendor relationships and performance accountability
Lead the structured transition of operational ownership from the incumbent managed services provider to CVS's internal SRE organization, governing phased handoffs, competency validation, and milestone sign-offs, ensuring a seamless transition with minimal disruption to platform availability and business operations
Establish and lead the long-term operating model by institutionalizing key technical, architectural, and delivery leadership capabilities into permanent CVS roles, ensuring the organization is fully self-sustaining at program close
Requirements:
10+ years in SRE, platform operations, or DevOps engineering leadership with a demonstrated focus on availability and reliability outcomes
5+ years leading multiple technical teams simultaneously, including 24/7 operations organizations — with measurable team health, retention, and performance outcomes
Proven success establishing and enforcing operational baselines, SLO/SLI/error budget frameworks, and observability-driven continuous improvement in complex environments
Deep expertise in Kubernetes/OpenShift, IaC, GPU computing, and AI/ML infrastructure
Experience managing large-scale MSP transitions or platform operational handoffs while ensuring business continuity and minimizing disruption
Demonstrated FinOps and GPU cost optimization experience in cloud or on-premises environments
Security framework implementation and compliance program management in regulated industries (HIPAA, NIST AI RMF)
Track record building sustainable 24/7 operations models with measurable retention and no burnout-related attrition
Executive stakeholder communication, vendor negotiation, and budget ownership
Background in innovation programs, POD structures, or centers of excellence
Willingness to travel and work off hours as required
Required: Bachelor's degree in Computer Science, Engineering, or a related field
Nice to have:
NVIDIA AI Enterprise, Run:AI, or GPU orchestration platform experience