We’re building a world of health around every individual, shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold themselves accountable and prioritize safety and quality in everything they do. Join us and be part of something bigger – helping to simplify health care one person, one family and one community at a time.
Job Responsibilities:
Lead the AIOps strategy, roadmap, and operating model (intake, triage, automation lifecycle, KPIs) to measurably improve MTTR, alert quality, and operational efficiency
Own the observability-to-AIOps pipeline (metrics, logs, traces, events) and drive standardization of telemetry, service health models, and actionable alerting across teams and platforms
Design and implement event intelligence: correlation, deduplication, suppression, anomaly detection, incident clustering, and probable-cause analysis using topology/CMDB context
Advise operations, service owners, and leadership stakeholders; lead change enablement, adoption, and value measurement for AIOps and agentic automation across the organization
Develop ServiceNow-centric AIOps integrations (ITSM + ITOM/Event Management where applicable): event ingestion, alert-to-incident policies, enrichment, assignment/routing, approvals, change workflows, and closure updates for auditable closed-loop ops
Establish governance for operational AI (risk controls, approvals, auditability, data access, prompt/response logging, evaluation, and continuous improvement) in partnership with security, compliance, and operations
Build and operationalize agentic AI workflows for incident triage and resolution: signal summarization, similar-incident retrieval, knowledge article drafting, ticket updates, stakeholder communications, and human-in-the-loop remediation
Enable closed-loop automation and self-healing by connecting AIOps detections to orchestrated actions (runbooks/workflows), with clear approvals, safety checks, and rollback paths
Partner with NOC/SOC, infrastructure, and application owners to onboard services into AIOps, define service models, and improve signal quality, escalation paths, and operational readiness
Create enablement materials (playbooks, operating procedures, dashboards) and coach teams on AIOps practices, agentic AI usage, and responsible automation
Requirements:
10+ years of experience in SRE or production operations supporting highly available services, along with experience working in a Product model
Proven technical leadership: ability to set direction, lead cross-team initiatives, and advise stakeholders through architecture reviews, tradeoffs, and operational readiness
Strong programming/scripting skills (Python preferred) and experience building automation, integrations, and APIs
Experience integrating observability platforms and event sources across hybrid environments (cloud/on-prem) and operating production-grade monitoring/event management at scale
Strong ServiceNow experience as an ITSM system of record (Incident/Problem/Change, CMDB/asset concepts). Ability to build and operate integrations at scale (REST, webhooks, event management) to support automation and auditability
Python (preferred) for automation and data/ML pipelines; experience building integrations, services, and operational tooling
Workflow orchestration and integrations (ServiceNow APIs, event pipelines, runbook automation) with strong reliability, security, and auditability practices
Agentic AI frameworks: building tool-using agents, retrieval workflows, prompt/response logging, evaluation, and guardrails
Operational ML/Analytics: anomaly detection and time-series analysis, correlation approaches, and model/agent evaluation & monitoring in production
Bachelor’s degree or equivalent experience (high school diploma plus 4 years of relevant work experience)
Nice to have:
Demonstrated experience applying machine learning and/or LLM-based approaches to operational problems (noise reduction, correlation, anomaly detection, summarization, and assisted remediation) in production environments
Experience building an agentic AI platform/ecosystem (shared tools, standardized patterns, evaluation, and guardrails) and enabling multiple teams to safely deliver automations
Familiarity with ServiceNow ITOM / Event Management / AIOps capabilities (or equivalent) and integrating observability signals into ITSM workflows
Strong Linux and networking fundamentals (TCP/IP, DNS, TLS, load balancing) and ability to troubleshoot distributed systems end-to-end
DevOps or platform engineering experience supporting highly available services, along with experience working in a Product model
Excellent communication skills with the ability to lead incident bridges, write clear postmortems, and influence reliability improvements across teams