Principal AIOps Engineer

CVS Health

Location:
United States

Contract Type:
Employment contract

Salary:
144200.00 - 288400.00 USD / Year

Job Description:

We’re building a world of health around every individual — shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger – helping to simplify health care one person, one family and one community at a time.

Job Responsibility:

  • Lead the AIOps strategy, roadmap, and operating model (intake, triage, automation lifecycle, KPIs) to measurably improve MTTR, alert quality, and operational efficiency
  • Own the observability-to-AIOps pipeline (metrics, logs, traces, events) and drive standardization of telemetry, service health models, and actionable alerting across teams and platforms
  • Design and implement event intelligence: correlation, deduplication, suppression, anomaly detection, incident clustering, and probable-cause analysis using topology/CMDB context
  • Advise operations, service owners, and leadership stakeholders; lead change enablement, adoption, and value measurement for AIOps and agentic automation across the organization
  • Develop ServiceNow-centric AIOps integrations (ITSM + ITOM/Event Management where applicable): event ingestion, alert-to-incident policies, enrichment, assignment/routing, approvals, change workflows, and closure updates for auditable closed-loop ops
  • Establish governance for operational AI (risk controls, approvals, auditability, data access, prompt/response logging, evaluation, and continuous improvement) in partnership with security, compliance, and operations
  • Build and operationalize agentic AI workflows for incident triage and resolution: signal summarization, similar-incident retrieval, knowledge article drafting, ticket updates, stakeholder communications, and human-in-the-loop remediation
  • Enable closed-loop automation and self-healing by connecting AIOps detections to orchestrated actions (runbooks/workflows), with clear approvals, safety checks, and rollback paths
  • Partner with NOC/SOC, infrastructure, and application owners to onboard services into AIOps, define service models, and improve signal quality, escalation paths, and operational readiness
  • Create enablement materials (playbooks, operating procedures, dashboards) and coach teams on AIOps practices, agentic AI usage, and responsible automation

Requirements:

  • 10+ years of experience in SRE or production operations supporting highly available services, along with experience with the Product model
  • Proven technical leadership: ability to set direction, lead cross-team initiatives, and advise stakeholders through architecture reviews, tradeoffs, and operational readiness
  • Strong programming/scripting skills (Python preferred) and experience building automation, integrations, and APIs
  • Experience integrating observability platforms and event sources across hybrid environments (cloud/on-prem) and operating production-grade monitoring/event management at scale
  • Strong ServiceNow experience as an ITSM system of record (Incident/Problem/Change; CMDB/asset concepts). Ability to build and operate integrations at scale (REST, webhooks, event management) to support automation and auditability
  • Python (preferred) for automation and data/ML pipelines; experience building integrations, services, and operational tooling
  • Workflow orchestration and integrations (ServiceNow APIs, event pipelines, runbook automation) with strong reliability, security, and auditability practices
  • Observability: Prometheus/Grafana, OpenTelemetry, ELK/Splunk/Datadog (or equivalent)
  • ServiceNow ITSM/ITOM: Incident/Problem/Change, CMDB/service mapping concepts, and Event Management/AIOps integrations (where applicable)
  • Agentic AI frameworks: building tool-using agents, retrieval workflows, prompt/response logging, evaluation, and guardrails
  • Operational ML/Analytics: anomaly detection and time-series analysis, correlation approaches, and model/agent evaluation & monitoring in production
  • Bachelor’s degree or equivalent experience (high school diploma plus 4 years of relevant work experience)

Nice to have:

  • Demonstrated experience applying machine learning and/or LLM-based approaches to operational problems (noise reduction, correlation, anomaly detection, summarization, and assisted remediation) in production environments
  • Experience building an agentic AI platform/ecosystem (shared tools, standardized patterns, evaluation, and guardrails) and enabling multiple teams to safely deliver automations
  • Familiarity with ServiceNow ITOM / Event Management / AIOps capabilities (or equivalent) and integrating observability signals into ITSM workflows
  • Strong Linux and networking fundamentals (TCP/IP, DNS, TLS, load balancing) and ability to troubleshoot distributed systems end-to-end
  • DevOps or platform engineering experience supporting highly available services, along with experience with the Product model
  • Excellent communication skills with the ability to lead incident bridges, write clear postmortems, and influence reliability improvements across teams

What we offer:

  • Medical, dental, and vision coverage
  • Paid time off
  • Retirement savings options
  • Wellness programs
  • Bonus, commission or short-term incentive program
  • Equity award program

Additional Information:

Job Posted:
May 15, 2026

Expiration:
July 01, 2026

Employment Type:
Full-time

Work Type:
Remote work

Similar Jobs for Principal AIOps Engineer

Principal Engineer

The Principal AI/ML Operations Engineer leads the architecture, automation, and ...
Location:
United States, Pleasanton, California
Salary:
251000.00 - 314500.00 USD / Year
BlackLine
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Science, or a related field
  • 10+ years in ML infrastructure, DevOps, and software system architecture
  • 4+ years leading MLOps or AIOps platforms
  • Strong programming skills in languages such as Python, Java, or Scala
  • Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
  • Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
  • Deep familiarity with LangChain, LangGraph, ADK or similar agentic system runtime management
  • Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
  • Hands-on experience with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking
  • Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
Job Responsibility:
  • Define enterprise-level standards and reference architectures for MLOps and AIOps systems
  • Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
  • Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
  • Lead incident response and reliability strategies for ML/AI systems
  • Lead the deployment of AI models and systems in various environments
  • Collaborate with development teams to integrate AI solutions into existing workflows and applications
  • Ensure seamless integration with different platforms and technologies
  • Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance
  • Build CI/CD pipelines automating LLM agent deployment, policy validation, and prompt evaluation of workflows
  • Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics
What we offer:
  • Short-term and long-term incentive programs
  • Robust offering of benefit and wellness plans
Employment Type:
Full-time

Lead / Principal Software Engineer

We’re hiring Lead and Principal Software Engineers to build the next generation ...
Location:
Australia, Sydney
Salary:
Not provided
Blume Global
Expiration Date:
Until further notice
Requirements:
  • 10+ years building scalable, fault-tolerant systems and enterprise software
  • Strong experience with backend architecture, platform modernization, and CI/CD
  • Proficiency in C#, Java, Python, SQL, and JavaScript
  • Experience with cloud infrastructure (AWS, Kinesis, Lambda) and DevOps tools (Docker, Kubernetes, Jenkins)
  • Proven ability to lead technical decisions, mentor engineers, and improve team productivity
  • Strong experience integrating and evaluating AI tools like GitHub Copilot and AIOps in real-world engineering workflows
  • Strong communication across product, compliance, and engineering teams
  • Track record of aligning technical work with business outcomes and customer value
Job Responsibility:
  • Build the next generation of our platforms
  • Work on high-scale systems that process billions of transactions
  • Modernize core infrastructure
  • Drive AI initiatives to improve performance and reliability
  • Set technical direction
  • Mentor senior engineers
  • Shape architecture across multiple domains
What we offer:
  • Competitive Package + Equity
  • Find the team/project that fits you best
  • Hybrid and Flexible Work
  • Continuous Learning and Growth
  • Access to learning platforms (Coursera, Pluralsight, LinkedIn Learning, WiseTech Academy), mentorship, and development opportunities
  • Top-Tier Hardware
  • Onsite Meals and Snacks

Principal Machine Learning Engineer

As a Principal Engineer on the ITSM team, you will get the opportunity to work o...
Location:
India, Bengaluru
Salary:
Not provided
Atlassian
Expiration Date:
Until further notice
Requirements:
  • 10+ years of total experience
  • Fluency in Python
  • Solid understanding of machine learning concepts and algorithms, including supervised and unsupervised learning, deep learning, and NLP
  • Familiarity with popular ML libraries like scikit-learn, Keras/TensorFlow/PyTorch, NumPy, pandas
  • Good understanding of the machine learning project lifecycle
  • Experience in architecting and implementing high-performance RESTful microservices (API development for ML Models)
  • Familiarity with MLOps and experience with scaling and deploying Machine Learning models
Job Responsibility:
  • Shape the future of AIOps
  • Master Generative AI
  • Become a machine learning maestro
  • Collaborate with diverse minds
  • Make a tangible impact
  • Routinely tackle complex architectural challenges, spar with other principal engineers to build ML pipelines and models that scale for thousands of customers
  • Lead code reviews and documentation, and take on complex bug fixes, especially on high-risk problems
  • Develop leadership skills
Employment Type:
Full-time

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Colombia
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Ecuador
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location:
Peru
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility:
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Machine Learning Engineer

As a Principal Engineer on the ITSM team, you will get the opportunity to work o...
Location:
Australia, Sydney
Salary:
Not provided
Atlassian
Expiration Date:
Until further notice
Requirements:
  • 10+ years of total experience
  • Fluency in at least one scripting or OOP language
  • Solid understanding of machine learning concepts and algorithms, including supervised and unsupervised learning, deep learning, and NLP
  • Familiarity with popular ML libraries like scikit-learn, Keras/TensorFlow/PyTorch, NumPy, pandas
  • Good understanding of the machine learning project lifecycle
  • Familiarity with MLOps and experience with scaling and deploying Machine Learning models
Job Responsibility:
  • Work on cutting-edge AI and ML algorithms that help modernize IT operations by reducing MTTR (mean time to resolve) and MTTI (mean time to identify)
  • Use software development expertise to solve difficult problems, tackling complex infrastructure and architecture challenges
  • Lead engineers to drive involved projects from technical design to launch
  • Collaborate with other teams and internal customers to set expectations, gather input, and communicate results
  • Work with a distributed, world-class team shaping the future of AIOps
  • Master Generative AI
  • Become a machine learning maestro
  • Collaborate with diverse minds
  • Make a tangible impact
  • Routinely tackle complex architectural challenges
What we offer:
  • Health coverage
  • Paid volunteer days
  • Wellness resources
Employment Type:
Full-time

Principal Frontend Engineer - Azure Monitor AIOps & Experience (Azure Data)

Principal Frontend Engineer — Azure Monitor AIOps & Experience (Azure Data). Abo...
Location:
Israel, Tel Aviv, Herzliya
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • 7+ years building modern web applications with TypeScript and React (or similar), including component architecture, state management, and testing at scale
  • Proven experience delivering data-intensive UIs (dashboards, charts, investigative workflows) with performance and accessibility best practices
  • Strong API integration skills, client-side performance profiling, and production debugging
  • Ability to collaborate across PM, Design, and ML/Backend to shape product direction and ship iteratively
  • Demonstrated expertise in migrating codebases from Angular to React
Job Responsibility:
  • Own end-to-end web experiences for Azure Monitor: from UX design through production rollout
  • Visualize telemetry at scale with interactive charts, timelines, and dashboard integrations
  • Advance intelligent alerting: dynamic thresholds, smart grouping, contextual enrichment, and noise reduction in the alert lifecycle
  • Integrate AIOps capabilities (anomaly detection, RCA hints, agentic flows) into approachable, explainable UIs in partnership with data/ML engineers
  • Partner with platform teams to consume Azure Monitor, Alerts, Azure Resource Graph, and Log Analytics APIs; shape UI contracts, performance and resilience patterns
  • Champion quality: instrumentation, accessibility, localization, reliability, and front-end performance
  • Mentor and lead: drive technical design, code reviews, and best practices across cross-geo teams
Employment Type:
Full-time