
Principal AIOps Engineer

United States · Employment contract · 144200.00 - 288400.00 USD / Year · Job Posted May 15, 2026

Job Description

We’re building a world of health around every individual — shaping a more connected, convenient and compassionate health experience. At CVS Health®, you’ll be surrounded by passionate colleagues who care deeply, innovate with purpose, hold ourselves accountable and prioritize safety and quality in everything we do. Join us and be part of something bigger – helping to simplify health care one person, one family and one community at a time.

Job Responsibility

  • Lead the AIOps strategy, roadmap, and operating model (intake, triage, automation lifecycle, KPIs) to measurably improve MTTR, alert quality, and operational efficiency
  • Own the observability-to-AIOps pipeline (metrics, logs, traces, events) and drive standardization of telemetry, service health models, and actionable alerting across teams and platforms
  • Design and implement event intelligence: correlation, deduplication, suppression, anomaly detection, incident clustering, and probable-cause analysis using topology/CMDB context
  • Advise operations, service owners, and leadership stakeholders; lead change enablement, adoption, and value measurement for AIOps and agentic automation across the organization
  • Develop ServiceNow-centric AIOps integrations (ITSM + ITOM/Event Management where applicable): event ingestion, alert-to-incident policies, enrichment, assignment/routing, approvals, change workflows, and closure updates for auditable closed-loop ops
  • Establish governance for operational AI (risk controls, approvals, auditability, data access, prompt/response logging, evaluation, and continuous improvement) in partnership with security, compliance, and operations
  • Build and operationalize agentic AI workflows for incident triage and resolution: signal summarization, similar-incident retrieval, knowledge article drafting, ticket updates, stakeholder communications, and human-in-the-loop remediation
  • Enable closed-loop automation and self-healing by connecting AIOps detections to orchestrated actions (runbooks/workflows), with clear approvals, safety checks, and rollback paths
  • Partner with NOC/SOC, infrastructure, and application owners to onboard services into AIOps, define service models, and improve signal quality, escalation paths, and operational readiness
  • Create enablement materials (playbooks, operating procedures, dashboards) and coach teams on AIOps practices, agentic AI usage, and responsible automation

Requirements

  • 10+ years of experience in SRE or production operations supporting highly available services, along with experience working in a product model
  • Proven technical leadership: ability to set direction, lead cross-team initiatives, and advise stakeholders through architecture reviews, tradeoffs, and operational readiness
  • Strong programming/scripting skills (Python preferred) and experience building automation, integrations, and APIs
  • Experience integrating observability platforms and event sources across hybrid environments (cloud/on-prem) and operating production-grade monitoring/event management at scale
  • Strong ServiceNow experience as an ITSM system of record (Incident/Problem/Change; CMDB/asset concepts). Ability to build and operate integrations at scale (REST, webhooks, event management) to support automation and auditability
  • Python (preferred) for automation and data/ML pipelines; experience building integrations, services, and operational tooling
  • Workflow orchestration and integrations (ServiceNow APIs, event pipelines, runbook automation) with strong reliability, security, and auditability practices
  • Observability: Prometheus/Grafana, OpenTelemetry, ELK/Splunk/Datadog (or equivalent)
  • ServiceNow ITSM/ITOM: Incident/Problem/Change, CMDB/service mapping concepts, and Event Management/AIOps integrations (where applicable)
  • Agentic AI frameworks: building tool-using agents, retrieval workflows, prompt/response logging, evaluation, and guardrails
  • Operational ML/Analytics: anomaly detection and time-series analysis, correlation approaches, and model/agent evaluation & monitoring in production
  • Bachelor’s degree or equivalent experience (high school diploma plus 4 years of relevant work experience)

Nice to have

  • Demonstrated experience applying machine learning and/or LLM-based approaches to operational problems (noise reduction, correlation, anomaly detection, summarization, and assisted remediation) in production environments
  • Experience building an agentic AI platform/ecosystem (shared tools, standardized patterns, evaluation, and guardrails) and enabling multiple teams to safely deliver automations
  • Familiarity with ServiceNow ITOM / Event Management / AIOps capabilities (or equivalent) and integrating observability signals into ITSM workflows
  • Strong Linux and networking fundamentals (TCP/IP, DNS, TLS, load balancing) and ability to troubleshoot distributed systems end-to-end
  • DevOps or platform engineering experience supporting highly available services, along with experience working in a product model
  • Excellent communication skills with the ability to lead incident bridges, write clear postmortems, and influence reliability improvements across teams

What we offer

  • Medical, dental, and vision coverage
  • Paid time off
  • Retirement savings options
  • Wellness programs
  • Bonus, commission or short-term incentive program
  • Equity award program

Similar Jobs for Principal AIOps Engineer

8 matching positions

Principal Engineer Software (AIOps)

At Palo Alto Networks®, we're united by a shared mission—to protect our digital ...
Location: United States, Santa Clara
Salary: 147000.00 - 237500.00 USD / Year
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • Must have 5+ years of hands-on experience in building large enterprise applications
  • Must have extensive hands-on programming skills in Java and distributed systems
  • Deep understanding of design patterns
  • Good communication skills and ability to work in a fast-paced environment.
Job Responsibility
  • Tackle new and challenging problems by building a new generation of highly scaled data processing and analytics systems
  • Contribute to the architecture, design, and development of features
  • Solve complex problems in pipeline scaling and data storage to facilitate dashboards
  • Suggest and implement improvements to the development processes
  • Work with DevOps and Technical Support teams to investigate and resolve critical customer defects.
What we offer
  • Restricted stock units
  • Bonus
  • Employee benefits may be found here
  • Fulltime

Principal Engineer Software (AIOps)

Strata Logging Service (SLS) powers advanced cybersecurity innovations by provid...
Location: United States, Santa Clara
Salary: 147000.00 - 237500.00 USD / Year
Company: Palo Alto Networks Italia
Expiration Date: Until further notice
Requirements
  • Must have 5+ years of hands-on experience in building large enterprise applications
  • Must have extensive hands-on programming skills in Java and distributed systems
  • Deep understanding of design patterns
  • Good communication skills and ability to work in a fast-paced environment
Job Responsibility
  • Tackle new and challenging problems by building a new generation of highly scaled data processing and analytics systems
  • Contribute to the architecture, design, and development of features; solve complex problems in pipeline scaling and data storage to facilitate dashboards
  • Suggest and implement improvements to the development processes
  • Work with DevOps and Technical Support teams to investigate and resolve critical customer defects
  • Fulltime

Principal Site Reliability Engineer (AIOps)

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...
Location: United States, Santa Clara
Salary: 151600.00 - 245300.00 USD / Year
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • BS or MS in Computer Science, a related field, or equivalent professional experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in private or public cloud
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Familiarity with CI/CD pipelines, GitLab and GitHub preferred
  • Ability to diagnose and troubleshoot complex distributed systems handling high volume transactions
  • Excellent written and verbal communication, able to collaborate and rally support
  • Self-disciplined, self-managed, self-motivated and strong sense of ownership, urgency, and drive
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate robust deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
What we offer
  • restricted stock units
  • bonus
  • employee benefits
  • Fulltime

Principal Frontend Engineer - Azure Monitor AIOps & Experience (Azure Data)

Principal Frontend Engineer — Azure Monitor AIOps & Experience (Azure Data). Abo...
Location: Israel, Tel Aviv, Herzliya
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • 7+ years building modern web applications with TypeScript and React (or similar), including component architecture, state management, and testing at scale
  • Proven experience delivering data-intensive UIs (dashboards, charts, investigative workflows) with performance and accessibility best practices
  • Strong API integration skills, client-side performance profiling, and production debugging
  • Ability to collaborate across PM, Design, and ML/Backend to shape product direction and ship iteratively
  • Demonstrated expertise in migrating codebases from Angular to React
Job Responsibility
  • Own end-to-end web experiences for Azure Monitor: from UX design through production rollout
  • Visualize telemetry at scale with interactive charts, timelines, and dashboard integrations
  • Advance intelligent alerting: dynamic thresholds, smart grouping, contextual enrichment, and noise reduction in the alert lifecycle
  • Integrate AIOps capabilities (anomaly detection, RCA hints, agentic flows) into approachable, explainable UIs in partnership with data/ML engineers
  • Partner with platform teams to consume Azure Monitor, Alerts, Azure Resource Graph, and Log Analytics APIs; shape UI contracts, performance and resilience patterns
  • Champion quality: instrumentation, accessibility, localization, reliability, and front-end performance
  • Mentor and lead: drive technical design, code reviews, and best practices across cross-geo teams
  • Fulltime

Principal Engineer

The Principal AI/ML Operations Engineer leads the architecture, automation, and ...
Location: United States, Pleasanton, California
Salary: 251000.00 - 314500.00 USD / Year
Company: BlackLine
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, Data Science, or a related field
  • 10+ years in ML infrastructure, DevOps, and software system architecture
  • 4+ years in leading MLOps or AI Ops platforms
  • Strong programming skills in languages such as Python, Java, or Scala
  • Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
  • Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
  • Deep familiarity with LangChain, LangGraph, ADK or similar agentic system runtime management
  • Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
  • Hands-on experience with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking
  • Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
Job Responsibility
  • Define enterprise-level standards and reference architectures for ML-Ops and AIOps systems
  • Partner with data science, security, and product teams to set evaluation and governance standards (Guardrails, Bias, Drift, Latency SLAs)
  • Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
  • Lead incident response and reliability strategies for ML/AI systems
  • Lead the deployment of AI models and systems in various environments
  • Collaborate with development teams to integrate AI solutions into existing workflows and applications
  • Ensure seamless integration with different platforms and technologies
  • Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance
  • Build CI/CD pipelines automating LLM agent deployment, policy validation, and prompt evaluation of workflows
  • Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics
What we offer
  • short-term and long-term incentive programs
  • robust offering of benefit and wellness plans
  • Fulltime

Principal Network Engineer, Operations & Observability

The System Engineer job family has responsibility for infrastructure/technical p...
Location: United States, Englewood
Salary: 60.24 - 89.60 USD / Hour
Company: American Nursing Care
Expiration Date: Until further notice
Requirements
  • Bachelor of Arts degree or equivalent experience
  • 10 years of professional IT experience in an IT technical or infrastructure field
  • 5+ years Unix operational experience (Solaris, AIX, Linux)
  • 5+ years Windows Server operational experience
Job Responsibility
  • Platform Lifecycle Management
  • Enterprise Architecture and Strategy
  • Future-State Vision
  • Strategy and Roadmap
  • Architectural Standards
  • Collaboration and Operational Model
  • Develops organizational policies, standards, and guidelines for methods and tools
  • Determines testing policy
  • Sets the release policy for the organization
  • Maintain primary responsibility for strategic planning, technical roadmap development, standards and architecture
What we offer
  • medical
  • prescription drug
  • dental
  • vision plans
  • life insurance
  • paid time off
  • tuition reimbursement
  • retirement plan benefit(s) including 401(k), 403(b), and other defined benefits offerings
  • Fulltime

Principal Software Engineer - Azure Storage

Storage, the core of Microsoft's Azure Cloud, provides 10 exabytes of capacity a...
Location: Australia, Sydney
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, or Python OR equivalent experience
  • 5+ years developing production software
  • 5+ years of system design, algorithmic skills, and data structures experience
  • 5+ years of debugging, testing, and problem-solving skills
  • 5+ years of proficiency working cross teams and collaborating with partners
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
  • These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position requires passing the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility
  • Design, develop, test and support features, experiences and solutions for highly scalable services
  • Develop high quality secure and compliant solutions
  • Support highly available services used by millions of users on a daily basis
  • Provide technical leadership across multiple projects, aligning engineering priorities with business objectives and driving measurable impact through innovation and execution excellence
  • Infuse AIOps practices to drive productivity, operational excellence, observability, incident detection, accurate root-cause analysis and mitigation
  • Fulltime

Lead / Principal Software Engineer

We’re hiring Lead and Principal Software Engineers to build the next generation ...
Location: Australia, Sydney
Salary: Not provided
Company: Blume Global
Expiration Date: Until further notice
Requirements
  • 10+ years building scalable, fault-tolerant systems and enterprise software
  • Strong experience with backend architecture, platform modernization, and CI/CD
  • Proficiency in C#, Java, Python, SQL, and JavaScript
  • Experience with cloud infrastructure (AWS, Kinesis, Lambda) and DevOps tools (Docker, Kubernetes, Jenkins)
  • Proven ability to lead technical decisions, mentor engineers, and improve team productivity
  • Strong experience integrating and evaluating AI tools like GitHub Copilot and AIOps in real-world engineering workflows
  • Strong communication across product, compliance, and engineering teams
  • Track record of aligning technical work with business outcomes and customer value
Job Responsibility
  • Build the next generation of our platforms
  • Work on high-scale systems that process billions of transactions
  • Modernize core infrastructure
  • Drive AI initiatives to improve performance and reliability
  • Set technical direction
  • Mentor senior engineers
  • Shape architecture across multiple domains
What we offer
  • Competitive Package + Equity
  • Find the team/project that fits you best
  • Hybrid and Flexible Work
  • Continuous Learning and Growth
  • Access learning platforms (Coursera, Pluralsight, LinkedIn Learning, WiseTech Academy), mentorship, and development opportunities
  • Top-Tier Hardware
  • Onsite Meals and Snacks