Staff Observability Operations Engineer

United States, Hartford · Employment contract · 130295.00 - 260590.00 USD / Year · Job Posted August 28, 2025

Job Description

We are currently seeking several experienced and highly skilled Staff Observability Operations Engineers with a strong background in Site Reliability Engineering (SRE), modern observability practices, and the management and implementation of observability and event management platforms. Responsibilities include deploying observability solutions, administration of platforms, release management, system upgrades, integrations, troubleshooting incidents, and continuous planning to enhance platform performance. Successful candidates will play a key role in ensuring our observability infrastructure meets the current and future needs of CVS Health’s dynamic environment.

Job Responsibility

  • Deploy and implement modern observability solutions
  • Manage and administer observability and event management platforms
  • Coordinate and manage release cycles for observability platforms
  • Troubleshoot and resolve incidents related to observability platforms
  • Continuously monitor and enhance platform performance
  • Collaborate with cross-functional stakeholders
  • Provide training and mentoring to junior engineers
  • Ensure compliance and security of observability platforms
  • Maintain documentation of observability platform configurations
  • Generate and analyze reports on platform performance and capacity

Requirements

  • 7+ Years of experience in IT operations, with significant responsibilities in system monitoring, performance tuning, and troubleshooting enterprise applications
  • 5+ Years in a Site Reliability Engineering (SRE) role deploying and managing modern observability solutions
  • 5+ Years managing and implementing observability and event management platforms (e.g., AppDynamics, Splunk, Prometheus, Grafana)
  • Experience developing and administering ServiceNow ITOM event management solutions
  • Experience deploying and managing service reliability platforms (e.g., xMatters, OpsGenie, PagerDuty)
  • Experience with and deep knowledge of cloud environments, cloud monitoring platforms, and container orchestration tools (e.g., AWS/CloudTrail, Azure/Monitor, GCP/GCM, Kubernetes, OpenShift)
  • Proficiency in Python and other scripting languages such as Ansible, PowerShell, Bash for automation and configuration
  • Hands-on experience deploying, managing, and administering observability platforms
  • Hands-on experience leading, coordinating, and performing migration of application, platform, and infrastructure observability solutions
  • Proven ability to troubleshoot and resolve complex technical issues
  • Experience monitoring platform performance and implementing enhancements to support scalability
  • Knowledge of compliance and security standards related to observability platforms
  • Excellent communication skills, both verbal and written
  • Experience with configuring and leveraging source code management tools and workflows
  • Proficiency in scripting and programming languages such as Ansible, PowerShell, Bash, Python, YAML, XML, and JSON
  • Preferred certifications: ITIL 4 Practitioner, DevOps Institute Observability Foundation, ServiceNow CIS-Event Management Implementer, xMatters Integrator

Nice to have

  • ITIL 4 Practitioner: Monitoring and Event Management
  • DevOps Institute Observability Foundation
  • DevOps Institute Site Reliability Engineering Foundation or Practitioner
  • ServiceNow CIS-Event Management Implementer
  • ServiceNow Certified Application Developer
  • xMatters Integrator

What we offer

  • Affordable medical plan options
  • 401(k) plan (including matching company contributions)
  • Employee stock purchase plan
  • No-cost programs for all colleagues, including wellness screenings, tobacco cessation, and weight management programs
  • Confidential counseling and financial coaching
  • Paid time off
  • Flexible work schedules
  • Family leave
  • Dependent care resources
  • Colleague assistance programs
  • Tuition assistance
  • Retiree medical access

Similar Jobs for Staff Observability Operations Engineer

8 matching positions

Principal Network Engineer, Operations & Observability

The System Engineer job family has responsibility for infrastructure/technical p...
Location: United States, Englewood
Salary: 60.24 - 89.60 USD / Hour
American Nursing Care (americannursingcare.com)
Expiration Date: Until further notice
Requirements
  • Bachelor of Arts degree or equivalent experience
  • 10 years of professional IT experience in an IT technical or infrastructure field
  • 5+ years Unix operational experience (Solaris, AIX, Linux)
  • 5+ years Windows Server operational experience
Job Responsibility
  • Platform Lifecycle Management
  • Enterprise Architecture and Strategy
  • Future-State Vision
  • Strategy and Roadmap
  • Architectural Standards
  • Collaboration and Operational Model
  • Develops organizational policies, standards, and guidelines for methods and tools
  • Determines testing policy
  • Sets the release policy for the organization
  • Maintain primary responsibility for strategic planning, technical roadmap development, standards and architecture
What we offer
  • medical
  • prescription drug
  • dental
  • vision plans
  • life insurance
  • paid time off
  • tuition reimbursement
  • retirement plan benefit(s) including 401(k), 403(b), and other defined benefits offerings
Employment type: Full-time

Staff Operations AI Engineer

We're looking for a Staff Operations AI Engineer who will architect, build, and ...
Location: United States, Denver
Salary: 150000.00 - 176000.00 USD / Year
Checkr (https://checkr.com)
Expiration Date: Until further notice
Requirements
  • 7+ years in systems engineering, automation platforms, integration architecture, or AI-enabled operations
  • Deep expertise in API design, OAuth/token-based authentication, webhooks, and event-driven systems
  • Proven experience building reliable automation workflows with observability, retries, and failure handling
  • Strong JavaScript and/or Python expertise for backend logic, scripting, and internal tooling
  • Solid SQL experience for data transformation, validation, and operational analytics (Snowflake a plus)
  • Advanced comfort with JSON, schemas, and data normalization for LLM and automation use cases
  • Hands-on experience running LLM-powered agents or automations in production, including guardrails and output validation
  • Formal experience in prompt engineering frameworks
  • Familiarity with integrating with CRM and support systems (e.g., Salesforce, Zendesk)
  • Track record designing end-to-end integration architectures across multiple platforms
Job Responsibility
  • Design and own the integration architecture that enables AI Agents to operate safely and reliably across Checkr systems and third-party platforms
  • Build production-grade API integrations with secure authentication flows, webhook and event-driven patterns, and robust automation workflows that coordinate actions across tools like Zendesk, Salesforce, and internal services
  • Ensure AI-driven operations are resilient through strong error handling, retries, observability, and fallback mechanisms
  • Proactively identify integration gaps, system bottlenecks, and failure modes, and lead the technical solutions that improve reliability and scale
  • Build and maintain the technical foundations that power AI-driven workflows, including structured data pipelines, normalized schemas, and predictable JSON inputs for LLMs
  • Write high-quality JavaScript and SQL to support automation logic, internal tooling, and operational insights
  • Establish engineering standards for workflow design, code quality, and system observability
  • Communicate architecture clearly through documentation and diagrams, and serve as a technical authority across teams to ensure consistent, high-quality execution
  • Implement guardrails, validation rules, and safety checks that ensure AI Agents act responsibly and accurately in production
  • Evaluate model output quality and continuously refine prompts, transformations, and logic to improve consistency, reliability, and trust in AI-driven decisions
What we offer
  • A fast-paced and collaborative environment
  • Learning and development allowance
  • Competitive cash and equity compensation, and opportunity for advancement
  • 100% medical, dental, and vision coverage
  • Up to $25K reimbursement for fertility, adoption, and parental planning services
  • Flexible PTO policy
  • Monthly wellness stipend
  • In-office perks are provided, such as lunch five times a week, a commuter stipend, and an abundance of snacks and beverages
  • A relocation stipend may be available for those willing to relocate to a Checkr hub location
Employment type: Full-time

Staff Software Engineer, DevProd (Observability)

We have an opening for a Staff Software Engineer on our Infrastructure Team, wit...
Location: United States
Salary: 196000.00 - 245000.00 USD / Year
Temporal (temporal.io)
Expiration Date: Until further notice
Requirements
  • User-first mindset
  • Motivated by impact
  • Strong opinions about tools and technology balanced by a pragmatic drive for impact
  • Ability to work in a self-directed manner in a fast-paced environment
  • Excellent collaboration and communication skills
  • Demonstrated ability to develop horizontally scalable, resilient, and high performance distributed systems in a production environment
  • Experience designing, implementing, deploying, and supporting large scale, geographically distributed observability and/or high throughput data streaming/processing pipelines, or similar
  • Expert in one or more high-level programming languages, preferably Go
  • Expert-level Kubernetes skills
  • Expert-level query development skills, preferably SQL
Job Responsibility
  • Lead the end-to-end Software Development Lifecycle: goals & requirements solicitation, design & review, implementation, operationalization & deployment, support & maintenance
  • Lead feature design, review with stakeholders, iterate to incorporate feedback and drive consensus
  • Clearly document design choices and operational knowledge to successfully deploy and manage the software you develop
  • Provide appropriate test and production readiness coverage for unit, integration, and performance of your feature ownership area
  • Set a high bar for technical excellence and take pride in the software you develop
  • Design and build multi-component, distributed systems that operate at scale
  • Investigate issues with a methodical approach to identify a root cause
  • Understand performance and reliability implications of design options at scale, and make related tradeoffs
  • Participate in the team’s on-call rotation
What we offer
  • Unlimited PTO, 12 Holidays + 2 Floating Holidays
  • 100% Premiums Coverage for Medical, Dental, and Vision
  • AD&D, LT & ST Disability, and Life Insurance (Standard & Supplemental Available)
  • Empower 401K Plan
  • Additional Perks for Learning & Development, Lifestyle Spending, In-Home Office Setup, Professional Memberships, WFH Meals, Internet Stipend and more
  • $3,600 / Year Work from Home Meals
  • $1,800 / Year Professional Enrichment (Career Development & Professional Memberships)
  • $1,200 / Year Lifestyle Spending Account
  • $1,000 / Year In-Home Office Setup (In addition to Temporal issued equipment)
  • $74 / Month Reimbursement for Internet
Employment type: Full-time

Staff Observability Data Infrastructure Engineer

CVS Health is seeking a highly skilled Observability Data Infrastructure Enginee...
Location: United States, Work at Home, Maryland
Salary: 130295.00 - 260590.00 USD / Year
CVS Health (https://www.cvshealth.com/)
Expiration Date: June 30, 2026
Requirements
  • 7+ years of experience building and operating log, metric, and trace pipelines in Data, Security Data, or Observability Engineering roles
  • 5+ years of hands-on experience with Databricks, Apache Spark, or other large-scale distributed data platforms
  • 5+ years of experience working across cloud platforms (AWS, Azure, or GCP), including storage, compute, and event-driven services
  • 5+ years of production experience using SQL and Python in data-intensive environments
  • 3+ years of experience with enterprise observability platforms (Splunk, Datadog, Elastic, or equivalent)
  • 3+ years of experience with high-throughput ingestion and streaming technologies such as Cribl, Vector, or Kafka
  • 3+ years of experience designing telemetry systems aligned to OpenTelemetry (OTEL) or similar standards
  • Bachelor's degree from accredited university or equivalent work experience (HS diploma + 4 years relevant experience)
Job Responsibility
  • Design, build, and operate high-volume log, metric, and trace pipelines using Databricks, cloud data lakes, and distributed processing engines
  • Architect and evolve an Observability Lakehouse aligned with OpenTelemetry (OTEL) data models and standards
  • Implement ingestion and transformation workflows using technologies such as Cribl, Vector, Jenkins, GitHub Actions, or equivalent tools
  • Normalize, model, and enrich telemetry data to support detection engineering, forensics, and operational analytics
  • Develop scalable ETL/ELT frameworks, Delta Lake architectures, and automated data quality validation for unstructured and semi-structured data
  • Partner with Security Engineering, SRE, Cloud, and SOC teams to improve enterprise visibility and detection accuracy
  • Build and maintain CI/CD pipelines and reusable Infrastructure-as-Code (IaC) patterns for observability platform deployment
  • Identify and resolve performance, latency, cost, and reliability issues across telemetry pipelines
  • Contribute to engineering standards, documentation, and knowledge sharing across observability and security platforms
What we offer
  • Medical, dental, and vision coverage
  • Paid time off
  • Retirement savings options
  • Wellness programs
  • Bonus, commission or short-term incentive program
  • Equity award program
Employment type: Full-time

Staff Security Software Engineer - Security Operations

The Role GM’s Cybersecurity Team safeguards the company’s global information ...
Location: United States, Austin
Salary: Not provided
General Motors (gm.com)
Expiration Date: Until further notice
Requirements
  • 8+ years in software engineering with a focus on distributed systems, security integrations, and data platforms
  • Deep expertise building event-driven, horizontally scalable services and contract-first APIs
  • Track record productizing AI in security workflows (multi-agent patterns, RAG at scale, evaluation harnesses, guardrails, red-teaming)
  • Cloud architecture depth (Azure/AWS/GCP), including networking, Kubernetes, service meshes, observability stacks, and IaC at scale
  • Data platform expertise: streaming (Kafka/Event Hub/PubSub), vector/search (pgvector/FAISS/Pinecone), schema/versioning, governance/lineage
  • Demonstrated org-wide influence: authored standards, drove cross-team adoption, led multi-quarter programs to successful outcomes
  • Exceptional communication with executives; ability to frame risk, ROI, and tradeoffs succinctly
Job Responsibility
  • Set the reference architecture for security data integration and AI orchestration (agents, policy-guardrailed workflows, governance)
  • Lead cross-org programs that unify SIEM/EDR/IAM/SSPM/CSPM/ITSM/cloud data models and establish single sources of truth
  • Operationalize AI at scale with safety, privacy, and governance—including data retention, PII controls, model routing, evaluation, and fallback strategies
  • Drive cost/performance optimization (throughput, latency, storage tiering, vector index strategies) for high-volume security telemetry
  • Influence vendor strategy and negotiate integration roadmaps; guide build-vs-buy decisions and multi-year investments
  • Mentor/coach Staff/Senior engineers; build a culture of design excellence, pragmatic risk management, and measurable outcomes
  • Communicate upward with crisp executive narratives, metrics, and business impact framing
What we offer
  • Relocation benefits
Employment type: Full-time

Staff Software Engineer, Add-on Operations

addons.mozilla.org (AMO) is the foundation of the Firefox add-ons ecosystem. It’...
Location: Canada; United Kingdom; France
Salary: Not provided
Mozilla (mozilla.org)
Expiration Date: Until further notice
Requirements
  • Experience leading and building modern web applications
  • Strong experience with Python/Django or similar backend frameworks
  • Understanding of web security principles and practices
  • Strong collaboration and communication skills in a distributed team environment
  • Adept at navigating ambiguity, exploring solutions, and shaping direction in new problem spaces
  • Ability to work across teams and align stakeholders on engineering vision
  • Experience mentoring and supporting junior engineers
  • Commitment to our values: Welcoming differences, Being relationship-minded, Practicing responsible participation, Having grit
Job Responsibility
  • Plan and deliver major features and architectural improvements across the Add-ons stack, including automated moderation pipelines, Reviewer tools, and DevHub
  • Partner with Engineering management to set Operations Engineering standards (SLOs, incident management, observability baselines)
  • Mentor engineers, sharing knowledge and delegating responsibilities to help others grow
  • Improve platform reliability through deployments, monitoring, and incident response
  • Help keep the platform safe and trustworthy, with attention to security and user trust
  • Step in to resolve issues impacting users and developers, from small bugs to larger incidents
  • Collaborate with designers, product managers, QA, and community contributors to deliver end-to-end improvements
  • Contribute in the open through pull requests, code reviews, and discussions
What we offer
  • Generous performance-based bonus plans
  • Rich medical, dental, and vision coverage
  • Generous retirement contributions with 100% immediate vesting
  • Quarterly all-company wellness days
  • Country specific holidays plus a day off for your birthday
  • One-time home office stipend
  • Annual professional development budget
  • Quarterly well-being stipend
  • Considerable paid parental leave
  • Employee referral bonus program
Employment type: Full-time

Staff Software engineer - Authentication and Security Observability

The Login Services team sits within Core Security Engineering and owns Uber’s au...
Location: United States, Sunnyvale
Salary: 232000.00 - 258000.00 USD / Year
Uber (uber.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 8+ years of industry experience building large-scale backend platforms, with deep experience in distributed systems and production infrastructure
  • Strong programming experience in multiple languages (e.g., Go, Java, Python, Node.js/TypeScript), with a track record of shipping reliable systems
  • Demonstrated expertise designing and operating scalable distributed services, including reliability engineering and operational excellence (observability, incident response, SLAs)
  • Strong background in security engineering, preferably in identity/authentication and building or operating security-critical pipelines at scale
  • Proven ability to own complex systems end-to-end—from architecture and implementation to rollout, monitoring, and long-term maintainability—in large-scale environments
Job Responsibility
  • Lead architecture and execution of core authentication capabilities for human and non-human identities, delivering secure, resilient, and frictionless login experiences at Uber scale
  • Own and evolve Uber’s tier-zero authentication and SSO infrastructure, maintaining high availability, security, and performance for core login flows and enabling secure, policy-driven access to internal and third-party applications
  • Build and evolve platform services (APIs, workflows, policy enforcement) with strong engineering fundamentals: reliability, performance, observability, and safe rollout/rollback
  • Develop the Security Knowledge Platform, building the data/graph foundations and risk signals to categorize identity + asset risk and power multiple security and product use cases
  • Build the next generation of automation and intelligence—agentify IAM operations to reduce toil/cost and develop the Security Knowledge Platform to power identity + asset risk insights across Security Engineering
  • Partner cross-functionally and raise the bar—align stakeholders across Security/IT/Ops/Product, mentor engineers through design reviews and incident learning, and set technical direction for the team
What we offer
  • Eligible to participate in Uber's bonus program
  • May be offered an equity award & other types of comp
  • All full-time employees are eligible to participate in a 401(k) plan
  • Eligible for various benefits
Employment type: Full-time
