Principal Service Reliability Engineer

Location: United States, Redmond
Salary: 142800.00 - 304200.00 USD / Year
Job Posted: May 31, 2026

Job Description

We are seeking a Principal Service Reliability Engineer (SRE) to lead the reliability strategy for mission-critical, large-scale distributed systems. This role operates at a system and organizational level, driving reliability engineering practices across services, influencing architecture decisions, and establishing scalable frameworks for availability, performance, and operational excellence. The Principal SRE defines reliability standards (SLOs/SLIs/error budgets) and partners with engineering, product, and platform teams to design, build, and operate resilient systems at enterprise scale. This role is accountable for reducing systemic risk, eliminating operational toil, and advancing toward autonomous, self-healing platforms.

Job Responsibility

  • Define and drive reliability strategy across services, including measurable targets for availability, latency, and performance aligned to business priorities
  • Establish and enforce SLO/SLI frameworks and error budgets across multiple teams, ensuring consistent adoption and accountability
  • Lead complex incident management and systemic RCA efforts, identifying cross-service failure patterns and driving durable, long-term fixes
  • Influence architecture and platform design to improve operability, scalability, fault isolation, and disaster recovery at enterprise scale
  • Drive reliability engineering standards for observability (metrics, logs, traces), capacity planning, and production readiness across the organization
  • Eliminate operational toil through automation, enabling self-healing systems and reducing manual intervention
  • Embed security, compliance, and resiliency practices into design and operational processes, ensuring alignment with enterprise requirements
  • Partner with engineering leadership to prioritize reliability investments and balance feature velocity with system stability
  • Lead and mentor engineers while shaping a strong reliability culture across teams and org boundaries

Requirements

  • 8+ years of technical experience in software engineering, network engineering, or systems administration; OR a Bachelor's Degree in Computer Science, Information Technology, or a related field AND 5+ years of such technical experience; OR a Master's Degree in Computer Science, Information Technology, or a related field AND 3+ years of such technical experience; OR a Doctorate Degree in Computer Science, Information Technology, or a related field AND 2+ years of such technical experience
  • Proven track record of defining and operationalizing SLOs, SLIs, and error budgets across multiple services or organizations
  • Experience leading reliability efforts for enterprise-scale or globally distributed systems
  • Advanced debugging and troubleshooting skills across application, platform, and infrastructure layers
  • Demonstrated ability to mentor senior engineers and influence engineering culture at scale
  • Experience driving platform-level improvements (e.g., standardized observability, shared reliability tooling, automated remediation frameworks)
  • Extensive experience operating large-scale, distributed production systems, including cloud-native platforms (Azure preferred)
  • Demonstrated ability to drive cross-team technical initiatives and influence architecture and engineering standards
  • Deep experience in observability, incident management, and production operations at scale
  • Strong understanding of Azure networking, distributed systems performance, and reliability engineering principles
  • Experience leveraging data platforms (Kusto, Power BI, telemetry pipelines) to drive operational insights and decision-making

Similar Jobs for Principal Service Reliability Engineer

8 matching positions

Principal Site Reliability Engineer

Location: United States, Ft. Meade
Salary: Not provided
CipherLogix
Expiration Date: Until further notice
Requirements
  • Fourteen (14) years experience in software development/engineering, including requirements analysis, software development, installation, integration, evaluation, enhancement, maintenance, testing, and problem diagnosis/resolution
  • Ten (10) years experience in system engineering/architecture
  • Ten (10) years experience working with products that support highly distributed, massively parallel computation needs such as HBase, Hadoop, Cloudbase/Accumulo, Bigtable, Cassandra, Scality, etc.
  • At least ten (10) years experience writing software scripts using scripting languages such as Perl, Python, or Ruby for software automation
  • At least four (4) years experience managing and monitoring large cloud systems (>200 nodes); Cloud Systems Administrator or Developer certification
  • Experience in performing and providing technical direction for the development, engineering, interfacing, integration, and testing of complete hardware/software systems to include monitoring technical health of a system, improving organizational processes, implementation of postmortem (failure) analysis and incident management
  • Ten (10) years experience in the cleared environment
  • Ten (10) years demonstrated experience developing software for one of the following: Windows, UNIX, or Linux OS
  • Knowledge and experience with developing distributed storage routing and querying algorithms
  • Experience in developing documentation required to support a program’s technical issues and training situations
Employment type: Full-time

Principal Site Reliability Engineer

We are looking for a reliability expert who is passionate about scaling Cloud se...
Salary: Not provided
Atlassian
Expiration Date: Until further notice
Requirements
  • Expert-level proficiency with 10+ years experience in one or more prominent languages such as Java, Go or Python
  • Expert-level proficiency with 7+ years experience in public cloud offerings (with at least 2+ years specifically on GCP)
  • Expert-level proficiency with 7+ years experience in operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring into your code, tweaking dashboards, defining alerts, writing runbooks, etc.
  • Excellent communication skills in written and verbal forms, and an ability to communicate complex technical issues to a range of technical and non-technical audiences (management, peers, clients)
  • An ability and desire to mentor and coach engineers
Job Responsibility
  • Analyse and help improve our services and processes to get us to an even higher level of reliability, performance, scalability, and cost efficiency
  • Cross team and functional boundaries to advocate for reliability methodologies
  • Work with a variety of platform, product and SRE teams to both build reliability into our platform and drive adoption of those practices into our products
  • Be the driving force for change

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location: Colombia
Salary: Not provided
Groupon
Expiration Date: Until further notice
Requirements
  • 10+ years in software/systems engineering
  • 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location: Ecuador
Salary: Not provided
Groupon
Expiration Date: Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location: Peru
Salary: Not provided
Groupon
Expiration Date: Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Platform Engineer

Principal Platform Engineer role at Endor Labs building the Application Security...
Location: India, Bengaluru
Salary: Not provided
Endor Labs
Expiration Date: Until further notice
Requirements
  • 12+ years of Site Reliability Engineering or Platform Engineering experience
  • Deep hands-on expertise with Kubernetes and CNCF ecosystem in production environments
  • Significant experience with at least one major cloud provider (Azure, Google Cloud, or AWS)
  • Strong experience managing large infrastructure deployments using Terraform, OpenTofu, or Terragrunt
  • Hands-on experience with open source observability tools (Prometheus, Grafana, Mimir, Pyroscope)
  • Self-driven problem solver with initiative
  • Customer-focused engineering mindset
  • Clear communication skills across technical and non-technical audiences
Job Responsibility
  • Build Cloud Infrastructure at Scale on Azure, Google Cloud, and AWS
  • Master Kubernetes & CNCF Ecosystem with multi-tenant clusters
  • Scale Observability Platform with Prometheus, Grafana, Mimir, and Pyroscope
  • Transform Developer Experience with self-service tools and automation
  • Drive Infrastructure as Code with Terraform/OpenTofu
  • Solve Complex Technical Challenges like zero-downtime migrations and cost optimization
  • Collaborate Across Teams with Security, Backend, and Product Engineering
  • Iterate and Innovate in fast-paced environment
Employment type: Full-time

Senior Principal Engineer - Atlassian Ecosystem and Marketplace

The Atlassian Ecosystem and Marketplace organization enables our customers to do...
Salary: Not provided
Atlassian
Expiration Date: Until further notice
Requirements
  • 10+ years of experience building software
  • 4+ years in an architect/principal role working across teams
  • Broad experience architecting, designing, and building large-scale systems with multiple dependencies
  • Passion for building quality solutions and upholding quality standards
  • Success with building, expressing, and pitching a technical vision to stakeholders
  • Experience with collaboration with an ecosystem of teams
  • Success with leading the long-term strategy for software architecture
  • Experience with building and operating large scale, high availability, high reliability services
  • Experience in operational requirements and common challenges of software systems
  • Experience working on developer productivity initiatives
Job Responsibility
  • Shape the forward-looking technical direction and long-term architecture for Ecosystem and Marketplace
  • Collaborate with product, engineering and design leaders to understand and influence the broader department level long term strategy
  • Ensure that the technical strategy you build is aligned with the technical strategy of Atlassian products and platforms
  • Partner with principal engineers and architects from other teams and drive exploration of large-scale projects spanning multiple teams in Enterprise
  • Provide pragmatic and balanced advice to the engineering leaders to invest in the long term architecture while also servicing the current systems with high quality
  • Improve, through example, the quality of software construction and meaningful code reviews in an agile environment
  • Be a role model for, and influence a large team of engineers at multiple seniority levels all the way from grads to principal engineers, and mentor engineers across the teams
  • Be influential within your team and work with peers and senior leaders to define and revise the standards for operational excellence across Atlassian
  • Mentor, hire and develop other engineers
What we offer
  • Health and wellbeing resources
  • Paid volunteer days

Principal Engineering Manager - Applied AI

We are looking for a Principal Engineering Manager to join our growing Applied A...
Location: United States, Seattle
Salary: 240870.00 - 297652.00 USD / Year
Highspot
Expiration Date: Until further notice
Requirements
  • 2+ years of experience in Generative AI and Agentic AI systems, including LLMs, context engineering, and modern vector-based retrieval systems
  • 4+ years working as an engineering manager
  • 8+ years working as a professional software developer
  • A great understanding of Generative AI systems, best practices and experience in shipping Agentic AI into distributed, data-intensive production systems
  • Experience developing and operating Cloud services at enterprise scale
  • Strong programming skills in Java, Python, C#, Typescript or equivalent programming language
  • Substantial depth and breadth of management experience to lead and grow an Applied AI team
  • Strong collaboration with teams of different backgrounds, expertise, and functions
  • Expertise in full product lifecycle: technical designs, project planning, iterative implementation, and successful product launches
Job Responsibility
  • Lead a team of Applied AI engineers that works at the bleeding edge of Generative AI to solve high-impact business challenges
  • Apply Generative AI to hard, unsolved problems in applying Agentic AI to real-world business challenges
  • Grow, coach, build and scale the Applied AI team
  • Drive operational excellence to achieve enterprise-grade scale, reliability, security, cost-efficiency and performance
  • Drive technical direction for building a safe, scalable and reliable Agentic AI platform for all of Highspot
  • Communicate complex concepts and the results of analyses in a clear and effective manner to technical and non-technical audiences
  • Collaborate with other team members and cross-functionally to share knowledge and discuss initiatives
What we offer
  • Comprehensive medical, dental, vision, disability, and life benefits
  • Health Savings Account (HSA) with employer contribution
  • 401(k) Matching with immediate vesting on employer match
  • Flexible PTO
  • 8 paid holidays and 5 paid days for Annual Holiday Week
  • Quarterly Recharge Fridays (paid days off for mental health recharge)
  • 18 weeks paid parental leave
  • Access to Coaches and Therapists through Modern Health
  • 2 volunteer days per year
  • Commuting benefits
Employment type: Full-time