We are seeking a Principal Service Reliability Engineer (SRE) to lead the reliability strategy for mission-critical, large-scale distributed systems. This role operates at a system and organizational level, driving reliability engineering practices across services, influencing architecture decisions, and establishing scalable frameworks for availability, performance, and operational excellence. The Principal SRE defines reliability standards (SLOs, SLIs, and error budgets) and partners with engineering, product, and platform teams to design, build, and operate resilient systems at enterprise scale. This role is accountable for reducing systemic risk, eliminating operational toil, and advancing toward autonomous, self-healing platforms.
Job Responsibilities
Define and drive reliability strategy across services, including measurable targets for availability, latency, and performance aligned to business priorities
Establish and enforce SLO/SLI frameworks and error budgets across multiple teams, ensuring consistent adoption and accountability
Lead complex incident management and systemic root cause analysis (RCA) efforts, identifying cross-service failure patterns and driving durable, long-term fixes
Influence architecture and platform design to improve operability, scalability, fault isolation, and disaster recovery at enterprise scale
Drive reliability engineering standards for observability (metrics, logs, traces), capacity planning, and production readiness across the organization
Eliminate operational toil through automation, enabling self-healing systems and reducing manual intervention
Embed security, compliance, and resiliency practices into design and operational processes, ensuring alignment with enterprise requirements
Partner with engineering leadership to prioritize reliability investments and balance feature velocity with system stability
Lead and mentor engineers while shaping a strong reliability culture across teams and organizational boundaries
Requirements
8+ years of technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years of technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years of technical experience in software engineering, network engineering, or systems administration
OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years of technical experience in software engineering, network engineering, or systems administration
Proven track record of defining and operationalizing SLOs, SLIs, and error budgets across multiple services or organizations
Experience leading reliability efforts for enterprise-scale or globally distributed systems
Advanced debugging and troubleshooting skills across application, platform, and infrastructure layers
Demonstrated ability to mentor senior engineers and influence engineering culture at scale