Principal Service Reliability Engineer Job at Microsoft Corporation (Redmond)

Job Description

We are seeking a Principal Service Reliability Engineer (SRE) to lead the reliability strategy for mission-critical, large-scale distributed systems. This role operates at a system and organizational level, driving reliability engineering practices across services, influencing architecture decisions, and establishing scalable frameworks for availability, performance, and operational excellence. The Principal SRE defines reliability standards (SLOs, SLIs, and error budgets) and partners with engineering, product, and platform teams to design, build, and operate resilient systems at enterprise scale. This role is accountable for reducing systemic risk, eliminating operational toil, and advancing toward autonomous, self-healing platforms.

Job Responsibilities

Define and drive reliability strategy across services, including measurable targets for availability, latency, and performance aligned to business priorities
Establish and enforce SLO/SLI frameworks and error budgets across multiple teams, ensuring consistent adoption and accountability
Lead complex incident management and systemic RCA efforts, identifying cross-service failure patterns and driving durable, long-term fixes
Influence architecture and platform design to improve operability, scalability, fault isolation, and disaster recovery at enterprise scale
Drive reliability engineering standards for observability (metrics, logs, traces), capacity planning, and production readiness across the organization
Eliminate operational toil through automation, enabling self-healing systems and reducing manual intervention
Embed security, compliance, and resiliency practices into design and operational processes, ensuring alignment with enterprise requirements
Partner with engineering leadership to prioritize reliability investments and balance feature velocity with system stability
Lead and mentor engineers while shaping a strong reliability culture across teams and org boundaries

Requirements

One of the following: 8+ years of technical experience in software engineering, network engineering, or systems administration; OR a Bachelor's Degree in Computer Science, Information Technology, or a related field AND 5+ years of such experience; OR a Master's Degree in Computer Science, Information Technology, or a related field AND 3+ years of such experience; OR a Doctorate in Computer Science, Information Technology, or a related field AND 2+ years of such experience
Proven track record of defining and operationalizing SLOs, SLIs, and error budgets across multiple services or organizations
Experience leading reliability efforts for enterprise-scale or globally distributed systems
Advanced debugging and troubleshooting skills across application, platform, and infrastructure layers
Demonstrated ability to mentor senior engineers and influence engineering culture at scale
Experience driving platform-level improvements (e.g., standardized observability, shared reliability tooling, automated remediation frameworks)
Extensive experience operating large-scale, distributed production systems, including cloud-native platforms (Azure preferred)
Demonstrated ability to drive cross-team technical initiatives and influence architecture and engineering standards
Deep experience in observability, incident management, and production operations at scale
Strong understanding of Azure networking, distributed systems performance, and reliability engineering principles
Experience leveraging data platforms (Kusto, Power BI, telemetry pipelines) to drive operational insights and decision-making
