Principal Site Reliability Engineering Manager Job at Microsoft Corporation (Redmond)

Principal Site Reliability Engineering Manager

Are you a Principal Site Reliability Engineering Manager interested in improving...

Location

United States, Redmond

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience
3+ years of people management experience
5+ years of experience planning, designing, implementing, and delivering large initiatives spanning multiple engineers as the primary owner, including operating and improving production services at scale
Experience leading reliability engineering for developer-facing or platform services, including incident response, automation/toil reduction, and observability (metrics/logs/tracing) built on top of mature observability platforms and practices
Experience working across disciplines, groups, and teams to align reliability priorities and delivery plans
Experience architecting, deploying, and operating enterprise scale distributed cloud services (Azure preferred), including containerization and orchestration
Experience operating engineering systems outer loop processes (CI/CD, build, and release platforms) with reliability, safety, and governance practices

Job Responsibility

Partner with engineers, product managers, and partner teams to design, operate, and maintain reliable and resilient services, with clear operational requirements (monitoring, alerting, runbooks, capacity, and failure modes)
Drive cross-org alignment through partnerships and co-development following the “One Microsoft” philosophy, including shared reliability standards and operational tooling
Build, grow, and retain a team of Site Reliability Engineers
Provide mentorship and coaching on reliability engineering, incident response, and pragmatic automation, both within and beyond your team
Define, implement, and operate SLOs/SLIs and error budgets for critical engineering systems services; use them to guide prioritization and continuous improvement
Lead incident management for your services, including on-call health, escalation paths, blameless post-incident reviews, and modeling follow-through on corrective and preventive actions
Drive automation to reduce toil and improve operational efficiency across build, validation, and deployment systems (e.g., self-healing, safe rollouts, and automated remediation)
Establish observability (metrics, logs, traces), capacity planning, and performance management to meet reliability and latency goals at scale
Foster a diverse and inclusive culture where everyone can bring their full and authentic self, while holding a high bar for customer impact and reliability

Fulltime

Principal Site Reliability Engineering Manager

The Principal SRE Manager leads the team responsible for durable, high quality h...

Location

Australia, Perth

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience
Proven experience leading teams through high severity production incidents in large, distributed systems
Demonstrated people leadership experience managing senior engineers or technical incident leaders
Strong understanding of incident management, reliability engineering, and live site operations at scale
Ability to drive clarity, accountability, and results in ambiguous, time critical situations
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check

Job Responsibility

Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
Lead, coach, and develop a team of Site Reliability Engineers serving as incident responders
Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
Hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
Communicate clearly and credibly with senior leadership during customer impacting events

Fulltime

Senior Site Reliability Engineering Manager

Microsoft Substrate is the foundational cloud platform that powers many of Micro...

Location

United States, Redmond

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Ability to obtain and maintain appropriate background investigations and customer screenings for access to GCC Moderate, GCC High, and Department of Defense environments
For access to GCCH and DoD environments, ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
For access to GCCM environments, ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
For manager-level roles, a Tier 5 (T5) background investigation is preferred
Pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter

Job Responsibility

Lead and develop a team of Site Reliability Engineer ICs, providing clear expectations, regular coaching, and career guidance across senior and principal levels
Own the operational health and reliability posture of Substrate services running in regulated environments
Drive change and influence across the org as you establish and drive SLOs, SLIs, and operational metrics
Lead effective incident management and post-incident reviews
Serve as an actively engaged on-call engineer (OCE) and participate in an on-call rotation
Own reliability, resilience, and disaster recovery, including driving and coordinating DR and game day exercises
Drive engineering led operational excellence at scale
Partner with engineering and product teams to embed reliability, security, and compliance considerations early in service design
Influence technical and operational strategy beyond your immediate team
Represent your team’s work clearly to leadership and partners

Fulltime

Principal Software Engineering Manager - AI Engineering

The Fabric Data Engineering Experience & Infrastructure team is hiring a Princip...

Location

Canada, Vancouver

Salary:

142400.00 - 257500.00 CAD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position requires passing the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Job Responsibility

Lead and grow a team: Hire, onboard, coach, and develop engineers; set clear expectations; create an inclusive culture of accountability, learning, and collaboration.
Drive execution and delivery: Guide team planning and prioritization across multiple workstreams; manage dependencies, risks, and release readiness; ensure predictable delivery from requirements → architecture → implementation → rollout → live-site operations.
Shape requirements with partners: Partner with Product Management, Design, Research, and dependent engineering teams to translate ambiguous customer needs into crisp scenario plans and measurable outcomes.
Guide architecture and technical strategy: Lead identification of dependencies and development of design documents; guide architectural decisions for distributed, cloud-scale systems (Spark/PySpark + Python services) with explicit tradeoffs across performance, reliability, cost, security, privacy, and operability.
Raise the engineering quality bar: Establish and reinforce engineering standards (design reviews, coding patterns, test strategy, performance practices, operational readiness).

Fulltime

Principal Site Reliability Engineer (Sovereign Cloud)

Your Career: Palo Alto Networks runs a large infrastructure and is one of the la...

Location

Bulgaria, Sofia

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

7+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
7+ years building high availability, scalable cloud native applications on AWS or GCP
BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience required
Expertise in configuration management with a framework such as Ansible, Terraform, Helm
Expertise in infrastructure automation tasks using Python and shell scripting
Experience in Site Reliability Engineering, Production Engineering, or DevOps
Expertise in public or private cloud
Solid experience in Kubernetes and containers
Linux administration, internals, and network troubleshooting
Proficiency with programming languages like Python, Java, Golang, and shell scripting to automate tasks

Job Responsibility

Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build and operate reliable, secure Cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate the robust deployment of services
Orchestrate end-to-end monitoring and alerting
Participate in on-call rotations to support critical business and production systems
Lead root cause analysis of critical business and production issues

Fulltime

Principal Site Reliability Engineer

Microsoft Substrate is the foundational cloud platform that powers many of Micro...

Location

United States, Redmond

Salary:

142800.00 - 304200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience
Candidates must be able to meet Microsoft, customer and/or government security screening requirements required for this role
This role requires access to Microsoft Government cloud environments, including GCC Moderate (GCCM), GCC High (GCCH), and Department of Defense (DoD) environments
The successful candidate must be able to obtain and maintain the appropriate background investigations and customer screenings required for access to these environments
For access to GCCH and DoD environments, this role requires the ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
For access to GCCM environments, this role requires the ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
For manager-level roles, a Tier 5 (T5) background investigation is preferred

Job Responsibility

Define and drive reliability strategy, SLO frameworks, and operational best practices across Substrate workloads in highly regulated environments
Serve as an actively engaged senior on-call engineer (OCE), participating in on-call rotations and leading incident response for Substrate services in regulated environments
Provide hands-on leadership during the most complex or high-impact incidents, setting technical direction and response strategy
Drive high-quality post-incident reviews that result in durable, systemic engineering improvements across teams
Architect and deliver large-scale automation, observability, and self-healing solutions
Drive architectural decisions and define software engineering standards that make reliability, security, and compliance intrinsic to Substrate services
Influence service design and engineering decisions across organizational boundaries
Mentor senior and principal engineers and shape the long-term technical direction of the SRE discipline
Represent Substrate SRE perspectives with senior leadership and cross-functional partners

Fulltime

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...

Location

United States, Santa Clara

Salary:

151600.00 - 245300.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
Proficient in Python and/or Go
Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
Experience in Production Engineering, DevOps, or Site Reliability
Expertise in the public cloud (GCP or AWS), especially in GCP
Strong Linux administration, internals, and network troubleshooting
Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
Experience with CI/CD pipelines, GitLab, and GitHub preferred
Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions

Job Responsibility

Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build, and operate reliable, secure Cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate the robust deployment of services
Orchestrate end-to-end monitoring and alerting
Participate with SRE and Dev teams in the on-call rotation
Lead root cause analysis of critical business and production issues

Fulltime

Principal Site Reliability Engineer (Sovereign Cloud)

Location

Bulgaria, Sofia

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

6+ years as a DevOps engineer, with a passion for technology, strong motivation, and a sense of responsibility
Proficiency in DevOps and Platform Engineering with expertise in AWS, GCP, Terraform, ArgoCD, Kubernetes, and related tools
Experience in developing and maintaining CI/CD pipelines for continuous delivery in agile environments
Skilled in managing cloud infrastructure, particularly with AWS and GCP, and adept in infrastructure as code practices using Terraform/Terragrunt
Demonstrated capability in supporting high-scale SaaS applications, focusing on scalability, reliability, and performance
Strong communication, strategic thinking, and problem-solving skills
Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
Ready to understand and dissect new technology stacks quickly

Job Responsibility

Implement and optimize CI/CD pipelines and cloud infrastructure using our technology stack, ensuring efficient and reliable deployment to production
Participate in the deployment of monitoring and alerting systems to maintain high system performance and reliability
Collaborate with software development and other cross-functional teams to streamline and enhance processes, aiming for efficiency and alignment with business goals
Contribute to the management of the cloud infrastructure, utilizing Infrastructure as Code principles
Participate in on-call rotations to support critical business and production systems

Fulltime
