Principal Site Reliability Engineering Manager

United States, Redmond · 139900.00 - 274800.00 USD / Year · Job Posted March 19, 2026

Job Description

Are you a Principal Site Reliability Engineering Manager interested in improving the reliability of large-scale engineering systems serving multiple major Microsoft divisions? Are you seeking an opportunity to transform how we deliver engineering services, built on a foundation of SRE principles and practices like automated monitoring and alerting, automatic failover, and broad-scale service best practices? Are you motivated by coaching and people leadership, helping a team of diverse SREs unlock their full potential? If so, we have an opportunity for you.
The ES365 org is responsible for the engineering systems, tools, and services that comprise the end-to-end developer experiences for the teams that build Office, Exchange, and Microsoft 365, and who work in our large-scale web frontend monorepo. Our areas of ownership cover source control, check-in processes, build, validation, and deployment automation. Reliability and operational proficiency are critical to keeping engineering teams productive, and our business needs have shifted from local on-prem operations experience to building and operating reliable cloud services at scale.
The Principal Site Reliability Engineering Manager will work effectively with a range of stakeholders, from executives to engineers, balancing near-term reliability improvements with long-term resilience strategies. You will drive cross-org partnerships, establish service level objectives (SLOs) and indicators (SLIs), and lead incident response and continuous improvement through Engineering Service Reviews, SRE service co-ownership campaigns, and updated service best practices.
We believe that significant achievements happen within high-functioning, trust-filled teams. A reliable manager ensures success in execution, promotes career growth, and cultivates a culture centered on customer focus, collaboration, diversity, and inclusion.
If you are committed to improving engineers' productivity and satisfaction through reliable, scalable tool and service operations, consider joining ES365. Be at the core of Microsoft and help shape the future of Engineering Systems by raising the bar on availability, performance, and operational success.

Job Responsibility

  • Partner with engineers, product managers, and partner teams to design, operate, and maintain reliable and resilient services, with clear operational requirements (monitoring, alerting, runbooks, capacity, and failure modes)
  • Drive cross-org alignment through partnerships and co-development following the “One Microsoft” philosophy, including shared reliability standards and operational tooling
  • Build, grow, and retain a team of Site Reliability Engineers
  • Provide mentorship and coaching on reliability engineering, incident response, and pragmatic automation—within and beyond your team
  • Define, implement, and operate SLOs/SLIs and error budgets for critical engineering systems services; use them to guide prioritization and continuous improvement
  • Lead incident management for your services, including on-call health, escalation paths, and blameless post-incident reviews, modeling follow-through on corrective and preventive actions
  • Drive automation to reduce toil and improve operational efficiency across build, validation, and deployment systems (e.g., self-healing, safe rollouts, and automated remediation)
  • Establish observability (metrics, logs, traces), capacity planning, and performance management to meet reliability and latency goals at scale
  • Foster a diverse and inclusive culture where everyone can bring their full and authentic self, while holding a high bar for customer impact and reliability
  • Embody our culture and values

Requirements

  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • 3+ years of people management experience
  • 5+ years of experience planning, designing, implementing, and delivering large initiatives spanning multiple engineers as the primary owner, including operating and improving production services at scale
  • Experience leading reliability engineering for developer-facing or platform services, including incident response, automation/toil reduction, and observability (metrics/logs/tracing) built on top of mature observability platforms and practices
  • Experience working across disciplines, groups, and teams to align reliability priorities and delivery plans
  • Experience architecting, deploying, and operating enterprise scale distributed cloud services (Azure preferred), including containerization and orchestration
  • Experience operating engineering systems outer loop processes (CI/CD, build, and release platforms) with reliability, safety, and governance practices

Similar Jobs for Principal Site Reliability Engineering Manager

8 matching positions

Principal Site Reliability Engineering Manager

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location: United States, Redmond
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Candidates must be able to meet Microsoft, customer and/or government security screening requirements required for this role
  • This role requires access to Microsoft Government cloud environments, including GCC Moderate (GCCM), GCC High (GCCH), and Department of Defense (DoD) environments
  • For access to GCCH and DoD environments, this role requires the ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
  • For access to GCCM environments, this role requires the ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • For manager-level roles, a Tier 5 (T5) background investigation is preferred
  • Candidates may be considered without currently holding these background investigations, provided they are eligible for and able to successfully obtain them
Job Responsibility
  • Lead and develop a team of Site Reliability Engineer ICs, providing clear expectations, regular coaching, and career guidance across senior and principal levels
  • Own the operational health and reliability posture of Substrate services running in regulated environments
  • Drive change and influence across the org as you establish and drive SLOs, SLIs, and operational metrics
  • Lead effective incident management and post-incident reviews
  • Serve as an actively engaged on-call engineer (OCE) and participate in an on-call rotation
  • Own reliability, resilience, and disaster recovery, including driving and coordinating DR and game day exercises
  • Drive engineering-led operational excellence at scale
  • Partner with engineering and product teams to embed reliability, security, and compliance considerations early in service design
  • Influence technical and operational strategy beyond your immediate team
  • Represent your team’s work clearly to leadership and partners
Employment type: Full-time

Principal Site Reliability Engineering Manager

The Principal SRE Manager leads the team responsible for durable, high quality h...
Location: Australia, Perth
Salary: Not provided
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Proven experience leading teams through high severity production incidents in large, distributed systems
  • Demonstrated people leadership experience managing senior engineers or technical incident leaders
  • Strong understanding of incident management, reliability engineering, and live site operations at scale
  • Ability to drive clarity, accountability, and results in ambiguous, time critical situations
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
Job Responsibility
  • Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
  • Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Lead, coach, and develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer impacting events
Employment type: Full-time

Senior Site Reliability Engineering Manager

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Ability to obtain and maintain appropriate background investigations and customer screenings for access to GCC Moderate, GCC High, and Department of Defense environments
  • For access to GCCH and DoD environments, ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
  • For access to GCCM environments, ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • For manager-level roles, a Tier 5 (T5) background investigation is preferred
  • Pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility
  • Lead and develop a team of Site Reliability Engineer ICs, providing clear expectations, regular coaching, and career guidance across senior and principal levels
  • Own the operational health and reliability posture of Substrate services running in regulated environments
  • Drive change and influence across the org as you establish and drive SLOs, SLIs, and operational metrics
  • Lead effective incident management and post-incident reviews
  • Serve as an actively engaged on-call engineer (OCE) and participate in an on-call rotation
  • Own reliability, resilience, and disaster recovery, including driving and coordinating DR and game day exercises
  • Drive engineering-led operational excellence at scale
  • Partner with engineering and product teams to embed reliability, security, and compliance considerations early in service design
  • Influence technical and operational strategy beyond your immediate team
  • Represent your team’s work clearly to leadership and partners
Employment type: Full-time

Executive Principal, Site Reliability Engineering (SRE) – DevOps

The Executive Principal of Infra Engineering is a senior leader responsible for ...
Location: United States, Irvine
Salary: 180000.00 - 210000.00 USD / Year
Company: Hyundai AutoEver America (haeaus.com)
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in IT/IS or equivalent experience
  • 10 years of infrastructure engineering experience
  • 8+ years of management experience required
  • High availability, fault tolerance, and incident management
  • Automation of infrastructure and operations
  • CI/CD pipeline design and maintenance
  • Monitoring, metrics, and performance tuning
  • Multi-platform expertise (Windows, Linux, VMware, cloud)
  • Security, audit, and identity/access management
  • Change control and risk management
Job Responsibility
  • Guide the Site Reliability Engineering (SRE) function, integrating DevOps principles to drive operational excellence, reliability, and innovation across infrastructure platforms
  • Lead multiple technical teams, including Platform Engineering, Data Center Management, Infrastructure Planning & Architecture and Network & Telecommunications, ensuring 24x7 support and continuous improvement within a complex, hybrid environment
  • Mentor and develop infrastructure managers and SMEs
  • Lead onshore/offshore teams and manage service providers
  • Oversee 24x7 operations, incident response, and problem management
  • Manage OpEx/CapEx, SLAs, KPIs, and OKRs
  • Ensure reliability, disaster recovery, and lifecycle management
  • Champion automation, CI/CD, and Infrastructure as Code
  • Direct monitoring, observability, and performance optimization
  • Align with security and compliance requirements
Employment type: Full-time

Principal Software Engineering Manager - AI Engineering

The Fabric Data Engineering Experience & Infrastructure team is hiring a Princip...
Location: Canada, Vancouver
Salary: 142400.00 - 257500.00 CAD / Year
Company: Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility
  • Lead and grow a team: Hire, onboard, coach, and develop engineers; set clear expectations; create an inclusive culture of accountability, learning, and collaboration.
  • Drive execution and delivery: Guide team planning and prioritization across multiple workstreams; manage dependencies, risks, and release readiness; ensure predictable delivery from requirements → architecture → implementation → rollout → live-site operations.
  • Shape requirements with partners: Partner with Product Management, Design, Research, and dependent engineering teams to translate ambiguous customer needs into crisp scenario plans and measurable outcomes.
  • Guide architecture and technical strategy: Lead identification of dependencies and development of design documents; guide architectural decisions for distributed, cloud-scale systems (Spark/PySpark + Python services) with explicit tradeoffs across performance, reliability, cost, security, privacy, and operability.
  • Raise the engineering quality bar: Establish and reinforce engineering standards (design reviews, coding patterns, test strategy, performance practices, operational readiness)
Employment type: Full-time

Principal Site Reliability Engineer (Sovereign Cloud)

Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks (paloaltonetworks.com)
Expiration Date: Until further notice
Requirements
  • 6+ years as a DevOps engineer with a passion for technology, strong motivation and responsibility
  • Proficiency in DevOps and Platform Engineering with expertise in AWS, GCP, Terraform, ArgoCD, Kubernetes, and related tools
  • Experience in developing and maintaining CI/CD pipelines for continuous delivery in agile environments
  • Skilled in managing cloud infrastructure, particularly with AWS and GCP, and adept in infrastructure as code practices using Terraform/Terragrunt
  • Demonstrated capability in supporting high-scale SaaS applications, focusing on scalability, reliability, and performance
  • Strong communication, strategic thinking, and problem-solving skills
  • Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
  • Ready to understand and dissect new technology stacks quickly
Job Responsibility
  • Implement and optimize CI/CD pipelines and cloud infrastructure using our technology stack, ensuring efficient and reliable deployment to production
  • Participate in the deployment of monitoring and alerting systems to maintain high system performance and reliability
  • Collaborate with software development and other cross-functional teams to streamline and enhance processes, aiming for efficiency and alignment with business goals
  • Contribute to the management of the cloud infrastructure, utilizing Infrastructure as Code principles
  • Participate in on-call rotations to support critical business and production systems
Employment type: Full-time

Sr Principal Site Reliability Engineer (Sovereign Cloud)

The Prisma Access team is seeking a seasoned Principal Site Reliability Engineer...
Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks (paloaltonetworks.com)
Expiration Date: Until further notice
Requirements
  • 10+ years of experience in Infrastructure, SRE, or DevOps roles
  • BS or MS in Computer Science, a related field, or equivalent professional experience
  • 7+ years of experience with GCP, and expertise in its architecture, services and PKI concepts for cloud security
  • Expert troubleshooting skills to resolve cloud infrastructure and service issues, effectively identifying root cause and devising effective solutions
  • Proficiency in automation using Python and shell scripting
  • Expertise in Infrastructure as Code (IaC) with Terraform and Helm, leveraging AI tools for development
  • Solid experience with Kubernetes, container networking, and container workloads
  • Strong Linux administration skills
  • Proficiency with CI/CD pipelines, GitOps principles, and tooling like GitLab and Jenkins
  • Excellent written and verbal communication skills, with the ability to collaborate effectively to drive outcomes
Job Responsibility
  • Design, build, and operate reliable, secure Cloud infrastructure across multi-cloud environments for our sovereign customers
  • Lead cross-functional initiatives to ensure applications are production-ready, scalable, secure, and resilient
  • Develop expertise in new technologies, embracing continuous learning and the adoption of AI tools
  • Develop tools and automation frameworks, championing Infrastructure as Code (IaC) and Monitoring as Code (MaC) principles
  • Automate robust deployments and orchestrate end-to-end monitoring and alerting solutions
  • Participate in on-call rotations to support critical business and production systems
  • Lead root cause analysis of critical issues, driving improvements and preventing recurrence
  • Champion the success of SRE and DevOps initiatives, aligning technical decisions with business goals
Employment type: Full-time

Sr Principal Site Reliability Engineer (Sovereign Cloud)

Palo Alto Networks runs a large infrastructure and is one of the largest GCP cus...
Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks (paloaltonetworks.com)
Expiration Date: Until further notice
Requirements
  • 10+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
  • 7+ years building high availability, scalable cloud-native applications on AWS and GCP
  • BS or MS in Computer Science, a related field, or equivalent professional experience required
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm
  • Passion for infrastructure and monitoring as code
  • Solid experience in container workloads and Kubernetes
  • Familiarity with PKI and networking concepts
  • In-depth knowledge of different security controls (app-id, user-id, security profile, URL category, content, SSL decryption, firewall MFA, etc.)
  • Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Golang or Python along with shell scripting to automate tasks
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate in on-call rotations to support critical business and production systems
  • Lead root cause analysis of critical business and production issues
Employment type: Full-time