Principal Site Reliability Engineering Manager

Australia, Perth · Job Posted March 01, 2026

Job Description

The Principal SRE Manager leads the team responsible for durable, high-quality handling of high-severity, customer-impacting incidents across Microsoft M365 Substrate Core services. As our systems continue to expand in scope and complexity, this role ensures incidents are handled consistently, predictably, and with clear ownership, minimizing customer impact while accelerating recovery and organizational learning. This role combines people leadership, incident command, and operational governance, working in close partnership with Incident Managers (IMs), Service Owners, and executive stakeholders. You will set standards for how Substrate responds to its most severe outages and drive the evolution of incident handling and escalation practices across all production rings and public/sovereign clouds.

Job Responsibility

  • Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
  • Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Lead, coach, and develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer impacting events

Requirements

  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Proven experience leading teams through high severity production incidents in large, distributed systems
  • Demonstrated people leadership experience managing senior engineers or technical incident leaders
  • Strong understanding of incident management, reliability engineering, and live site operations at scale
  • Ability to drive clarity, accountability, and results in ambiguous, time critical situations
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check

Nice to have

  • Experience building or scaling incident response programs at organizational or enterprise scope
  • Background in SRE, production engineering, or platform reliability roles
  • Track record of reducing customer impact through improved incident handling, tooling, or prevention
  • Experience operating in follow-the-sun or globally distributed incident response models

Similar Jobs for Principal Site Reliability Engineering Manager

8 matching positions

Principal Site Reliability Engineering Manager

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location: United States, Redmond
Salary: 139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Candidates must be able to meet Microsoft, customer and/or government security screening requirements required for this role
  • This role requires access to Microsoft Government cloud environments, including GCC Moderate (GCCM), GCC High (GCCH), and Department of Defense (DoD) environments
  • For access to GCCH and DoD environments, this role requires the ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
  • For access to GCCM environments, this role requires the ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • For manager-level roles, a Tier 5 (T5) background investigation is preferred
  • Candidates may be considered without currently holding these background investigations, provided they are eligible for and able to successfully obtain them
Job Responsibility
  • Lead and develop a team of Site Reliability Engineer ICs, providing clear expectations, regular coaching, and career guidance across senior and principal levels
  • Own the operational health and reliability posture of Substrate services running in regulated environments
  • Drive change and influence across the org as you establish and drive SLOs, SLIs, and operational metrics
  • Lead effective incident management and post-incident reviews
  • Serve as an actively engaged on-call engineer (OCE) and participate in an on-call rotation
  • Own reliability, resilience, and disaster recovery, including driving and coordinating DR and game day exercises
  • Drive engineering led operational excellence at scale
  • Partner with engineering and product teams to embed reliability, security, and compliance considerations early in service design
  • Influence technical and operational strategy beyond your immediate team
  • Represent your team’s work clearly to leadership and partners
  • Fulltime

Principal Site Reliability Engineering Manager

Are you a Principal Site Reliability Engineering Manager interested in improving...
Location: United States, Redmond
Salary: 139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • 3+ years of people management experience
  • 5+ years of experience planning, designing, implementing, and delivering large initiatives spanning multiple engineers as the primary owner, including operating and improving production services at scale
  • Experience leading reliability engineering for developer-facing or platform services, including incident response, automation/toil reduction, and observability (metrics/logs/tracing) built on top of mature observability platforms and practices
  • Experience working across disciplines, groups, and teams to align reliability priorities and delivery plans
  • Experience architecting, deploying, and operating enterprise scale distributed cloud services (Azure preferred), including containerization and orchestration
  • Experience operating engineering systems outer loop processes (CI/CD, build, and release platforms) with reliability, safety, and governance practices
Job Responsibility
  • Partner with engineers, product managers, and partner teams to design, operate, and maintain reliable and resilient services, with clear operational requirements (monitoring, alerting, runbooks, capacity, and failure modes)
  • Drive cross-org alignment through partnerships and co-development following the “One Microsoft” philosophy, including shared reliability standards and operational tooling
  • Build, grow, and retain a team of Site Reliability Engineers
  • Provide mentorship and coaching on reliability engineering, incident response, and pragmatic automation—within and beyond your team
  • Define, implement, and operate SLOs/SLIs and error budgets for critical engineering systems services; use them to guide prioritization and continuous improvement
  • Lead incident management for your services, including on-call health, escalation paths, and blameless post-incident reviews, modeling follow-through on corrective and preventive actions
  • Drive automation to reduce toil and improve operational efficiency across build, validation, and deployment systems (e.g., self-healing, safe rollouts, and automated remediation)
  • Establish observability (metrics, logs, traces), capacity planning, and performance management to meet reliability and latency goals at scale
  • Foster a diverse and inclusive culture where everyone can bring their full and authentic self, while holding a high bar for customer impact and reliability
  • Fulltime

Senior Site Reliability Engineering Manager

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Ability to obtain and maintain appropriate background investigations and customer screenings for access to GCC Moderate, GCC High, and Department of Defense environments
  • For access to GCCH and DoD environments, ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
  • For access to GCCM environments, ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • For manager-level roles, a Tier 5 (T5) background investigation is preferred
  • Pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility
  • Lead and develop a team of Site Reliability Engineer ICs, providing clear expectations, regular coaching, and career guidance across senior and principal levels
  • Own the operational health and reliability posture of Substrate services running in regulated environments
  • Drive change and influence across the org as you establish and drive SLOs, SLIs, and operational metrics
  • Lead effective incident management and post-incident reviews
  • Serve as an actively engaged on-call engineer (OCE) and participate in an on-call rotation
  • Own reliability, resilience, and disaster recovery, including driving and coordinating DR and game day exercises
  • Drive engineering led operational excellence at scale
  • Partner with engineering and product teams to embed reliability, security, and compliance considerations early in service design
  • Influence technical and operational strategy beyond your immediate team
  • Represent your team’s work clearly to leadership and partners
  • Fulltime

Executive Principal, Site Reliability Engineering (SRE) – DevOps

The Executive Principal of Infra Engineering is a senior leader responsible for ...
Location: United States, Irvine
Salary: 180000.00 - 210000.00 USD / Year
Hyundai AutoEver America
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in IT/IS or equivalent experience
  • 10 years of infrastructure engineering experience
  • 8+ years of management experience required
  • High availability, fault tolerance, and incident management
  • Automation of infrastructure and operations
  • CI/CD pipeline design and maintenance
  • Monitoring, metrics, and performance tuning
  • Multi-platform expertise (Windows, Linux, VMware, cloud)
  • Security, audit, and identity/access management
  • Change control and risk management
Job Responsibility
  • Guide the Site Reliability Engineering (SRE) function, integrating DevOps principles to drive operational excellence, reliability, and innovation across infrastructure platforms
  • Lead multiple technical teams, including Platform Engineering, Data Center Management, Infrastructure Planning & Architecture and Network & Telecommunications, ensuring 24x7 support and continuous improvement within a complex, hybrid environment
  • Mentor and develop infrastructure managers and SMEs
  • Lead onshore/offshore teams and manage service providers
  • Oversee 24x7 operations, incident response, and problem management
  • Manage OpEx/CapEx, SLAs, KPIs, and OKRs
  • Ensure reliability, disaster recovery, and lifecycle management
  • Champion automation, CI/CD, and Infrastructure as Code
  • Direct monitoring, observability, and performance optimization
  • Align with security and compliance requirements
  • Fulltime

Principal Software Engineering Manager - AI Engineering

The Fabric Data Engineering Experience & Infrastructure team is hiring a Princip...
Location: Canada, Vancouver
Salary: 142400.00 - 257500.00 CAD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science, or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Job Responsibility
  • Lead and grow a team: Hire, onboard, coach, and develop engineers; set clear expectations; create an inclusive culture of accountability, learning, and collaboration.
  • Drive execution and delivery: Guide team planning and prioritization across multiple workstreams; manage dependencies, risks, and release readiness; ensure predictable delivery from requirements → architecture → implementation → rollout → live-site operations.
  • Shape requirements with partners: Partner with Product Management, Design, Research, and dependent engineering teams to translate ambiguous customer needs into crisp scenario plans and measurable outcomes.
  • Guide architecture and technical strategy: Lead identification of dependencies and development of design documents; guide architectural decisions for distributed, cloud-scale systems (Spark/PySpark + Python services) with explicit tradeoffs across performance, reliability, cost, security, privacy, and operability.
  • Raise the engineering quality bar: Establish and reinforce engineering standards (design reviews, coding patterns, test strategy, performance practices, operational readiness)
  • Fulltime

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...
Location: United States, Santa Clara
Salary: 151600.00 - 245300.00 USD / Year
Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
  • Proficient in Python and/or Go
  • Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in the public cloud (GCP or AWS), especially in GCP
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Experience with CI/CD pipelines, GitLab, and GitHub preferred
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build, and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate robust deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
  • Fulltime

Principal Site Reliability Engineer

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location: United States, Redmond
Salary: 142800.00 - 304200.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Candidates must be able to meet Microsoft, customer and/or government security screening requirements required for this role
  • This role requires access to Microsoft Government cloud environments, including GCC Moderate (GCCM), GCC High (GCCH), and Department of Defense (DoD) environments
  • The successful candidate must be able to obtain and maintain the appropriate background investigations and customer screenings required for access to these environments
  • For access to GCCH and DoD environments, this role requires the ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
  • For access to GCCM environments, this role requires the ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • For manager-level roles, a Tier 5 (T5) background investigation is preferred
Job Responsibility
  • Define and drive reliability strategy, SLO frameworks, and operational best practices across Substrate workloads in highly regulated environments
  • Serve as an actively engaged senior on-call engineer (OCE), participating in on-call rotations and leading incident response for Substrate services in regulated environments
  • Provide hands-on leadership during the most complex or high-impact incidents, setting technical direction and response strategy
  • Drive high-quality post-incident reviews that result in durable, systemic engineering improvements across teams
  • Architect and deliver large-scale automation, observability, and self-healing solutions
  • Drive architectural decisions and define software engineering standards that make reliability, security, and compliance intrinsic to Substrate services
  • Influence service design and engineering decisions across organizational boundaries
  • Mentor senior and principal engineers and shape the long-term technical direction of the SRE discipline
  • Represent Substrate SRE perspectives with senior leadership and cross-functional partners
  • Fulltime

Principal Site Reliability Engineer (Sovereign Cloud)

Location: Bulgaria, Sofia
Salary: Not provided
Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 6+ years as DevOps engineer with a passion for technology, strong motivation and responsibility
  • Proficiency in DevOps and Platform Engineering with expertise in AWS, GCP, Terraform, ArgoCD, Kubernetes, and related tools
  • Experience in developing and maintaining CI/CD pipelines for continuous delivery in agile environments
  • Skilled in managing cloud infrastructure, particularly with AWS and GCP, and adept in infrastructure as code practices using Terraform/Terragrunt
  • Demonstrated capability in supporting high-scale SaaS applications, focusing on scalability, reliability, and performance
  • Strong communication, strategic thinking, and problem-solving skills
  • Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
  • Ready to understand and dissect new technology stacks quickly
Job Responsibility
  • Implement and optimize CI/CD pipelines and cloud infrastructure using our technology stack, ensuring efficient and reliable deployment to production
  • Participate in the deployment of monitoring and alerting systems to maintain high system performance and reliability
  • Collaborate with software development and other cross-functional teams to streamline and enhance processes, aiming for efficiency and alignment with business goals
  • Contribute to the management of the cloud infrastructure, utilizing Infrastructure as Code principles
  • Participate in on-call rotations to support critical business and production systems
  • Fulltime