CrawlJobs Logo

Senior Site Reliability Engineering Manager

United States, Redmond 119800.00 - 234700.00 USD / Year · Job Posted March 22, 2026
Apply Position
Job Link Share

Job Description

Microsoft Substrate is the foundational cloud platform that powers many of Microsoft’s most critical services including Exchange Online and M365 Copilot. As a Senior Site Reliability Engineer, you will lead reliability improvements across multiple services or components, solving complex, cross-cutting operational problems. You will influence service architecture, define reliability strategies, mentor other SREs, and drive meaningful reductions in incidents, toil, and operational risk—particularly for services operating in regulated, sovereign, or compliance-sensitive environments.

Job Responsibility

  • Lead and develop a team of Site Reliability Engineer ICs, providing clear expectations, regular coaching, and career guidance across senior and principal levels
  • Own the operational health and reliability posture of Substrate services running in regulated environments
  • Drive change and influence across the org as you establish and drive SLOs, SLIs, and operational metrics
  • Lead effective incident management and post-incident reviews
  • Serve as an actively engaged on-call engineer (OCE) and participate in an on-call rotation
  • Own reliability, resilience, and disaster recovery, including driving and coordinating DR and game day exercises
  • Drive engineering led operational excellence at scale
  • Partner with engineering and product teams to embed reliability, security, and compliance considerations early in service design
  • Influence technical and operational strategy beyond your immediate team
  • Represent your team’s work clearly to leadership and partners
  • Build strong cross-functional relationships with security, compliance, infrastructure, and operations teams

Requirements

  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Ability to obtain and maintain appropriate background investigations and customer screenings for access to GCC Moderate, GCC High, and Department of Defense environments
  • For access to GCCH and DoD environments, ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
  • For access to GCCM environments, ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • For manager-level roles, a Tier 5 (T5) background investigation is preferred
  • Pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter

Nice to have

  • Doctorate Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 12+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • 7+ years technical experience working with large-scale cloud or distributed systems
  • 3+ years people management experience
  • Experience operating or supporting services in regulated, sovereign, or compliance-sensitive environments

Looking for more opportunities?

Search for other job offers that match your skills and interests.

Similar Jobs for

Senior Site Reliability Engineering Manager

8 matching positions

Senior Manager, Reliability Engineering & Work Management

Join Amgen’s Mission of Serving Patients. At Amgen, if you feel like you’re part...
Location
Location
United States , Holly Springs
Salary
Salary:
148715.15 - 201202.85 USD / Year
amgen.com Logo
Amgen
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • High school diploma / GED and 12 years of engineering, maintenance, reliability, or CMMS experience OR Associate’s degree and 10 years of engineering, maintenance, reliability, or CMMS experience OR Bachelor’s degree and 8 years of engineering, maintenance, reliability, or CMMS experience OR Master’s degree and 6 years of engineering, maintenance, reliability, or CMMS experience OR Doctorate degree and 2 years of engineering, maintenance, reliability, or CMMS experience
  • minimum of 2 years experience directly managing people and/or leadership experience leading teams, projects, programs, or directing the allocation or resources
Job Responsibility
Job Responsibility
  • Lead and develop a multi-functional engineering operations organization responsible for work order administration, CMMS governance, maintenance planning and scheduling, reliability engineering, central inventory management, and KPI reporting
  • Lead the Central Inventory (equipment spare parts) program working closely with the Internal Service Provider as it relates to the overall MRO process supporting the site
  • Lead the site reliability program, aligning on the network approach for the Amgen North Carolina reliability roadmap, implementing initiatives and own engineering KPIs and business reporting on program and site equipment health
  • Establish long-range maintenance and asset reliability strategies aligned with site operational goals, regulatory compliance requirements, and enterprise engineering standards
  • Identify opportunities for continuous improvement, recommend solutions, and manage implementation of initiatives to improve planning and scheduling accuracy and adherence and overall asset management program
  • Collaborate with key customers and support groups regarding preventative maintenance activities and on time completion as well as manage aging work order backlog
  • Collaborate with corporate network groups to improve planning and scheduling performance in addition to improving programs within CMMS, Central Inventory, and Reliability to meet business needs
  • Provide direct supervision and oversight of work order safety programs and planning procedures
  • Experience working in a regulated environment (e.g. cGMP, OSHA, EPA, etc.), experience interacting with regulatory agencies and inspectors, and familiarity with GMP quality systems/processes such as change control, non-conformances, corrective and preventative actions, and qualifications/validation
  • Strong leadership, technical writing, and communication/presentation skills
What we offer
What we offer
  • A comprehensive employee benefits package, including a Retirement and Savings Plan with generous company contributions, group medical, dental and vision coverage, life and disability insurance, and flexible spending accounts
  • A discretionary annual bonus program
  • Stock-based long-term incentives
  • Award-winning time-off plans
  • Flexible work models where possible
  • Fulltime
Read More
Arrow Right

Loan IQ Product Development and Site Reliability Engineering Manager

The Applications Development Group Manager is a senior management level position...
Location
Location
Singapore , Singapore
Salary
Salary:
Not provided
https://www.citi.com/ Logo
Citi
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years of relevant experience
  • 5 years of experience in Loan IQ product
  • 10+ years of experience with Java related technologies including spring boot and angular
  • 5+ years experience of platform management
  • Experience managing global technology teams
  • Working knowledge of industry practices and standards
  • Consistently demonstrates clear and concise written and verbal communication
  • Bachelor's degree/University degree or equivalent experience
Job Responsibility
Job Responsibility
  • Manage multiple teams of professionals to accomplish established goals and conduct personnel duties for team (e.g. performance evaluations, hiring and disciplinary actions)
  • Provide strategic influence and exercise control over resources, budget management and planning while monitoring end results
  • Utilize in-depth knowledge of concepts and procedures within own area and basic knowledge of other areas to resolve issues
  • Ensure essential procedures are followed and contribute to defining standards
  • Integrate in-depth knowledge of applications development with overall technology function to achieve established goals
  • Provide evaluative judgement based on analysis of facts in complicated, unique, and dynamic situations including drawing from internal and external sources
  • Influence and negotiate with senior leaders across functions, as well as communicate with external parties as necessary
  • Appropriately assess risk when business decisions are made, demonstrating particular consideration for the firm's reputation and safeguarding Citigroup, its clients and assets, by driving compliance with applicable laws, rules and regulations, adhering to Policy, applying sound ethical judgment regarding personal behavior, conduct and business practices, and escalating, managing and reporting control issues with transparency, as well as effectively supervise the activity of others and create accountability with those who fail to maintain these standards
  • Fulltime
Read More
Arrow Right

Principal Site Reliability Engineering Manager

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location
Location
United States , Redmond
Salary
Salary:
139900.00 - 274800.00 USD / Year
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Candidates must be able to meet Microsoft, customer and/or government security screening requirements required for this role
  • This role requires access to Microsoft Government cloud environments, including GCC Moderate (GCCM), GCC High (GCCH), and Department of Defense (DoD) environments
  • For access to GCCH and DoD environments, this role requires the ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
  • For access to GCCM environments, this role requires the ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • For manager-level roles, a Tier 5 (T5) background investigation is preferred
  • Candidates may be considered without currently holding these background investigations, provided they are eligible for and able to successfully obtain them
Job Responsibility
Job Responsibility
  • Lead and develop a team of Site Reliability Engineer ICs, providing clear expectations, regular coaching, and career guidance across senior and principal levels
  • Own the operational health and reliability posture of Substrate services running in regulated environments
  • Drive change and influence across the org as you establish and drive SLOs, SLIs, and operational metrics
  • Lead effective incident management and post-incident reviews
  • Serve as an actively engaged on-call engineer (OCE) and participate in an on-call rotation
  • Own reliability, resilience, and disaster recovery, including driving and coordinating DR and game day exercises
  • Drive engineering led operational excellence at scale
  • Partner with engineering and product teams to embed reliability, security, and compliance considerations early in service design
  • Influence technical and operational strategy beyond your immediate team
  • Represent your team’s work clearly to leadership and partners
  • Fulltime
Read More
Arrow Right

Principal Site Reliability Engineering Manager

The Principal SRE Manager leads the team responsible for durable, high quality h...
Location
Location
Australia , Perth
Salary
Salary:
Not provided
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • equivalent experience
  • Proven experience leading teams through high severity production incidents in large, distributed systems
  • Demonstrated people leadership experience managing senior engineers or technical incident leaders
  • Strong understanding of incident management, reliability engineering, and live site operations at scale
  • Ability to drive clarity, accountability, and results in ambiguous, time critical situations
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
Job Responsibility
Job Responsibility
  • Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
  • Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Lead, coach, and develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer impacting events
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineering

Microsoft’s Azure Data engineering team is leading the transformation of analyti...
Location
Location
India , Bangalore
Salary
Salary:
Not provided
https://www.microsoft.com/ Logo
Microsoft Corporation
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree in computer science/Engineering/related fields or equivalent industry experience
  • 8+ years of experience in managing cloud scale services
  • Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Participate in system design consulting, platform management, and capacity planning
  • Create sustainable systems and services through automation and uplifts
  • Good communications skills, both verbal and written
  • Ability to meet Microsoft, customer and/or government security screening requirements are required for this role
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
Job Responsibility
  • Design, implement and ship distributed database management system offerings effectively providing customer value in terms of security, performance, reliability, usability and manageability while ensuring business goals are met
  • Collaborate effectively with the team, make appropriate systems tradeoffs in design and implementation, and ensure customer success in their use of the product
  • Embody our culture and values
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Manager

RUCKUS Networks is seeking an experienced Site Reliability Engineering (SRE) Man...
Location
Location
United States , Sunnyvale
Salary
Salary:
135600.00 - 200000.00 USD / Year
commscope.com Logo
CommScope
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 12+ years in Site Reliability Engineering (SRE), with 6+ years leading SRE, DevOps, or infrastructure teams
  • Proven experience mentoring engineering managers and developing leadership talent
  • Track record of transforming traditional operations or NOC teams into modern SRE organizations
  • Strong project management skills with Agile/Kanban experience and JIRA proficiency
  • Excellent communication skills, including executive-level presentations
  • Deep SRE expertise: incident management, on-call systems, monitoring, and reliability engineering
  • Infrastructure automation experience with Terraform, Kubernetes, Docker, and CI/CD pipelines
  • Cloud platform proficiency (GCP/AWS), including networking, security, and cost optimization
  • Monitoring and observability experience with Prometheus, Grafana, APM tools, and log aggregation
  • 24/7 operations experience with global team coordination and escalation management
Job Responsibility
Job Responsibility
  • Lead and develop engineering managers and technical operations engineers across India and APAC time zones
  • Build a collaborative team culture that emphasizes knowledge sharing, automation, and operational excellence
  • Mentor engineering managers to strengthen leadership capabilities and technical expertise
  • Set clear performance expectations and provide ongoing coaching for growth
  • Partner cross-functionally with Product, Security, Development, and global operations teams
  • Own 24/7 operational stability for India/APAC, including incident response, escalation, and post-incident reviews
  • Drive comprehensive incident management: alert handling, outage response, and root cause analysis (RCA/CAR)
  • Transform traditional operations into modern SRE practices using SLOs, error budgets, and reliability engineering
  • Implement robust monitoring and alerting with APM tools, dashboards, and automation frameworks
  • Lead technical project delivery with clear timelines, resource planning, and stakeholder communication
What we offer
What we offer
  • medical, dental, and vision plans
  • life and accidental death insurance
  • a 401(k) plan
  • participation in the Company’s Incentive Plan
  • eleven paid holidays in a full calendar year
  • two weeks of paid vacation (prorated based on start date)
  • other leave options
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer Manager

RemoteStar is looking to hire a Senior Site Reliability Engineering Manager on b...
Location
Location
United Kingdom of Great Britain and Northern Ireland , London
Salary
Salary:
Not provided
Remotestar
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Proven experience in a senior or lead SRE role, with a strong track record of building and maintaining highly reliable infrastructure and services.
  • Expertise in incident management, including incident response, resolution, and post-mortem analysis.
  • Proficiency in monitoring, alerting, and observability tools such as Prometheus, Grafana, ELK stack or Datadog.
  • Experience with cloud platforms such as AWS, Azure, or GCP, including infrastructure as code tools like Terraform or CloudFormation.
  • Strong scripting and automation skills, with proficiency in languages such as Python, Bash, or Go.
  • Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams in a remote environment.
  • Demonstrated leadership capabilities, with a passion for mentoring and developing team members.
Job Responsibility
Job Responsibility
  • Take full ownership of the production estate from both a technical and process perspective.
  • Provide a consistent smooth operation of live systems and drive all on-call support issues.
  • Design and operate a new incident tracking process to ensure root causes are found and remediated in a timely fashion by the development team.
  • Create and maintain high end monitoring and automation tooling.
  • Drive automation initiatives to streamline operational workflows and improve efficiency.
  • Develop and maintain tools, scripts, and dashboards to monitor system health, performance, and reliability.
  • Build a first class SRE team.
  • Through a combination of leading by example, coaching and mentoring, mould the team would want to have around you.
  • Provide leadership and guidance to the SRE team, fostering a culture of collaboration, innovation, and continuous improvement.
What we offer
What we offer
  • Dynamic working environment in an extremely fast-growing company
  • Work in an international environment
  • Work in a pleasant environment with very little hierarchy
  • Intellectually challenging, play a massive role in client’s success and scalability
  • Flexible working hours
  • Fulltime
Read More
Arrow Right

Engineering Manager - Observability & Reliability Engineering Obsession

We are looking for an Engineering Manager to join the OREO (Observability Reliab...
Location
Location
France , Paris
Salary
Salary:
Not provided
doctolib.fr Logo
Doctolib
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • At least 5+ years of software engineering or SRE experience, with a strong technical background in cloud-native environments (preferably AWS, GCP, and/or Kubernetes-based)
  • 3+ years of engineering management experience, leading technical teams (ideally SRE, platform, or infrastructure teams)
  • Deep understanding of observability tooling and architecture (Fluent Bit, OpenTelemetry, Loki, Elasticsearch, Prometheus, Thanos, Datadog)
  • Experience with infrastructure as code (Terraform, OpenTofu) and secrets management systems (Vault, AWS Secrets Manager)
  • Proven ability to balance technical depth with people leadership, able to mentor engineers, review technical designs, and guide architectural decisions
Job Responsibility
Job Responsibility
  • Lead, coach, and grow a team of Site Reliability Engineers, supporting their technical development and career progression
  • Create a culture of operational excellence, continuous improvement, and psychological safety within the team
  • Conduct regular 1:1s, performance reviews, and career development conversations
  • Recruit, onboard, and retain top SRE talent aligned with Doctolib's mission and values
  • Partner with SREs and senior engineers to define and evolve the observability strategy across the platform, focusing on logging, metrics, tracing, and alerting
  • Own the strategy and evolution of critical transversal services including HashiCorp Vault and Terraform Enterprise
  • Drive prioritization and roadmap planning for large-scale reliability and observability initiatives
  • Ensure alignment between team objectives and broader engineering and business goals
  • Advocate for and allocate resources toward reducing technical debt and improving developer experience
  • Own the team's on-call experience and contribute to the incident response processes, ensuring sustainable practices and continuous improvement
What we offer
What we offer
  • Free comprehensive health insurance for you and your children
  • Parent Care Program: receive one additional month of leave on top of the legal parental leave
  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
  • Work Council subsidy to refund part of sport club membership or creative class
  • Up to 14 days of RTT
  • A subsidy from the work council to refund part of the membership to a sport club or a creative class
  • Lunch voucher with Swile card
  • Fulltime
Read More
Arrow Right