Site Reliability Engineer

Location: United States, Redmond
Salary: 100600.00 - 199000.00 USD / Year
Job Posted: March 25, 2026

Job Description

The Silver Edge team brings the power of Azure to the edge for our customers, tackling some of the most complex and mission-critical challenges in cloud and edge computing. Our mission is to provide stellar customer service so that their mission can succeed. We support the new Azure Local product that brings cloud computing to local hardware. We’re looking for a new team member who relishes solving complex, ambiguous problems at scale and is passionate about building resilient systems that matter. As a Site Reliability Engineer on the Silver Edge team, you will build out and ensure the dependability of Azure Local services in three different sovereign clouds. You will be required to solve tough technical problems and thrive in dynamic, sometimes chaotic environments. In this role, you will accelerate your career, deepen your expertise in sovereign cloud solutions, and help implement the future of Azure edge solutions. We offer flexible work arrangements, including partial remote options, to support your best work.

Job Responsibility

  • Support customer deployments and use of Azure Local and Azure Local disconnected operations
  • Maintain Azure Service reliability including deployment, availability, security, performance and customer satisfaction for sovereign environments
  • Leverages technical expertise in cloud technologies and specific products, as well as objective insights drawn from analyses of production telemetry data, to suggest changes or add-ons to product features or automation to improve the availability, security, quality, observability, reliability, efficiency, and performance of product components or features supported by their team
  • Engages with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations and incident responses throughout product development and operations cycles
  • Utilizes technical knowledge of systems/platforms and insights drawn from product engineering teams, security best practices, artificial intelligence (AI)/machine learning (ML), and telemetry analyses to suggest potential improvements in code base and designs across components and features of one or more products
  • Leverages technical expertise and telemetry analysis alongside advanced artificial intelligence (AI) and machine learning (ML) algorithms across a range of components and/or features to identify patterns and opportunities to implement configuration and data changes for one or more platforms, systems, or products in production using code, tooling, and automation
  • Independently writes code or scripts that automate the performance of scalable operations processes (e.g., monitoring, alerting, deploying products and updates) across components and features of products operating at scale
  • Shares insights and best practices via documented artifacts that can be applied to improve development and operations of system, platform, or product components and features by participating in code/design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced SREs and members of product engineering teams
  • Develops alerts and instrumentation across components and features to monitor product capacity, related security risk, and resource demands and analyze telemetry data using existing capacity planning models
  • Draws insights from analyses of capacity and resource data to optimize component and feature code to manage resources and capacity across limited range of use conditions and system parameters
  • Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, security, reliability, performance, and/or efficiency of components and features, leveraging artificial intelligence (AI) and machine learning (ML) capabilities
  • Proposes solutions that will resolve and prevent recurring issues and brings them to the attention of their Site Reliability Engineering (SRE) and/or product engineering teams
  • Utilizes insights from performance and resource monitoring tools to identify whether there is a need to optimize the efficiency of component and feature code, or if changes to compute resources are required
  • Models the predicted effect of changes to code and/or compute resources across components or features to document the efficacy of proposed solutions
  • Proposes changes and drives implementation of solutions to identified performance and resource challenges
  • Identifies opportunities to leverage existing tools and automation, including the safe deployment process (SDP), to enable product engineering teams to increase the velocity at which they can reliably and safely implement changes in production
  • Monitors the effects of changes across multiple components or features within a single platform or system
  • Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s)
  • Notifies product teams and owners of major customer-impacting issues and escalates resolution of highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed
  • Communicates details and resolutions through post-mortem reports and review meetings
  • Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics (e.g., availability, reliability, performance, efficiency) of product components and features operating at scale
  • Independently performs analyses using existing tools and/or models to identify insights and shares them with product engineering teams to directly contribute to improvements in product development and/or operations
  • Monitors the impact of changes on operations metrics (e.g., Time-to-X)
  • Demonstrates expertise in distributed systems design, interactions between cloud technology layers and components, common dependencies at scale, and the code that defines infrastructures
  • Can identify and recommend optimal configurations of cloud technology solutions and modify the code base that defines systems or cloud technologies to improve the security, quality, reliability, and operability of supported products with minimal guidance from other engineers
  • Researches and maintains an awareness of industry trends, advances in cloud technologies, new tools, and/or processes for maintaining and improving product availability, security, quality, observability, reliability, efficiency, and/or performance
  • Contributes to the implementation of new solutions within their team by identifying ways they can be applied to solve persistent problems
  • Develops technical expertise in the code, features, and operations of specific products as required to identify opportunities to improve product availability, security, quality, observability, reliability, efficiency, and/or performance
  • Actively participates in on-boarding, code/design reviews, and regular meetings with engineering teams that develop and/or manage those products

Requirements

  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Security Clearance Requirements: Candidates must be able to meet Microsoft, customer and/or government security screening requirements for this role
  • The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph
  • Ability to meet Microsoft, customer and/or government security screening requirements is required pre-offer and post-hire for this role
  • This position requires successful verification of the stated security clearance to meet federal government customer requirements
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • This position requires verification of U.S. citizenship due to citizenship-based legal restrictions

Nice to have

  • Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • 2+ years technical experience working with large-scale cloud or distributed systems

Similar Jobs for Site Reliability Engineer

8 matching positions

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
  • Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
  • Familiar with PaaS and IaaS - VMs, Storage, EventHub, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), CosmosDB, SQL Server, IoT Hub, Databricks, KeyVault, Datalake
  • Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
  • Understand the concepts behind tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
  • Understand the concept of container orchestration platforms (e.g. Kubernetes)
  • Understand the concept of scripts: PowerShell, Python
  • Understand the difference between NoSQL and SQL databases, and how to maintain them
Job Responsibility
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
  • Fulltime

Site Reliability Engineer

Location: South Africa, Johannesburg
Salary: Not provided
Company: Nintex
Expiration Date: Until further notice
Requirements
  • You provide guidance on infrastructure architecture and contribute to high-quality and successful product releases.
  • You contribute to your team and domain through successfully leading and consistently delivering on projects of ambiguous scope, high complexity, and critical business impact.
  • You contribute to relevant guilds, practice forums and other initiatives to improve Nintex’s DevOps and SRE discipline.
  • You have an in-depth understanding of distributed systems architecture, as well as monitoring and observability practices and tools.
  • You quickly resolve priority infrastructure issues and help other technical team members or Product Managers understand how to avoid them in the future.
  • You provide detailed estimates for work items you propose or are assigned.
  • You assist in decision-making around tooling, automation practices, and testing solutions.
  • You stay up-to-date with technology trends and use this knowledge to help your team and the broader Engineering practice.
  • You run Nintex infrastructure with IaC tools (such as Terraform) and GitHub Actions for automation, containerize our environments (Kubernetes), and leverage cloud technologies to meet our goals
  • You build monitoring that alerts on symptoms rather than outages using tools like Prometheus, Grafana, Alertmanager and PagerDuty
Job Responsibility
  • You are highly skilled and sufficiently experienced in Nintex DevOps tools and processes to own a long-term program or technology such as Kubernetes, etc.
  • You write scripts, tools and utilities that support and integrate with delivery pipelines and you integrate telemetry where appropriate.
  • You are called into incidents and bring trusted knowledge in your platform domain.
  • You debug and fix infrastructure issues on production environments quickly using the relevant tools and guidelines to prevent recurrence.
  • You build, promote and support infrastructure patterns and practices within Nintex.
  • You provide coaching/mentoring to other Engineers on the team
  • You lead or contribute to post-mortems for incidents, including root cause analysis and identification of preventative and remedial actions.
  • You continuously monitor our platform performance and take immediate action to improve it
  • You review and advise on appropriate design patterns to solve automation and infrastructure problems without creating technical debt.
  • You design and build complex infrastructure components for distributed systems such as Kubernetes.
What we offer
  • Global Gratitude and Recharge Days
  • Flexible, paid time off policy
  • Employee wellness programs and counseling resources
  • Meaningful peer recognition and awards
  • Paid parental leave
  • Invention/patenting assistance
  • Community impact, paid volunteer time, and opportunities
  • Intercultural learning and celebration
  • Multiple tools through which to learn and grow, and an incredible global community

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...
Location: Hong Kong, Hong Kong
Salary: 1200000.00 HKD / Year
Company: Hunter Bond
Expiration Date: Until further notice
Requirements
  • Genuine passion for Linux & open source
  • Excellent knowledge of Python
  • Use of CI/CD, Docker, Ansible, Chef, Puppet
  • Knowledge of large-scale storage systems (on-prem)
Job Responsibility
  • Help architect resilient, multi-petabyte storage solutions & build new data centres
  • Automate anything and everything with Python & config tools
  • Innovate whilst bringing in new ideas
What we offer
  • Flexible hours/work options
  • Working in one of the world’s most elite teams
  • Invest heavily in cutting-edge and next-gen tech
  • Technologists only report to other technologists
  • Brand new skyline Manhattan office
  • Start-up style environment
  • Fulltime

Site Reliability Engineer

As a Staff Software Engineer, you will play a key role in designing, building, a...
Location: United States, San Jose
Salary: 120500.00 - 243000.00 USD / Year
Company: Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Minimum of 5 years of hands-on experience in Infra Ops, Dev Ops, or Site Reliability Engineering (SRE)
  • Proficiency with Linux systems, especially Debian-based distributions
  • Strong experience with cloud platforms such as AWS and GCP
  • Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
  • Solid programming skills in Python and/or Golang
  • Deep understanding of containerization (Docker, Container) and orchestration tools (AWS EKS, GCP GKE)
  • Experience with GitOps workflows
  • Proven track record in implementing and maintaining CI/CD pipelines
  • Strong background in security and familiarity with security programs
  • Experience with monitoring and logging tools (Prometheus, Grafana, ELK)
Job Responsibility
  • Enhance Infrastructure as Code (IAC) and enforce best practices
  • Optimize cloud infrastructure for scalability, security, and cost-effectiveness
  • Develop internal tools to support and streamline cloud platform operations
  • Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
  • Address container image vulnerabilities and standardize remediation processes
  • Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
  • Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
  • Troubleshoot complex production issues to ensure system reliability and customer satisfaction
  • Fine-tune distributed systems such as Apache Kafka and Cassandra
  • Collaborate with development, security, and operations teams to align infrastructure with application needs
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...
Location: United Kingdom, London
Salary: 150000.00 GBP / Year
Company: Hunter Bond
Expiration Date: Until further notice
Requirements
  • Genuine passion for Linux & open source
  • Excellent knowledge of Python
  • Use of CI/CD, Docker, Ansible, Chef, Puppet
  • Knowledge of large-scale storage systems (on-prem)
Job Responsibility
  • Help architect resilient, multi-petabyte storage solutions & build new data centres
  • Automate anything and everything with Python & config tools
  • Innovate whilst bringing in new ideas
What we offer
  • Flexible hours/work options
  • Working in one of the world’s most elite teams
  • Invest heavily in cutting-edge and next-gen tech
  • Technologists only report to other technologists
  • Brand new skyline Manhattan office
  • Start-up style environment
  • Fulltime

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...
Location: Canada, Montreal
Salary: 200000.00 CAD / Year
Company: Hunter Bond
Expiration Date: Until further notice
Requirements
  • Genuine passion for Linux & open source
  • Excellent knowledge of Python
  • Use of CI/CD, Docker, Ansible, Chef, Puppet
  • Knowledge of large-scale storage systems (on-prem)
Job Responsibility
  • Help architect resilient, multi-petabyte storage solutions & build new data centres
  • Automate anything and everything with Python & config tools
  • Innovate whilst bringing in new ideas
What we offer
  • Flexible hours/work options
  • Working in one of the world’s most elite teams
  • Invest heavily in cutting-edge and next-gen tech
  • Technologists only report to other technologists
  • Brand new skyline Manhattan office
  • Start-up style environment
  • Fulltime

Site Reliability Engineer

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location: United States, Redmond
Salary: 102100.00 - 202200.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • 4+ years technical experience in software engineering, network engineering, or systems administration
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Ability to obtain and maintain favorably adjudicated Tier 3 (T3) background investigation
  • Ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Own reliability and operational health for one or more Substrate components or services in highly regulated environments
  • Serve as an actively engaged on-call engineer (OCE), participating in an on-call rotation and independently responding to incidents for owned services
  • Respond to, diagnose, and resolve production incidents with minimal supervision
  • Design and implement automation to reduce operational toil and improve service stability
  • Develop and maintain monitoring, alerting, and telemetry to support SLOs and operational metrics
  • Lead post-incident reviews for owned incidents, focusing on root cause analysis and durable fixes
  • Collaborate with software engineering teams to embed reliability and operability into service design
  • Write and maintain production-quality code and automation that improves reliability, scalability, and operational efficiency
What we offer
  • Benefits and other compensation may be eligible
  • additional benefits and pay information available at https://careers.microsoft.com/us/en/us-corporate-pay
  • Fulltime

Site Reliability Engineer

Microsoft is a company where passionate innovators come to collaborate, envision...
Location: India, Bangalore
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 3+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Own the end-to-end readiness of Event Stream across Azure regions, including onboarding new regions, driving deployment automation, and ensuring consistent, secure, and compliant service rollout
  • Work closely with platform, infrastructure, and partner teams (e.g., Event Hubs, Kusto, Fabric platform) to deliver resilient, low-latency streaming experiences on a global scale
  • Play a key role in advancing our reliability posture, improving availability, monitoring, and incident response across regions
  • Build strong observability, telemetry, and automated recovery mechanisms to meet high availability and SLA targets
  • Region Build-out & Deployment: Onboard new regions, drive deployment automation, and ensure consistent service configuration
  • Reliability & SRE: Improve availability, resiliency, and incident response; own service health across regions
  • Observability & Operations: Enhance telemetry, monitoring, alerting, and troubleshooting capabilities
  • Cross-team Collaboration: Partner with platform and infra teams to unblock dependencies and ensure smooth rollout
  • Production Excellence: Drive root-cause analysis, repair items, and continuous improvement on service reliability
  • Fulltime