Microsoft has an exciting opportunity for a Senior Site Reliability Engineer (SRE) to join the Azure Silver and Sovereign Team as part of the Azure Data Transfer (ADT) team. Azure Data Transfer enables secure access and data transfer between enclaves and supports multiple transfer and access patterns for highly regulated industries. In this role, you will apply SRE principles—availability, latency, performance, efficiency, change management, and incident response—to help ensure ADT is dependable at scale. We are looking for engineers to join a fast-paced team and solve complex reliability challenges in mission-critical distributed systems spanning data transmission across clouds. Our team works across all facets of isolated system engineering and is deeply involved in defining and improving service health through SLIs/SLOs and error budgets, building automation to reduce toil, strengthening observability (logs, metrics, traces), reducing systemic latency, validating and transforming data, and optimizing throughput and capacity. You will build, deploy, and operate systems that enable a broad set of Azure services to be consumed by customers in highly secured and regulated environments, meeting strict security policy and assurance requirements for public and private sector customers.
Job Responsibilities:
Owns reliability architecture and end-to-end service understanding (dependencies, failure modes, and customer journeys) for distributed systems at scale
Defines and improves service health via SLIs/SLOs, error budgets, and well-defined operational readiness criteria
Drives cross-team reliability reviews and recommends design changes, runbooks, and safe rollout/rollback strategies that improve availability, latency, performance, and efficiency while managing cost
Maintains deep, current expertise in cloud reliability practices and the evolving technology landscape
Drives adoption of new platform capabilities and operational patterns (e.g., progressive delivery, resilience testing, chaos engineering where appropriate)
Mentors engineers through design reviews, incident walkthroughs, and knowledge sharing to raise the reliability bar across related services
Implements reliable, scalable, and high-performance changes using SRE practices (progressive delivery, feature flags where applicable, safe rollouts/rollbacks)
Owns implementation and rollback plans, validates operational readiness, and reduces toil through automation, self-healing, and standardized playbooks
Leverages telemetry and production signals to identify reliability risks and recurring failure patterns, then ships configuration changes, code fixes, or automation to address root causes
Expands infrastructure-as-code and operational tooling so teams can manage platforms and services safely and repeatably through code and policy
Builds and improves observability (metrics, logs, traces, dashboards, alerts) and uses it to detect, diagnose, and prevent incidents
Defines actionable alerting, reduces noise, and ensures instrumentation supports SLO reporting and rapid troubleshooting
Develops automation to validate telemetry pipelines and to enable automated mitigation and safer incident response
Participates in on-call rotations and leads response for complex, high-impact incidents by establishing incident command, assessing impact, coordinating responders, and driving mitigations to restore service within SLOs
Produces and contributes to blameless postmortems with corrective and preventative actions (CPAs), tracks them to completion, and implements automation and guardrails to prevent recurrence
Applies secure-by-design and compliance requirements to operations, monitoring, and automation (least privilege, auditability, change control, and data handling)
Partners with security, privacy, and compliance teams to identify gaps, prioritize fixes, and implement automated controls and detection to prevent repeated violations
Requirements:
Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Security Clearance Requirements: Candidates must be able to meet Microsoft, customer, and/or government security screening requirements for this role
The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph
The ability to meet Microsoft, customer, and/or government security screening requirements is required pre-offer and post-hire for this role
This position requires successful verification of the stated security clearance to meet federal government customer requirements
This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
This position requires verification of U.S. citizenship due to citizenship-based legal restrictions
Nice to have:
Bachelor's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, service engineering, or systems engineering OR equivalent experience
3+ years technical experience working with large-scale cloud or distributed systems
Experience building automation with Ansible and developing/operating CI/CD pipelines (e.g., Azure DevOps, GitHub Actions) to deliver reliable, repeatable deployments
Expertise in problem solving and analyzing distributed systems and critical production service environments
Expertise in Linux (specifically Rocky 9, Red Hat, Mariner, or similar), including throughput management, troubleshooting, and security hardening