The Site Reliability Engineer (SRE) for Azure xDPU Storage Team – Hardware Enablement is responsible for ensuring the reliability, availability, and performance of Fungible DPU-based Azure Storage devices as they integrate next-generation networking and compute offload hardware. This role focuses on the safe bring-up, validation, and scaled production operation of DPU-enabled platforms, bridging hardware, firmware, and software reliability and maintenance.
Job Responsibilities:
Own end-to-end reliability for Azure Storage hardware running in on-prem lab environments
Partner with silicon, firmware, BIOS, networking, and OS teams to enable and validate DPU hardware for specific storage use cases
Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for DPU-accelerated storage scenarios within our lab and pre-prod environments
Lead live-site incident response and mitigation for hardware-, firmware-, or DPU-related issues, including deep root-cause analysis across hardware/software boundaries within our lab and pre-prod environments
Build automation for provisioning, configuration, validation, canarying, rollback, patching, and recovery of DPU-enabled Azure Storage systems within our lab and pre-prod environments
Develop reliability validation strategies, including stress, fault-injection, and chaos testing for DPU hardware enablement and management
Create and maintain operational runbooks, diagnostics, telemetry, and health models specific to Fungible DPU platforms within our lab and pre-prod environments
Drive improvements in observability and alerting by extending Azure Monitor and internal systems with DPU- and hardware-level signals
Requirements:
Associate's Degree in Computer Science, Information Technology, or related field OR Bachelor's Degree in Computer Science, Information Technology, or related field OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
Bachelor's Degree in Computer Science, Electrical Engineering, Computer Engineering, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Experience operating large-scale, distributed systems in a lab/validation environment
Experience working close to hardware, including networking, storage, or accelerator technologies such as SmartNICs, DPUs, or offload engines
Proficiency in one or more programming or scripting languages (C++, C#, Python, Go, or PowerShell), with experience reading lower-level system code
Hands-on experience with Microsoft and Azure lab infrastructure and live-site operations
Demonstrated understanding of networking, operating systems, and performance characteristics of I/O-intensive distributed systems
Direct experience with Fungible DPU technology or similar SmartNIC/DPU platforms
Existing hands-on experience working in Microsoft MLS (Microsoft Lab Services) or equivalent internal lab environments, including lab-based hardware validation, performance testing, and bring-up workflows
Experience enabling new hardware platforms or accelerators in a Windows/mixed OS environment
Familiarity with firmware lifecycles, hardware validation, and silicon bring-up processes
Experience with infrastructure-as-code and CI/CD pipelines (ARM/Bicep, Terraform, Azure DevOps)