This list contains only the countries for which job offers have been published in the selected language (e.g., in the French version, only job offers written in French are displayed, and in the English version, only those in English).
Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud native supercomputers used for frontier AI training, scientific computing, and large scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation.
Job Responsibility:
Serve as the technical authority and DRI for InfiniBand and GPU interconnect fabric operations across large scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
Lead and orchestrate complex, high severity fabric incidents end to end, including detection, triage, mitigation, recovery, and root cause analysis, making high impact decisions under ambiguity
Perform deep, multi layer systems debugging across InfiniBand, Subnet Manager, GPU interconnect, PCIe, GPUs, firmware, drivers, and OS layers to identify true root causes at fleet scale
Drive operational excellence and systemic prevention by identifying recurring failure patterns, defining reliability models and failure domains, and authoring authoritative TSGs, playbooks, and escalation frameworks adopted across teams
Architect and drive automation, telemetry, diagnostics, and tooling that materially improve detection, observability, debuggability, and mean time to mitigation, raising the operational bar for interconnect fabrics across the platform
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
6+ years of experience operating large‑scale distributed systems, high‑performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
Demonstrated ownership of mission‑critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
Strong Linux systems knowledge with experience debugging low‑level infrastructure issues across operating systems, drivers, and services
Proven ability to reason across hardware, firmware, drivers, and software stacks to diagnose and resolve complex production issues
Nice to have:
Bachelor's Degree in Computer Science OR related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, OR Python
OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python