Microsoft AI operates one of the world’s most advanced AI training infrastructures, featuring multi-gigawatt clusters spanning tens of thousands of high-performance GPUs, ultra-low-latency NVLink/NVSwitch networks, and innovative liquid-cooling systems. Our team is seeking a Member of Technical Staff, Hardware Health, to ensure these systems deliver sustained reliability, performance, and availability across exascale-class deployments. We work closely with research, hardware, datacenter, and platform engineering teams to develop predictive health models, failure detection frameworks, and autonomous remediation systems that keep our AI clusters operating at frontier scale.
Job Responsibilities
Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters (NVL16/NVL72/GB200+ scale).
Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.
Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies.
Define system health KPIs (e.g., NIS/RIS, MTBF, failure domain analysis) and integrate them into real-time observability platforms.
Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters.
Drive automation in health management so that only the top 5% of anomalies require manual intervention.
Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability.
Requirements
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent).
Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies.
Proficiency in hardware telemetry, diagnostics, or failure analysis tools.
Experience with exascale-class systems or cloud-scale AI clusters.
Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance.
Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design.