The Substrate Fleet Health team is engineering the future of cloud reliability and of efficient health and capacity management for the Substrate Fleet. We are a high-impact team driving innovation in hardware health, fleet lifecycle management, intelligent repair systems, and proactive capacity optimization to ensure Microsoft’s hyperscale infrastructure operates at peak performance.
Our mission is bold:
Maximize fleet availability through proactive detection and mitigation of hardware issues.
Accelerate repair intelligence with AI-driven insights and automation, reducing repair times from hours to seconds.
Optimize spare machine utilization and capacity forecasting across global datacenters, unlocking millions in cost savings and enabling sustainable growth.
Enhance fleet lifecycle management by predicting failures, improving component health, and reducing stranded capacity.
We are building next-generation solutions like RepairBox vNext, Fleet Health Copilot, Unified Spare Pool, and Smart Recovery Services: systems that integrate telemetry, predictive analytics, and automation to transform how cloud infrastructure is managed and scaled.
Job Responsibilities:
Lead architecture and design for intelligent repair and fleet optimization systems, including RepairBox vNext and Fleet Copilot.
Drive development of AI-powered telemetry pipelines and automation frameworks for predictive diagnostics and lifecycle management.
Establish capacity forecasting and spare pool optimization strategies across global datacenters.
Ensure security, scalability, and operational excellence across all solutions, including live-site readiness and DRI pathways.
Collaborate with Azure, vendor, and platform teams to align technical solutions with business goals and reliability standards.
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Nice to have:
Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Expertise in distributed systems, cloud infrastructure, and large-scale automation.
Solid background in AI/ML-driven telemetry, anomaly detection, and predictive analytics.
Experience with capacity planning, hardware lifecycle management, and hyperscale reliability preferred.