Microsoft Silicon, Cloud Hardware, and Infrastructure Engineering (SCHIE) is the team behind Microsoft’s expanding Cloud Infrastructure and is responsible for powering Microsoft’s “Intelligent Cloud” mission. SCHIE delivers the core infrastructure and foundational technologies for Microsoft's more than 200 online businesses, including Bing, MSN, Office 365, Xbox Live, Teams, OneDrive, and the Microsoft Azure platform, globally through our server and data center infrastructure, security and compliance, operations, globalization, and manageability solutions. Our focus is on smart growth, high efficiency, and delivering a trusted experience to customers and partners worldwide, and we are looking for a passionate Senior HW Quality Engineer to help achieve that mission. As Microsoft's cloud business continues to grow, the ability to deploy new offerings and hardware infrastructure on time, in high volume, with high quality, and at the lowest cost is of paramount importance. To achieve this goal, the Hardware, Infrastructure Management, and Fundamentals Engineering (HIFE) team is instrumental in defining and delivering operational measures of success for hardware manufacturing and in improving the planning process, quality, delivery, scale, and sustainability related to Microsoft cloud hardware.
Job Responsibility:
Lead deep‑dive investigations into complex hardware failures across AI/GPU platforms
Perform single‑node and rack‑level validation to confirm hardware remediation effectiveness before fleet re‑entry
Analyze large volumes of hardware telemetry, logs, and diagnostics data to identify systemic failure patterns
Define and drive lightweight diagnostics and telemetry workflows
Partner with diagnostics and platform teams to enable out‑of‑band telemetry collection
Own fleet‑level quality assessments for AI and GPU deployments
Drive improvements to failure detection latency, root cause attribution accuracy, and preventative quality controls
Sync with firmware and electrical engineering teams on corrective actions
Collaborate with supply chain and spares teams
Work with data center operations leadership to ensure solutions scale globally and align with operational SLAs
Produce and maintain technical documentation, failure mode and effects analyses (FMEA), and quality playbooks
Requirements:
Doctorate Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 2+ years technical engineering experience, OR
Master's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 4+ years technical engineering experience, OR
Bachelor's Degree in Mechanical Engineering, Materials Engineering, Reliability Engineering, Electrical Engineering, or related field AND 5+ years technical engineering experience
12+ years relevant technical engineering experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Nice to have:
Master’s degree in Electrical Engineering, Computer HW, or System Engineering
Leadership skills and ability to collaborate with diverse teams and drive a call to action
10+ years of experience working with modern server architectures and/or their subsystems, including GPU, CPU, AI hardware, and memory, and with methods for root cause analysis and debugging