Principal Cloud Engineer-Storage to scale Azure’s Fault Self-Healing and Failure Prediction systems. Own the end-to-end technical design and execution of the fault prevention ecosystem, spanning telemetry, ML models, automation, isolation logic, firmware interactions, and repair workflows, operating at hyperscale across millions of nodes.
Job Responsibility:
Design and build best-in-class fleet resiliency systems for storage devices at scale
Develop scalable live monitoring capabilities, fault detection and repair solutions
Design features for SSDs and Storage Accelerator firmware deployment
Lead collaboration projects with hardware, firmware and software teams that drive fault reduction
Build automation to drive repair efficiency for storage operations in the production fleet
Collaborate with suppliers to design reliable, high-performance, high-quality storage devices
Analyze data to identify, prototype, and drive the implementation of technical and process improvements to increase the predictability, agility, and quality of Azure systems
Actively support Azure service stakeholders.
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
M.S. in Computer or Electrical Engineering
12+ years of SSD firmware engineering development experience
8+ years of NVMe and PCIe experience
Deep expertise in SSD virtualization, reliability, fault analysis, and live‑site operations
Lead end‑to‑end design decisions across detection, prediction, mitigation, and repair of SSDs in hyperscale environments
Design component‑agnostic reliability frameworks that work across different components
Proven ability to build automation-heavy systems that operate safely at hyperscale.