The Azure Compute team builds a fault-tolerant, distributed system on top of commodity datacenter hardware to deliver infrastructure for hosting cloud applications in virtual machines (VMs). The team creates the experience that resources are limitless, elastic, and always available. This role is part of the Availability Platform team within Azure Compute, which focuses on ensuring every Azure virtual machine achieves a service-level agreement (SLA) of 99.99 percent or higher. Meeting this target requires innovative thinking, data-driven decisions, and intelligent automation. The team owns services that monitor the health of millions of Azure machines and the control plane services that make repair decisions.
We use artificial intelligence (AI) and machine learning to build predictive failure models that proactively migrate virtual machines before failures occur, reducing customer impact and improving platform resilience. We are also exploring generative AI to enhance diagnostics, automate root cause analysis, and accelerate incident resolution. Collaboration with data scientists and AI researchers enables us to continuously evolve the platform with smarter, self-healing capabilities.
As a Software Engineer II, you will design and deliver services architecture at hyperscale, work on incremental development with high quality, and adapt quickly to customer feedback while integrating advanced AI technologies. Microsoft’s mission is to empower every person and every organization on the planet to achieve more.
Job Responsibilities:
Partners with appropriate stakeholders to determine project requirements
Leads the design and architecture of change management features and services in Azure Compute
Identifies dependencies and authors design documents for features and services
Develops high-quality, extensible, maintainable code and coaches others
Supports livesite as Designated Responsible Individual (DRI)
Proactively seeks new knowledge and adapts to new trends
Collaborates with data scientists and ML engineers to design and integrate predictive models
Leads initiatives to embed AI-driven diagnostics and root cause analysis into availability services
Drives the adoption of generative AI tools to automate documentation, incident summaries, and engineering workflows
Partners with platform teams to build intelligent observability pipelines
Evaluates and integrates large-scale AI models into control plane services
Requirements:
Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, Rust, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Candidates must be able to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Nice to have:
Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, Rust, C++, C#, Java, JavaScript, or Python OR Master's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, Rust, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability and passion for designing and building highly available distributed systems at scale
Ability to exercise sound judgment in ambiguous situations
Experience with agile methodologies and willingness to adopt them