The Azure CXP team’s mission is to transform Microsoft Cloud customers into fans. Through our deep engineering engagements with customers and teams across Microsoft, we analyze and amplify customer needs and drive the vision to improve Cloud quality, security, and reliability. Our culture of growth mindset and empowerment is central to who we are and how we work. We are the Azure Reliability team, a multidisciplinary engineering organization committed to making Azure the world’s safest and most reliable cloud. For Azure’s most critical services and products, we apply a Site Reliability Engineering (SRE) approach. Our software engineers work closely with product teams to enhance availability, reliability, observability, and operability across our planet-scale systems.
Job Responsibilities:
Defining system reliability goals through Service Level Objectives (SLOs)
Enhancing production posture with targeted improvements in observability and operability (telemetry, alerting, incident/change management, safe deployment practices)
Building reusable automation and processes that help multiple teams meet their reliability goals
Influencing product architecture and roadmaps to ensure customer-experienced reliability is a core design principle
Contributing directly to product code to achieve reliability outcomes
Leveraging AI to proactively detect anomalies, predict incidents, and automate operational workflows, scaling reliability efforts across complex systems
Providing technical leadership across multiple Azure teams
Mentoring others on SRE principles, practices, and tools as well as AI usage to boost software development productivity
Designing and developing large-scale distributed software services and solutions
Delivering “best-in-class” engineering by ensuring services are modular, secure, reliable, testable, diagnosable, observable, and reusable
Collaborating with internal and external partners to support team goals
Balancing pragmatism with vision while driving continuous improvements in process and codebase
Building automation to prevent or remediate service issues before they impact users
Driving innovation in large-scale operations by applying cutting-edge AI tools and techniques to reduce operational toil and scale reliability engineering across complex systems
Gaining a working understanding of Microsoft businesses and contributing to cohesive, end-to-end user experiences
Requirements:
Bachelor's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR Master's Degree in Computer Science or related technical field AND technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR equivalent experience working with large-scale distributed systems (e.g., cloud computing providers or SaaS services, ideally with millions or billions of users) or similarly complex environments
Awareness of, and ability to reason about, modern distributed software design patterns and cloud systems architecture, including microservices, containers, load-balancing, queuing, caching
Experience with C#/Java/C/C++/Golang
Experience in building, shipping and operating reliable solutions
Nice to have:
Familiarity with modern distributed software design patterns and cloud systems architecture, including microservices, containers, load balancing, queuing, caching
Experience as a technical lead or engineering manager
Experience working on large and unfamiliar codebases (millions of lines of code)
Experience with open-source projects, Kubernetes, Linux, and containers
Proven track record in building, shipping, and operating reliable solutions
Proficiency in programming languages like C#/Java/Python
Experience with data technologies (SQL/NoSQL/etc.)
Experience with Azure is a plus
Experience in AI adoption with tools like GitHub Copilot, Azure OpenAI and custom copilots to streamline development and reduce toil