Are you a customer-obsessed, AI-curious problem-solver who thrives in an inclusive, collaborative global team? Join Engineering Operations (EngOps) – the organization driving operational excellence across the Microsoft Cloud to strengthen quality, reliability, security, and customer trust. As part of EngOps, you’ll design solutions that prevent issues before they happen, embed AI-powered automation, and turn signals into actions that deliver measurable customer impact. Our culture of empowerment, inclusion, and growth mindset defines how we work.
Azure Reliability is driving the transformation to AI-powered operations by building scalable ML infrastructure that enables autonomous, reliable, and secure cloud systems. We are looking for candidates who can combine deep technical expertise in MLOps with a proven ability to deliver measurable business impact through continuous learning, policy-driven governance, and responsible AI practices. Success in this role means advancing operational autonomy, quality, and security, while fostering collaboration and accountability across teams.
Every day, customers stake their business and reputation on our cloud. You can help #EngOps keep them secure, resilient, and ready. This role will require a minimum of three days in office.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Job Responsibilities:
Partner across multiple product groups to apply subject-matter expertise in distributed systems design practices, interactions between cloud technology layers and components, basic dependencies at scale, and the code that defines infrastructures
Lead by example and mentor others to produce extensible and maintainable code used across products
Develop and evangelize insights, best practices, and standards that can be applied to improve system, platform, and/or product development and operations across the business
Drive continuous improvements in the architecture, code, features, operations and comprehensive use scenarios of products by leveraging end-to-end technical expertise
Make improvements to the product fundamentals and architecture, share knowledge and code, always looking for ways to make what we build useful to multiple teams and products
Demonstrate end-to-end expertise in distributed systems design and the interactions between cloud technology layers
Provide technical leadership in test maturity reviews, static analysis reviews, meetings, on-call rotations, and incident responses throughout product development and operations cycles
Provide deep business and technical expertise as required to resolve major incidents
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Awareness of, and ability to reason about, modern distributed software design patterns and cloud systems architecture, including microservices, containers, load balancing, queuing, and caching
Experience with C#/Java/C/C++/Golang
Experience in building, shipping, and operating reliable solutions
Distributed Cloud Systems - Demonstrated experience designing and operating large-scale distributed platforms where reliability, safety, and governance are first-class concerns. Deep understanding of cloud-native architectures, CI/CD pipelines, infrastructure as code (IaC), identity, security, and policy-as-code
Platform Engineering - Background in platform engineering with a focus on internal developer platforms, shared services, and ecosystems used by many teams
Preventative Engineering - Experience with incident root cause analysis and a proven ability to build preventative, shift-left engineering systems such as policy engines, PR / build validators, scanners, or automated release gates that eliminate entire classes of failures
AI and ML Skills - Hands-on experience building production-grade AI/ML or LLM-powered systems, including event-driven architectures, agent-based workflows, or intelligent automation embedded into developer workflows (IDE, PRs, CI/CD)
Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience