Join the team that keeps Microsoft 365 running in sovereign cloud environments where reliability, scalability, and security are non-negotiable. You'll work on distributed systems at massive scale, automating operations, building disaster recovery capabilities, and engineering solutions that eliminate toil and improve service delivery. Bring your expertise in large-scale systems and help us set the standard for sovereign cloud reliability.
Job Responsibilities:
Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, taking appropriate action to mitigate impact, and deploying appropriate fixes to resolve root cause(s)
Independently writes code or scripts that automate the performance of scalable operations processes across components and features of products operating at scale
Designs, develops, and maintains telemetry pipelines and monitoring tools that detail operations metrics of product components and features operating at scale
Independently uses existing tools and/or models to troubleshoot problems or flaws affecting the availability, security, reliability, performance, and/or efficiency of components and features
Independently creates, tests, and deploys changes through a safe deployment process (SDP) to enhance code quality and improve the observability, security, reliability and operability of one or more platforms, systems, or products operating at scale
Shares insights and best practices via documented artifacts that can be applied to improve development and operations of system, platform, or product components and features
Engages with product engineering teams by participating in code/design reviews, regular meetings, on-call rotations, and incident responses throughout product development and operations cycles
Requirements:
Master's Degree in Computer Science or related technical field AND 3+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR Bachelor's Degree in Computer Science or related technical field AND 5+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR equivalent experience
2+ years technical experience working with large-scale cloud or distributed systems
Candidates must have an active TS/SCI clearance and be willing and eligible to upgrade to TS/SCI (with polygraph)
This role will require candidates to maintain the TS/SCI (with polygraph) clearance
Ability to meet Microsoft, customer and/or government security screening requirements is required pre-offer and post-hire for this role
Nice to have:
Passionate about distributed systems and working with highly scalable services
Enjoys new technological challenges and is motivated to solve them
Excited about making better software and continuously improving the development, integration, and deployment processes
Self-starter who thrives in a bottom-up, fast-paced, highly technical environment
Effective collaborator, experienced in creating technical partnerships across teams
Committed to ensuring exceptional customer satisfaction through technical excellence