The Windows and Devices mission is to create innovative, trusted, and open products focused on people, showcasing Microsoft’s best and empowering everyone to achieve more. Microsoft Devices designs and manufactures premium hardware like Surface and Xbox, innovating throughout the supply and manufacturing process. Within the Windows and Devices group, Microsoft Devices Operations manages supply chains, product engineering, manufacturing, and services to deliver iconic products. We are looking to hire a Senior Site Reliability Engineer to develop and operate next-generation, world-class services that put these iconic products into consumers’ hands. You will be instrumental in moving Surface and Xbox devices fresh from the factory floor, through global transit networks, and ultimately onto our customers’ doorsteps, fulfilling their excitement and anticipation.
Job Responsibility:
Independently designs, creates, tests, and deploys changes through a safe deployment process (SDP) to enhance code quality and improve the observability, security, reliability and operability of platforms, systems, and products at scale
Leverages technical expertise in the infrastructure of cloud technologies and specific products to advocate for, or directly contribute to, automation that improves the availability, security, quality, observability, reliability, efficiency, and performance of related sets of products
Leverages end-to-end technical expertise and telemetry analysis alongside advanced artificial intelligence (AI) and machine learning (ML) algorithms to identify patterns and opportunities to implement configuration and data changes
Shares insights and best practices via documented artifacts that can be applied to improve development and operations across related sets of systems, platforms, and/or products
Writes code, scripts, systems, and/or artificial intelligence (AI)/machine learning (ML) platforms to automate operations tasks at scale
Develops, maintains, and implements capacity planning models and monitoring tools to forecast product capacity, related security risk, and resource demands
Handles incidents during on-call shifts by assessing impact, troubleshooting complex problems, taking appropriate action to mitigate impact, and leading investigations to address root cause(s)
Leverages existing tools and automation, including the safe deployment process (SDP), to enable product engineering teams within their organization to increase the velocity with which they can reliably and safely implement changes in production
Draws insights from performance and resource monitoring across products and services within their organization to identify whether there is a need to optimize algorithms, security, infrastructure, or architecture
Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics of systems, platforms, or products operating at scale
Develops end-to-end technical expertise in the architecture, code, features, and operations of specific products as required to implement improvements in product availability, security, quality, observability, reliability, efficiency, and/or performance
Demonstrates end-to-end expertise in distributed systems design, interactions between cloud technology layers and components, functions of physical network devices, and dependencies at scale
Researches and maintains deep knowledge of industry trends as well as advances in cloud technologies
Requirements:
Master's Degree in Computer Science, Information Technology, or related field AND 6+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience
3+ years technical experience working with large-scale cloud or distributed systems
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check
Nice to have:
Experience running and operating online live site services, including DRI rotation and incident management
Experience using AI tools to rapidly analyze large volumes of service telemetry
Doctorate Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 6+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration