Microsoft Cloud Infrastructure and Operations (CO+I) is the engine that powers Microsoft's cloud services. The group is responsible for designing, building, and operating Microsoft's global datacenters; managing the programmatic delivery of our critical infrastructure design, equipment procurement, construction delivery, infrastructure innovation, demand planning, and capacity utilization of our unified infrastructure; and performing all operations needed to run the physical infrastructure. We focus on smart growth with an emphasis on automation, data-driven engineering, cost-effectiveness, and environmental sustainability. We deliver the core infrastructure and foundational technologies for Microsoft's 200+ online businesses including Azure, Office 365, Bing, Xbox Live, Skype, and OneDrive. Our portfolio is built and managed by a team of subject matter experts working 24x7x365 to support services for more than 1 billion customers and 20 million businesses in over 90 countries worldwide.
Within CO+I, the Data Center Incident Management (DCIM) Team is responsible for 24x7x365 incident management for Microsoft data centers worldwide. Within the DCIM Team, we are seeking a highly motivated and experienced Senior Incident Manager. If you are a strategic thinker with a passion for driving business success, we encourage you to apply for this exciting opportunity.
Job Responsibility:
Shares insights and best practices that can be applied to improve development and operations across related sets of the systems, services, platforms, and/or products
Mentors and coaches other engineers to help them identify and propose relevant solutions
Collaborates within and across teams by proactively and systematically sharing information with an appropriate level of detail for their audience
Overcomes obstacles by resolving conflicts and issues across interdependent teams and engages with partners and stakeholders so issues can be resolved and mutual objectives are met
Develops, leverages, and drives sharing of information and knowledge base across teams
Leverages advanced technical expertise, judgment, and decision making to coordinate multiple work streams and resources in crisis situations, driving a mitigation plan to resolve, reduce, or mitigate the impact of a crisis by engaging necessary teams and escalating to appropriate stakeholders
Independently conducts root cause analyses and participates in post-incident reviews of incidents/crises for the purpose of leading continuous improvement
Applies diagnostic expertise
Provides guidance to other engineers working to mitigate and resolve issues
Communicates customer impact and other relevant information with key stakeholders, leadership, and customers
Develops and drives projects and programs to improve crisis response by creating standard practices for consistent response across engineering teams
Fosters increased service stability
Reduces future noise by participating in optimization of telemetry and alarming
Influences key stakeholders to adopt new standards and practices to broadly improve crisis and problem management
Creates, monitors, and takes action on telemetry data and influences telemetry analytics to better identify patterns that reveal errors and unexpected problems that are affecting the system's availability, reliability, performance, and/or efficiency
Develops scripts and/or automation and leverages an understanding of solutions to define, develop, measure, track, change, and improve the quality of telemetry pipelines that support automated monitoring and incident response
Identifies and develops telemetry collaborations that result in better-together services
Responds to incidents during regular on-call rotations, including complex incidents with major customer or business impact, by identifying the level of impact, troubleshooting, contributing to difficult decisions based on business impact, deploying appropriate fixes to resolve root cause(s), and implementing automations for prevention of recurring incidents through coordinating resources required for incident resolution
Escalates resolution of highly complex, ambiguous, and impactful incidents as needed
Contributes to postmortems and shares details related to incidents and their resolution through postmortem reports and regular review meetings
Provides expert incident response assistance to other Service Engineers as needed, and develops incident response and resolution guidance
Adheres to and promotes prescriptive guidance for security, privacy, and compliance standards in alignment with direction from the business and technical experts
Works with security, privacy, and compliance teams to identify and address issues relevant to their services and resolve them within the service level agreement (SLA)
Provides assistance to other service engineers as needed
Independently implements reliable, scalable, and high-performance solutions across teams
Contributes to design documents
Owns implementation and rollback plans
Maintains quality checklist and related documentation
Quantifies and ensures the health and compliance of a service according to Engineering and industry standards
Monitors and maintains security by addressing security vulnerabilities through patches, reconfigurations, and/or settings updates
Identifies, prioritizes, and targets solutions to complex security issues that may impact customers and partners, and drives action to promote the adoption of relevant mitigations
Drives program and process of mitigation, troubleshoots system issues, and partners closely with internal customers and engineering teams to conduct root cause analyses, share end-to-end expertise in services, and to mitigate and resolve issues
Communicates and drives adherence to security policies and procedures
Takes ownership of service design by driving efforts within an organization to identify, define, recommend, and build optimal configurations of technology solutions with consideration for cost management, service health, security, resiliency, and reliability, while taking into account the scalability of services
Develops end-to-end expertise in service and/or system design, interactions between technology layers and components, functions of infrastructure, and dependencies at scale
Independently adjusts configurations and defines infrastructures to improve the availability, reliability, efficiency, observability, and/or performance of supported products and services
Drives collaborative reviews with the engineering teams that develop and/or manage services and other stakeholders, identifying opportunities for efficiencies in operations and sharing learnings and recommendations across engineering teams and other stakeholders working on related services within their organization
Independently designs a service/system in a manner that allows for robust and scalable measurement of quantifiable metrics for assessing health, quality, and functionality
Stays current in knowledge and expertise as technology landscape evolves, maintaining awareness of industry norms
Uses knowledge to drive the adoption of new solutions across engineering teams working with related products within an organization
Provides guidance to others through sharing, coaching, conferences, and other means to drive improvements across teams
Requirements:
Bachelor's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 3+ years technical experience in data center or critical environment space OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Nice to have:
Master's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 6+ years technical experience in data center or critical environment space OR Bachelor's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 8+ years technical experience in data center or critical environment space OR equivalent experience
3+ years technical experience working with large-scale cloud or distributed systems