Support for Mission Critical is a team within Microsoft that provides solution-specific expertise designed to drive peak health and optimum performance of a customer’s most important solutions. As a key technical resource for the customer, you will focus primarily on delivering proactive services such as education workshops, assessments, and tailored guidance. Troubleshooting skills are essential, as this role includes working with Microsoft Support to expedite incident resolution.
Job Responsibilities:
Responsible for delivering Support Mission Critical Service offerings, collaborating with CSU (including CSAs, CSAMs), CSS, Engineering, and other teams as needed
Directly accountable for leading the Proactive Resiliency efforts and for coordinating with other teams on the Accelerated Incident Resolution and Monitoring & Observability features of an offering
Proactive Resiliency: Lead technical engagement with specific workloads, prioritizing Reliability, Security, Supportability, Manageability, and Monitoring and Observability
Coordinate the onboarding phase, which includes the Consolidated Assessment Week Delivery
Remediate the proactive recommendations identified for the specified workloads
Plan and implement both a Workload-Specific Service Improvement Plan and a Customer Success Plan
Accelerated Incident Resolution: Maintain awareness of and visibility into critical incidents to ensure RCAs and recommendations are captured and linked to Proactive Resiliency efforts
Monitoring & Observability: Collaborate with relevant resources when engaged to help onboard the customer efficiently and effectively, prioritizing customer experience and effort, and drive customer-owned monitoring to enable and improve the customer’s observability capabilities
Cross-Team Leadership: Build a partnership with the CSAM to ensure roles are clearly understood and responsibilities are established, maintaining that partnership throughout the contract and relying on the CSAM for account escalation
Coordinate with the leads of the Accelerated Incident Resolution work stream and, when required, the Proactive Monitoring work stream with our internal partners
Collaborate with support and stakeholders to ensure there is a comprehensive, up-to-date KnowMe available across various teams including CSS
Work with internal teams to request RCAs, augment them with KnowMe, and share them with the customer
Requirements:
Bachelor's Degree in Computer Science, Information Technology, Engineering, Business, Liberal Arts, or related field
4+ years of experience in technical projects, specifically in cloud/infrastructure technologies, information technology (IT) consulting/support, systems administration, network operations, software development/support, technology solutions, practice development, architecture, and/or consulting, OR equivalent experience
Technical Certification in Cloud (e.g., Azure, Amazon Web Services, Google, security certifications)
Proven experience in Cloud Solutions Architecture or Mission Critical Support for enterprise customers
Deep knowledge of Azure infrastructure services (Compute, Storage, Networking), Container services (such as Azure Kubernetes Service) and Platform-as-a-Service (PaaS) offerings
Strong troubleshooting skills across distributed systems and mission-critical workloads
Familiarity with performance optimization, high availability, and disaster recovery strategies
Demonstrated ability to manage high-severity incidents and provide rapid mitigation strategies
Experience working with financial services customers or other highly regulated industries
Excellent verbal and written communication skills for executive-level updates and technical deep dives
Ability to collaborate across engineering, product groups, and global support teams
Expertise in virtualization, VM performance tuning, and cache optimization
Knowledge of observability tools, telemetry, and proactive monitoring solutions
Site reliability or operational troubleshooting experience in large Infrastructure-as-a-Service (IaaS) or other infrastructure environments