Cloud Solution Architecture - Infrastructure Job at Microsoft Corporation (Singapore)

Job Description

The Infrastructure Cloud Solution Architect (CSA) serves as a trusted technical advisor for Microsoft's most strategic and mission-critical customers. This role helps customers improve the reliability, resilience, security, performance, and operational excellence of their Azure environments through proactive assessments, technical guidance, incident leadership, and cross-functional collaboration. Working within a global follow-the-sun operating model, the CSA collaborates closely with customers, Microsoft Engineering, Support, and Customer Success teams across multiple regions and time zones to drive rapid incident resolution, operational improvements, and long-term business outcomes. Success requires deep technical expertise, strong customer advocacy, and the ability to navigate complex operational challenges while influencing stakeholders across diverse organizations and cultures.

Job Responsibilities

Trusted Advisor & Customer Advocacy
Act as a trusted technical advisor, helping customers improve the reliability, resiliency, security, performance, and operational maturity of mission-critical workloads running on Azure
Advise customers and stakeholders on architecture, operations, and best practices aligned with the Azure Well-Architected Framework
Actively listen to and understand customer priorities, advocate on their behalf within Microsoft, and drive outcomes measured through customer satisfaction, operational excellence, and business impact
Build strong technical relationships with customers and Microsoft stakeholders, establishing credibility through deep technical expertise and trusted guidance
Communicate complex technical concepts and recommendations in clear, actionable terms to both technical and executive audiences

Incident Leadership & Operational Excellence
Lead complex troubleshooting efforts across infrastructure, platform, and application layers, including critical and high-severity incidents
Operate effectively in high-stakes, customer-impacting incidents, combining platform expertise and customer business context to accelerate mitigation, recovery, and restoration of service
Facilitate Root Cause Analysis (RCA) activities for critical incidents, helping customers identify corrective and preventative actions that reduce future risk
Analyze support cases, operational telemetry, incident trends, and platform events to identify recurring risks and recommend proactive remediation measures
Drive reduction of reactive operational demand through reliability-focused recommendations, operational maturity improvements, resiliency best practices, and service optimization initiatives
Promote operational excellence across reliability, availability, security, performance, recoverability, and capacity management

Proactive Risk Management & Continuous Improvement
Perform proactive health assessments, risk reviews, and operational analysis to identify opportunities for improvement and escalation prevention
Maintain a culture of curiosity by looking beyond immediate symptoms and root causes to understand systemic factors, historical decisions, and operational patterns that drive long-term improvements
Correlate customer requirements, operational events, and platform signals into actionable recommendations with clear accountability and ownership
Drive operational maturity through recommendations for observability, monitoring, automation, governance, reliability engineering practices, disaster recovery preparedness, and service management processes
Utilize telemetry, monitoring platforms, observability tools, and query languages to investigate issues, identify trends, and develop actionable insights

Customer Engagement & Service Delivery
Develop and maintain deep technical understanding of assigned customer environments, architectures, dependencies, and mission-critical workloads
Create and maintain customer knowledge documentation, operational records (KnowMe), and workload profiles
Deliver onboarding assessments and help define service delivery and improvement plans aligned with customer objectives
Scope technical engagements, facilitate discussions on workstreams, prioritize recommendations, and align stakeholders on action plans and expected outcomes
Track remediation progress and drive alignment across customers and Microsoft stakeholders

Global Collaboration & Stakeholder Management
Operate effectively within a global follow-the-sun support model, collaborating with teams across multiple regions and time zones to ensure continuity of service for mission-critical workloads
Maintain awareness of ongoing customer engagements, incidents, escalations, and engineering activities occurring outside local business hours, incorporating relevant developments into ongoing service delivery
Drive effective cross-time-zone coordination through structured handoffs, action tracking, stakeholder alignment, and knowledge sharing
Build strong partnerships across Microsoft Engineering, Support, Customer Success, Product Groups, and other stakeholders to accelerate issue resolution and drive customer outcomes
Communicate complex technical and operational topics clearly across diverse technical, business, and cultural audiences
Establish trusted technical relationships with both customers and Microsoft stakeholders, enabling effective collaboration during critical incidents, proactive engagements, and strategic initiatives
Build and strengthen partnerships across Microsoft teams, including Engineering, Azure Engineering Direct (AED), Azure Rapid Response (ARR), Customer Success Account Managers (CSAMs), Support, and other stakeholders
Collaborate effectively across teams, cultures, and organizational boundaries to drive customer success and operational improvements

Success Measures
Improvements in workload reliability, resiliency, security, and operational maturity
Adoption of recommended architecture, operational practices, and remediation plans
Reduction in customer-impacting incidents, repeat escalations, and operational risk
Faster mitigation and recovery of critical incidents
Effective coordination across global teams, ensuring seamless customer support and operational continuity across regions and time zones
Increased customer satisfaction and trusted advisor influence
Positive business outcomes through improvements in reliability, security, performance, capacity management, and service resilience

Requirements

Bachelor’s Degree in Computer Science, Information Technology, Engineering, or a related field AND 7+ years of relevant experience supporting mission-critical production environments, OR equivalent practical experience
Experience supporting mission-critical production environments
Experience leading or coordinating Sev A / P1 incidents
Experience providing recommendations to enterprise customers
Experience improving reliability, resiliency, performance, security, or operational maturity
Experience working across multiple time zones and globally distributed teams
Experience coordinating multiple technical teams to resolve customer issues
Experience with telemetry, monitoring, logging, and root-cause analysis
Experience with DR, HA, BCP, and recovery planning
