Site Reliability Engineer Job at Microsoft Corporation (Bangalore)

Job Description

Microsoft is a company where passionate innovators come to collaborate, envision what can be, and take their careers further. This is a world of more possibilities, more innovation, more openness, and sky’s-the-limit thinking in a cloud-enabled world.
Microsoft’s Azure Data engineering team is leading the transformation of analytics in the world of data with products spanning databases, data integration, big data analytics, messaging and real-time analytics, and business intelligence. The products include Microsoft Fabric, Azure SQL DB, Azure Cosmos DB, Azure PostgreSQL, Azure Data Factory, Azure Synapse Analytics, Azure Service Bus, Azure Event Grid, and Power BI. Our mission is to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture.
Within Azure Data, the messaging and real-time analytics team provides comprehensive solutions and a robust platform that enables users to ingest high-granularity signals (real-time and observability) and complex data, converting them into a competitive advantage in real time for both end users and modern applications. We’re Azure Messaging, a rapidly growing group of around 40 engineers, and we’re experts at moving hundreds of millions of small packets of information into and out of the cloud per second. We work on the cutting edge of distributed messaging systems, where millisecond latency, massive throughput, and 99.99% service availability aren’t tradeoffs: they’re all necessary. Our infrastructure needs to be resilient enough for financial transactions, rapid enough for streaming and gaming applications, and still nimble enough to move many petabytes of data per day.
We build the Azure Service Bus (http://aka.ms/servicebus), Azure Event Hubs (http://aka.ms/eventhub), Azure Event Grid (http://aka.ms/azureeventgrid), and Fabric RTI Eventstreams (http://aka.ms/eventstream) services, which help power Microsoft SaaS applications like Office 365, Xbox Live, Halo, Application Insights, and many more, as well as thousands of external Microsoft customers. Our group fosters a diverse, inclusive, and collaborative work culture that prioritizes people at all times.
We are looking for a Site Reliability Engineer II to help scale and operate Fabric Event Stream as a globally distributed, highly reliable platform, with a primary focus on region build-out, deployment, and site reliability engineering (SRE).

Job Responsibility

Own the end-to-end readiness of Event Stream across Azure regions, including onboarding new regions, driving deployment automation, and ensuring consistent, secure, and compliant service rollout
Work closely with platform, infrastructure, and partner teams (e.g., Event Hubs, Kusto, Fabric platform) to deliver resilient, low-latency streaming experiences on a global scale
Play a key role in advancing our reliability posture, improving availability, monitoring, and incident response across regions
Build strong observability, telemetry, and automated recovery mechanisms to meet high availability and SLA targets
Region Build-out & Deployment: Onboard new regions, drive deployment automation, and ensure consistent service configuration
Reliability & SRE: Improve availability, resiliency, and incident response; own service health across regions
Observability & Operations: Enhance telemetry, monitoring, alerting, and troubleshooting capabilities
Cross-team Collaboration: Partner with platform and infra teams to unblock dependencies and ensure smooth rollout
Production Excellence: Drive root-cause analysis, repair items, and continuous improvement on service reliability

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter

Nice to have

Solid understanding of concurrency, scalability, and fault tolerance
Hands-on experience with cloud platforms (Azure preferred), including service deployment, region onboarding, or infrastructure automation
Experience with streaming or messaging systems (e.g., Azure Event Hubs, Kafka, Service Bus, or similar), including understanding of throughput, latency, and reliability trade-offs
Experience in automation and deployment pipelines, including CI/CD, safe rollout practices, and multi-region configuration management
Proven ability to debug complex production issues and drive fixes across distributed components
Demonstrated ability to work across teams (platform, infra, partner services) to deliver end-to-end solutions and unblock dependencies
Strong problem-solving skills with the ability to navigate ambiguity, design clear solutions, and deliver incrementally at scale
