Microsoft’s Cloud Operations & Innovation (CO+I) organization powers the infrastructure that enables Microsoft’s cloud services. Within CO+I, Critical Environment Systems Intelligence (CESI) builds and maintains intelligence systems, environmental telemetry pipelines, reliability models, and automation workflows that keep Microsoft’s datacenters operating safely and efficiently at hyperscale. Central to CESI is the Data Center Infrastructure Data Engineering & Analytics (DC IDEA) team, which extends this foundation by developing telemetry pipelines, analytics platforms, and data models that transform raw datacenter signals into actionable insights. IDEA increases observability, accelerates fault detection, and strengthens operational readiness across Microsoft’s global datacenter fleet. Within IDEA is the RADAR team, which leads Microsoft’s sensor‑health visibility and detection strategy. RADAR designs and operates sensor‑health detection logic, alerting frameworks, and triage workflows that ensure sensor reliability across leased and company‑operated datacenter environments, making Microsoft’s cloud more resilient and reliable.
As a Senior Technical Program Manager on the RADAR team, you will own cross‑organizational programs that deliver end‑to‑end sensor‑health detection, alerting, and triage. You will design and operationalize workflows, establish clear engagement models, and drive the onboarding of new detection scenarios, directly improving reliability for Microsoft’s global datacenter fleet. This role blends technical depth with program leadership to turn noisy telemetry into actionable signals, streamline incident response, and raise the bar on observability and availability.
Job Responsibilities:
Lead delivery of RADAR’s mission by implementing and scaling sensor‑health detection, alerting, and triage capabilities across Microsoft datacenters, ensuring high‑quality signal visibility and reliable operational outcomes
Design and operationalize core workflows for sensor‑health detection, alert routing, validation, and triage, partnering closely with upstream telemetry systems and downstream incident‑response teams
Drive cross‑team orchestration by creating and strengthening relationships across engineering, hardware, operations, and service teams to integrate and execute multi‑feature scenarios and platform capabilities
Build and manage onboarding processes for new telemetry types and detection scenarios, including requirements templates, validation criteria, handoff procedures, and governance frameworks
Champion process excellence by maturing workflows, training partners, and driving adoption of consistent operating models for new signals, anomaly detection patterns, and incident‑response processes
Lead partner alignment and influence to shape and deliver shared roadmaps across divisional boundaries, ensuring detection, alerting, and observability capabilities evolve cohesively
Identify gaps and opportunities through structured feedback loops; synthesize insights into clear problem statements, repeatable patterns, and actionable guidance for leadership and engineering stakeholders
Manage schedules and execution across epics, sprints, semester plans, and releases, tracking dependencies, anticipating risks, and driving cohesive delivery across partner teams
Produce clear technical documentation including specifications, decision records, runbooks, and operational procedures to support partner readiness and consistent implementation
Drive continuous improvement by monitoring detection quality, validating system behavior, and guiding enhancements that strengthen reliability, observability, and operational readiness
Requirements:
Bachelor's Degree AND 4+ years of experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
2+ years of experience managing cross-functional and/or cross-team projects
Nice to have:
8+ years of experience in technical program management, engineering, or reliability/observability domains, preferably in the Datacenter Critical Environment space
Demonstrated ability to lead complex, multi‑team initiatives from concept to production in large‑scale environments
Ability to read and reason about technical documentation, schemas, APIs, and data models to support design and decision‑making
Strong analytical and problem‑solving skills; comfortable working with metrics, dashboards, instrumentation, and system‑performance data
Proven ability to drive clarity, structure, and alignment across engineering and operations stakeholders
Experience with telemetry ingestion, stream processing, anomaly detection, signal quality evaluation, or alerting systems
Familiarity with incident management, SRE practices, service‑health measurement, and operational readiness frameworks
Experience collaborating across hardware, software, and datacenter operations teams in high‑scale technical environments
Ability to produce concise specifications, frameworks, and operational workflows that enable complex operational teams