Microsoft’s Cloud Operations & Innovation (CO+I) organization powers the infrastructure that enables Microsoft’s cloud services. Within CO+I, Critical Environment Systems Intelligence (CESI) builds and maintains intelligence systems, environmental telemetry pipelines, reliability models, and automation workflows that keep Microsoft’s datacenters operating safely and efficiently at hyperscale. Central to CESI, the Data Center Infrastructure Data Engineering & Analytics (DC IDEA) team extends this foundation by developing telemetry pipelines, analytics platforms, and data models that transform raw datacenter signals into actionable insights. IDEA increases observability, accelerates fault detection, and strengthens operational readiness across Microsoft’s global datacenter fleet.
Within IDEA, the RADAR team leads Microsoft’s sensor‑health visibility and detection strategy. RADAR designs and operates sensor‑health detection logic, alerting frameworks, and triage workflows that ensure sensor reliability across leased and company‑operated datacenter environments, making Microsoft’s cloud more resilient and reliable.
As a Senior Technical Program Manager on the RADAR team, you will own cross‑organizational programs that deliver end‑to‑end sensor‑health detection, alerting, and triage. You will design and operationalize workflows, establish clear engagement models, and drive the onboarding of new detection scenarios, directly improving reliability for Microsoft’s global datacenter fleet. This role blends technical depth with program leadership to turn noisy telemetry into actionable signals, streamline incident response, and raise the bar on observability and availability.
Job Responsibilities:
Lead delivery of RADAR’s mission by implementing and scaling sensor‑health detection, alerting, and triage capabilities across Microsoft datacenters
Design and operationalize core workflows for sensor‑health detection, alert routing, validation, and triage
Drive cross‑team orchestration by creating and strengthening relationships across engineering, hardware, operations, and service teams
Build and manage onboarding processes for new telemetry types and detection scenarios
Champion process excellence by maturing workflows, training partners, and driving adoption of consistent operating models
Lead partner alignment and influence to shape and deliver shared roadmaps across divisional boundaries
Identify gaps and opportunities through structured feedback loops; synthesize insights into clear problem statements
Manage schedules and execution across epics, sprints, semester plans, and releases
Produce clear technical documentation including specifications, decision records, runbooks, and operational procedures
Drive continuous improvement by monitoring detection quality, validating system behavior, and guiding enhancements
Requirements:
Bachelor's Degree AND 4+ years of experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
2+ years of experience managing cross-functional and/or cross-team projects
Communicate complex technical and operational topics clearly and concisely to senior leaders
Drive end‑to‑end orchestration across upstream telemetry producers and downstream incident response and operations consumers
Facilitate cross‑team design discussions to align on workflows, validation criteria, handoffs, and governance models
8+ years of experience in technical program management, engineering, or reliability/observability domains, preferably in the Datacenter Critical Environment space
Demonstrated ability to lead complex, multi‑team initiatives from concept to production in large‑scale environments
Ability to read and reason about technical documentation, schemas, APIs, and data models
Strong analytical and problem‑solving skills; comfortable working with metrics, dashboards, instrumentation, and system‑performance data
Proven ability to drive clarity, structure, and alignment across engineering and operations stakeholders
Experience with telemetry ingestion, stream processing, anomaly detection, signal quality evaluation, or alerting systems
Familiarity with incident management, SRE practices, service‑health measurement, and operational readiness frameworks
Experience collaborating across hardware, software, and datacenter operations teams in high‑scale technical environments
Ability to produce concise specifications, frameworks, and operational workflows that enable complex operational teams