We’re hiring a Principal TPM Data & Telemetry – Windows Reliability (individual contributor) to strengthen our Reliability Telemetry & Insights function, ensuring we can consistently operate and evolve the systems that measure Windows reliability and translate signals into clear, actionable decisions for engineering and partner teams. This role is equal parts telemetry operations, data quality/governance, and insight-to-action program leadership. You will own critical reliability datasets and dashboards end-to-end (from ingestion and validation through reporting and operational rhythms), partner across Windows engineering and ecosystem stakeholders, and help the team scale by building repeatable processes, documentation, and broader bench strength.
Windows reliability is only as strong as the telemetry and operational system behind it. This role ensures our teams can detect regressions early, confidently explain what’s happening, and drive the right corrective actions without depending on a single person’s knowledge.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Job Responsibilities:
Own/operate core reliability data pipelines and reporting workflows (availability, correctness, latency, completeness)
Drive data quality improvements: schema management, identity resolution, deduplication, and metric definitions
Build and maintain dashboards and recurring scorecards that track key reliability outcomes (e.g., crash trends, top drivers/components, device cohorts, regressions, risk flags)
Proactively identify “what changed” and “why it matters” signals, and translate them into recommended actions and owners
Create and maintain clear metric definitions, methodology notes, and interpretation guidance to avoid confusion/misalignment
Collaborate with Windows engineering (e.g., kernel/driver/servicing stakeholders), quality teams, and partner-facing teams to align on measurement and priorities
Support OEM/silicon/partner conversations with accurate, explainable reliability telemetry and narratives
Drive cross-team alignment on what actions are required when telemetry indicates regressions or out-of-policy behavior
Identify gaps in telemetry coverage and propose/drive work to close them (instrumentation improvements, new cuts, improved categorization)
Improve automation and scale: reduce manual reporting, simplify repetitive analysis, and harden tools so others can self-serve
Document critical workflows and institutional knowledge (how-to guides, data lineage, known pitfalls, “how to debug” playbooks)
Create training and enablement materials so others can reliably back up the function
Design work so it is system-owned rather than person-owned (clear ownership maps, redundancy, measurable SLAs)
Requirements:
Bachelor's Degree AND 6+ years’ experience in engineering, product/technical program management, data analysis, or product development OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check: must pass upon hire/transfer and every two years thereafter
Nice to have:
3+ years of experience managing cross-functional and/or cross-team projects
7+ years of experience in one or more of: program management, data/analytics engineering, reliability engineering, telemetry operations, or product analytics