M365 Copilot Cadets (Customer & Analytics‑Driven Eval Team) turns real‑world customer feedback into evaluation datasets, rubrics, and insights that measurably improve Microsoft 365 Copilot quality. We connect customer scenarios, analytics, and rigorous evaluation frameworks to power a continuous feedback flywheel that accelerates product improvements across Microsoft 365 Copilot. As a Senior Data Scientist on the Cadets team, you will own evaluation analytics end‑to‑end: curate datasets from customer and production signals, author binary‑first rubrics, build LLM (Large Language Model)‑as‑judge graders and measure their human‑match rates, and generate high‑quality synthetic data to scale evaluations. You’ll partner with PM/Eng/Design and VIP customers to ship quality gains and AI features with confidence.
Job Responsibilities:
Evaluation & Feedback Analysis: Convert multi‑source feedback (dogfood, VIP customers, production traces) into a prioritized dataset of 10–100 tasks per scenario, each with prompts and golden outputs
Maintain a living failure taxonomy prioritized by volume × impact × fixability
Build grader prompts (with few‑shots and counter‑examples) that achieve ≥80% human‑match rate, track TPR/TNR on held‑out sets, and prevent reward hacking
Synthetic & Human‑Labeled Data: Design structured tuples to scale high‑signal synthetic data
Orchestrate vendor/partner annotation sprints and live calibrations to align shared judgment
Ensure datasets are reproducible with linked artifacts and robust metadata/trace hygiene
Customer‑Grounded Scenarios: Partner with PMs/solution architects to co‑develop evals with VIP customers so tasks reflect real outcomes and workflows
Quantify the lift from fixes and use it to inform the next hill‑climb
Team Leadership & Ways of Working: Co‑own the Cadets “feedback flywheel” with PM/Eng (instrumentation, taxonomy, guardrails vs. evaluators) and help operationalize weekly checklists, change logs, and judge refresh cadence
Requirements:
Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 1+ year(s) data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
OR Master's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 3+ years data-science experience
OR Bachelor's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 5+ years data-science experience
OR equivalent experience
Experience with building data pipelines, performing large-scale analysis, and implementing ML workflows using Python and SQL
Experience in developing models or designing evaluation frameworks, including A/B testing or prompt-based assessments for LLMs
Ability to meet Microsoft, customer and/or government security screening requirements
Candidates for this position must pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Nice to have:
Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 3+ years data-science experience
OR Master's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 5+ years data-science experience
OR Bachelor's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 7+ years data-science experience
OR equivalent experience
Experience building graders that score persona/tone, contract/formatting (e.g., JSON validity, schema), and tool‑use correctness
Background with structured synthetic data generation and vendor annotation programs
Familiarity with judge mutation/optimization loops
2+ years of customer-facing project-delivery, professional services, and/or consulting experience
AI & Technical Fluency: You don't need to train models, but you know how they work, how to test them, and how to build great products on top of them
Strong communication and stakeholder management skills
Ability to work in a fast-paced, ambiguous environment and deliver results under tight deadlines