The Applied Science team for GitHub Copilot sits at the intersection of frontier AI research and the world's largest developer platform. We ship AI-powered experiences (e.g., code completion, code review, coding agents) used by millions of professional developers every day. As a member of the team, you will help lead GitHub Copilot's AI evaluation strategy end-to-end — from benchmark design and lifecycle governance, through evaluation infrastructure and internal adoption, to community engagement and public transparency. You are the person who ensures that every model swap, product harness, and feature launch is measured against what actually matters to developers — and that the world can see the results.
Job Responsibilities:
Partner with Applied Science researchers to translate cutting-edge evaluation research into production systems: adaptive testing (item response theory, IRT), agent-centric co-evolution, adversarial benchmarking, and telemetry-driven benchmark generation
Lead the deprecation of saturated benchmarks and design their next-generation replacements — including procedurally generated code evaluations that can't be memorized and adaptive testing systems that skip trivial questions for frontier models
Build GitHub's community benchmark submission program — enabling external researchers, enterprises, and open-source developers to contribute domain-specific evaluations — and publish GitHub's first external benchmark transparency reports showing how models perform on real developer workflows
Design and operationalize multi-tier evaluation frameworks — from fast automated regression suites and LLM-as-judge systems, through expert human evaluation, to production A/B testing — so teams can iterate in hours, not weeks
Design feedback-to-benchmark pipelines that convert thumbs-down signals, user frustrations, and support tickets into candidate regression tests — systematizing informal practices into scalable, automated systems
Establish evaluation as a first-class discipline across GitHub Copilot — creating the rituals, dashboards, and communication cadences that make evaluation results accessible and actionable for every team
Requirements:
Bachelor's Degree AND 6+ years of experience in engineering, product/technical program management, data analysis, or product development, OR equivalent experience
3+ years of experience managing cross-functional and/or cross-team projects
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Nice to have:
5+ years of experience in technical program management, product management, applied science, or equivalent
2+ years managing programs in machine learning, AI/ML evaluation, or data science
2+ years managing cross-functional and/or cross-team projects
Deep, firsthand experience with AI/ML evaluation methodologies: benchmark design and validity, human evaluation frameworks, automated scoring systems (including LLM-as-judge), A/B testing, and statistical significance testing
Deep personal experience with AI coding tools — you use Copilot, Cursor, Claude Code, or similar tools daily and have strong opinions about what 'good' looks like from a developer's perspective
Understanding of software engineering workflows at scale — code review, CI/CD, testing, debugging, refactoring — and how AI tools should integrate into each
Experience with community or open-source program management — contributor programs, external research partnerships, or developer relations in a technical context
Proven ability to navigate competing priorities across teams and build shared commitment to common goals in ambiguous, fast-moving environments
Track record of building evaluation systems that directly influenced product or model shipping decisions at scale