As a Member of Technical Staff, LLM Evaluation, you will develop and implement cutting-edge methodologies to help us evaluate how well Copilot performs in real-world usage scenarios. Users turn to Copilot for all types of endeavors, making it critical that we ensure our AI systems effectively help them meet their needs. Our vision for meeting user needs is expansive and includes not only task completion, but also affective aspects of the experience. You will be responsible for developing new methods to evaluate LLMs, training classifiers, experimenting with data collection techniques, and implementing methodologies to provide real-time signals on Copilot performance. We're looking for outstanding individuals with experience in the social sciences, machine learning, and analysis of natural language. The right candidate is a creative problem solver who will work closely with user researchers and product leaders to build automated evaluation frameworks that help us drive improvements in Copilot.
Job Responsibilities:
Leverage expertise to measure the performance of Copilot and to identify failure modes and novel mitigation strategies, using techniques including data mining, prompt engineering, LLM-as-a-judge, and classifier training
Solve problems creatively, navigate complexity with clarity, and independently shape direction and deliver results even when the path isn’t obvious
Create and implement comprehensive evaluation frameworks across diverse scenarios, edge cases, and potential failure modes
Build automated testing systems, generalize solutions into repeatable frameworks, and write efficient code for model pipelines and intervention systems
Maintain a user-oriented perspective by understanding user needs, validating approaches through user research, and serving as a trusted advisor on AI matters
Track advances in research, identify relevant state-of-the-art techniques, and adapt algorithms to drive innovation in production systems serving millions of users
Requirements:
Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 5+ years data-science experience
OR Master's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 7+ years data-science experience
OR Bachelor's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 10+ years data-science experience
OR equivalent experience
Experience prompting and working with large language models
Experience writing production-quality Python code
Demonstrated interest in Responsible AI
Nice to have:
Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 8+ years data-science experience
OR Master's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 10+ years data-science experience
OR Bachelor's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 12+ years data-science experience