We are looking for a deeply technical and forward-thinking Principal Product Manager to lead the Evaluation Strategy for Microsoft Agent 365 Tools, including MCP servers and Skills. This role sits at the core of our agentic ecosystem, where large-scale model orchestration, multi-tool trajectories, grounding data, and evaluation rigor converge. You will be responsible for defining how we measure, validate, and continuously improve the performance of 1P and 3P agentic tools across diverse models (OpenAI, Claude, Microsoft models), orchestrators, and enterprise workflows. Your work will directly determine our product quality bars, release gates, and certification processes, and will shape how customers and the industry perceive the reliability of Microsoft’s agentic platform.
Job Responsibility:
Define and own the evaluation strategy for all 1P and 3P agentic tools, such as MCP servers and skills, including tool invocation success, tool quality, trajectory evaluation, intent detection, and scenario‑level scoring
Develop a unified framework covering offline evals, online evals, AI‑judge‑based evals, and assertion‑based rubric design
Partner with engineering to evolve internal platforms like Agent 365 Evals, Agent Arena, dashboards, CI/CD‑integrated nightly evals, and metrics pipelines
Create grading frameworks, mapping strategies, and ground truth generation mechanisms, including automation for user‑intent derivation
Establish cross‑model, cross‑orchestrator eval infrastructure, i.e., ensure agentic tools work reliably across all major LLMs and orchestrators
Design and maintain evaluation suites that capture model regressions, tool invocation drift, and scenario fidelity as products evolve
Drive alignment with internal partners and ISV teams to ensure consistent evaluation approaches, shared pipelines, and consolidated quality dashboards
Define product readiness criteria for 1P/3P tools, aligning certification requirements for partner‑built agentic tools
Partner with responsible AI, security, governance, and compliance teams to ensure eval frameworks respect enterprise boundaries and safety constraints
Track the latest developments in multi‑agent evaluation frameworks, trajectory alignment research, and AI behavioral evals
Bring state‑of‑the‑art thinking from academia, industry, and trajectory evals to shape Microsoft's evaluation strategy for enterprise agentic tools
Requirements:
Bachelor's Degree AND 8+ years experience in product/service/program management or software development OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Bachelor's Degree AND 12+ years experience in product/service/program management or software development OR equivalent experience
4+ years experience taking a product, feature, or experience to market (e.g., design, addressing product-market fit, launch, or an internal tool/framework)
6+ years experience improving product metrics for a product, feature, or experience in a market (e.g., growing customer base, expanding customer usage, avoiding customer churn)
6+ years experience disrupting a market for a product, feature, or experience (e.g., competitive disruption, taking the place of an established competing product)
Demonstrated technical depth across LLMs and line-of-business systems, with proven experience leading AI/LLM evaluation strategy, including offline/online eval frameworks, rubric and AI judge design, and defining measurable quality bars for agentic tools and orchestration workflows
Cross-functional collaboration skills, with the ability to influence across engineering, research, design and business teams
Exceptional written and verbal communication skills, with a knack for storytelling and clear articulation of complex ideas
High tolerance for ambiguity and a bias for action in dynamic environments