This list contains only the countries for which job offers have been published in the selected language (e.g., in the French version, only job offers written in French are displayed, and in the English version, only those in English).
The Agent Evaluation team is responsible for testing whether AI agents return the correct and expected responses. We build the framework, metrics, and test cases that validate agent behavior, accuracy, and reliability before release. Our goal is to ensure agents perform consistently and meet product and user expectations. The Manager, Agent Evaluation will lead the team responsible for building and scaling the evaluation framework that tests whether AI agents return accurate, reliable, and expected responses across real-world scenarios.
Job Responsibility:
Lead and grow a team focused on agent and model evaluation
Define the strategy, roadmap, and standards for agent testing and validation
Oversee development of metrics, benchmarks, and testing frameworks to measure response quality, accuracy, safety, and performance
Ensure evaluation coverage aligns with product, UX, and business requirements
Partner closely with Product, Engineering, Research, and Platform teams to integrate evaluation into the development lifecycle
Drive experimentation and continuous improvement of evaluation methodologies
Establish reporting mechanisms to clearly communicate evaluation results and trade-offs to leadership
Implement best practices for model versioning, monitoring, and release validation
Stay current with advancements in LLMs, AI agents, and evaluation techniques
Requirements:
Strong foundation in machine learning fundamentals and applied ML systems
Hands-on experience with model and agent evaluation methodologies
Familiarity with LLMs, AI agents, and prompt-driven systems
Proficiency in Python and modern ML frameworks (e.g., PyTorch, TensorFlow)
Experience defining metrics, benchmarks, and experimentation frameworks
Solid understanding of MLOps practices, including model versioning, monitoring, and CI/CD
Ability to collaborate effectively with product, platform, and research teams
Clear communicator of technical trade-offs, evaluation insights, and results