The Benchmarking team defines how progress is measured. Researchers design evaluation frameworks that capture reasoning depth, interaction quality, reliability, and operational impact. They construct benchmarks that reflect real-world complexity. Their systems become the standard by which new architectures, techniques, and releases are judged.

Researchers in Benchmarking explore new paradigms for evaluating intelligent systems: adversarial robustness testing, longitudinal performance tracking, and human-in-the-loop assessment. They investigate how metrics shape model behavior and establish rigorous methodologies for quantifying emergent capability. Their insights drive both Distyl’s internal research priorities and industry-wide standards.
Job Responsibilities:
Design evaluation frameworks that capture reasoning depth, interaction quality, reliability, and operational impact
Construct benchmarks that reflect real-world complexity
Explore new paradigms for evaluating intelligent systems (adversarial robustness testing, longitudinal performance tracking, human-in-the-loop assessment)
Investigate how metrics shape model behavior
Establish rigorous methodologies for quantifying emergent capability
Requirements:
Experience designing and running evaluations (built or maintained benchmarks, test suites, or experimental frameworks)
Statistical and analytical rigor (design fair, reproducible experiments)
Experience building with models, not just building models (expertise in compound AI systems, agentic collaboration, ensembling, ReAct, graph-of-thoughts)
Proven track record of research results (published in top journals or shared publicly online)
Daily use of AI tools (such as ChatGPT, Cursor, Perplexity)
Strong programming and data analysis skills
Bias toward showing over telling
What we offer:
100% covered medical, dental, and vision for employees and dependents
401(k), plus additional perks (commuter benefits, in-office lunch)
Access to state-of-the-art models
Generous usage of modern AI tools
Ownership of high-impact projects across top enterprises