As a Research Scientist, Human–AI Interaction, you will play a pivotal role in defining how AI systems support real human work by leading research at the intersection of Human–Computer Interaction (HCI), Large Language Models (LLMs), and task-level benchmarking. You will operate at the frontier of human-centered AI evaluation, focusing on what people actually do to accomplish meaningful work and on how AI systems change, accelerate, or reshape that activity. Your research will define jobs-to-be-done benchmarks, comparative evaluation frameworks, and empirical methods for measuring human effort, time, quality, and outcomes when people work with AI copilots.

The Handshake AI platform is also an interface used by thousands of the world's top subject matter experts to evaluate AI systems, and it raises numerous HCI and human-in-the-loop AI (HITL-AI) research questions with large potential business impact. You will set research direction, establish standards for measuring human activity in AI-mediated workflows, publish papers and open-source code, and lead the development of rigorous, scalable benchmarks that connect human work, AI assistance, and real economic value.
Responsibilities:
- Lead high-impact research on jobs-to-be-done benchmarks for AI systems, including:
  - Defining task taxonomies grounded in real professional and economic activities
  - Identifying what constitutes meaningful task completion, quality, and success
  - Translating qualitative understanding of work into measurable, repeatable benchmarks
- Develop methods to measure human activity in AI-mediated workflows
- Design benchmarks that assess AI as a collaborator or copilot, rather than as an autonomous agent or a basic Q&A system
- Design and run empirical studies of how people use AI to solve tasks, including:
  - Controlled experiments and field studies measuring task performance
  - Instrumentation for capturing fine-grained interaction traces and outcomes
- Drive strategy for professional-domain AI benchmarks, focusing on:
  - Understanding domain-specific workflows (e.g., analysis, writing, planning, coordination)
  - Grounding benchmark design in how work is actually performed, not in idealized tasks
- Build and prototype AI systems and evaluation infrastructure to support research and data production, including:
  - LLM-powered copilots and experimental tools used for task-level measurement
  - Benchmark harnesses that evaluate both model behavior and human outcomes
  - Data pipelines for analyzing human–AI interaction at scale
  - The human-in-the-loop experience through which Handshake fellows produce effective evaluations and training data for frontier models via structured UI/UX interactions with those models
- Collaborate closely with User Experience Research (UXR) to:
  - Leverage deep qualitative insights into real user behavior and workflows
  - Translate ethnographic and observational findings into formal research constructs
- Publish and present research that advances the field of human-centered AI benchmarking, with regular contributions expected at top-tier venues such as CHI (Conference on Human Factors in Computing Systems) and related HCI and AI conferences
Requirements:
- PhD or equivalent experience in Human–Computer Interaction, Computer Science, Cognitive Science, or a related field, with a strong emphasis on empirical evaluation of interactive AI/LLM systems
- 3+ years of post-PhD academic or industry research experience, including leadership of complex research initiatives and experience analyzing data from a real AI product
- Strong publication record, with demonstrated impact in top-tier AI (NeurIPS, ICML, ICLR, ACL) and HCI (CHI) venues
- Deep expertise in experimental design and measurement, particularly for:
  - Task performance and human activity
  - Comparative evaluation frameworks
  - Mixed-methods research grounded in real-world behavior
- Strong technical and coding skills, including:
  - Python and data analysis / ML tooling
  - Experience building experimental systems and benchmark infrastructure
  - Familiarity with LLM APIs, agent frameworks, or AI-assisted tooling
- Proven ability to define and lead research agendas that connect human work, AI capability, and business or economic impact
- Strong collaboration skills, especially across research, engineering, product, and UXR teams
Nice to have:
- Experience developing benchmarks or evaluation frameworks for human–AI systems or AI-assisted productivity tools
- Prior work on copilot-style systems, agentic workflows, or automation of professional tasks
- Familiarity with workplace studies, CSCW, or socio-technical systems research
- Contributions to open-source tools, datasets, or benchmarks related to task-level evaluation
- Interest in how AI reshapes labor, productivity, and the future of work