We’re looking for a technical, systems-minded operator to build and scale the evaluation engine behind Harvey’s platform. As we expand globally, ensuring our models behave reliably, accurately, and jurisdictionally correctly is mission-critical, and evaluation complexity is increasing 10x.

As a member of our Product Operations team, you’ll work closely with Applied Legal Researchers, Product, Engineering, AI Research, and human data providers to operationalize evaluation methodologies and embed them into our product development lifecycle. You’ll create the workflows, systems, and tooling that make evaluation a first-class product capability at Harvey. This is a high-ownership role for someone who thrives in ambiguity, loves building structure where none exists, and wants to help scale the evaluation infrastructure of a global AI company.
Job Responsibilities:
Build and scale the systems that power model and product evaluations across Harvey
Embed evaluation workflows and readiness checkpoints into the product development lifecycle
Create the single source of truth for evaluation status, results, history, and launch readiness
Turn expert-designed evaluation methodologies into scalable, repeatable operational processes
Manage relationships with human data vendors and ensure evaluation quality meets legal standards
Work with Engineering and Research to improve evaluation tooling, automation, and dashboards
Drive evaluation readiness for major product and model launches across geographies and jurisdictions
Document and operationalize evaluation governance as complexity increases
Help define how Harvey ensures model accuracy, reliability, and trust at global scale
Requirements:
4–7+ years in technical program management, product operations, research operations, or evaluation/benchmarking roles
Experience working with ML/AI evaluations, benchmarking frameworks, or scientific workflows
Comfort with statistical methodologies and with SQL, Python, or similar tools for interpreting evaluation data
Ability to work deeply with legal experts and operationalize complex evaluation methodologies
Strong cross-functional coordination skills across Product, Engineering, Research, and data providers/vendors
High attention to detail and a bias toward clarity, rigor, and reproducibility
Ability to navigate extreme ambiguity and bring order to complex systems
Strong communication skills and comfort translating technical nuance for diverse stakeholders
Desire to do whatever it takes to make evaluation systems successful—from writing documentation to diagnosing pipeline issues
Nice to have:
Experience in legal tech or working with domain experts in regulated industries
Experience managing human data providers or human-in-the-loop evaluation pipelines
Background in ML research, data quality management, or evaluation science
Experience as an early employee at a hyper-growth startup
Experience at a world-class product or platform operations org (e.g., Stripe, Ramp)