You'll directly impact Replit's AI agent—the core of our product strategy—by defining how we measure success, designing experiments that drive improvements, and turning agent trace data into actionable insights for the AI team and company leadership.
Job Responsibilities:
Design and analyze experiments to measure agent improvements—from model changes to UX variations—with statistical rigor and practical tradeoffs
Define success metrics that connect agent trace data (prompts, responses, code changes, execution outcomes) to user outcomes like successful deploys, retention, and revenue
Build the semantic layer for agent data in partnership with data engineering—defining the tables, metrics, and models that enable self-serve analysis across the AI team
Surface insights from trace analysis that identify failure modes, successful patterns, and opportunities to improve agent effectiveness
Partner with AI engineering, product, and leadership to translate data into roadmap decisions; you'll have a seat at the table for critical agent strategy discussions
Create dashboards and reporting that surface agent performance metrics (task completion, latency, quality scores, user satisfaction) for the AI team and executives (a minimal metric-rollup sketch follows this list)
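
To make the dashboard metrics above concrete, here is a minimal Python sketch of rolling raw agent trace events up into session-level task-completion and latency figures. The DataFrame, its column names (session_id, event_ts, status), and the success convention are illustrative assumptions, not Replit's actual schema.

    # Minimal sketch: roll raw agent trace events up into session-level
    # performance metrics (task completion and latency). Column names
    # (session_id, event_ts, status) are hypothetical; the real schema
    # would come from the semantic layer.
    import pandas as pd

    def session_metrics(events: pd.DataFrame) -> pd.DataFrame:
        """Aggregate per-event trace rows into one row per agent session."""
        grouped = events.groupby("session_id")
        return pd.DataFrame({
            # A session "completes" if any of its events reports success.
            "completed": grouped["status"].apply(lambda s: (s == "success").any()),
            # Wall-clock latency from first to last event in the session.
            "latency_s": (grouped["event_ts"].max() - grouped["event_ts"].min())
                         .dt.total_seconds(),
        })

    if __name__ == "__main__":
        events = pd.DataFrame({
            "session_id": ["a", "a", "b", "b"],
            "event_ts": pd.to_datetime(
                ["2024-01-01 09:00:00", "2024-01-01 09:02:30",
                 "2024-01-01 10:00:00", "2024-01-01 10:05:00"]),
            "status": ["running", "success", "running", "error"],
        })
        print(session_metrics(events))  # completion flag and latency per session

In practice, rollups like this would live in the semantic layer (for example, as dbt models) so the AI team can self-serve the same definitions the dashboards use.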
Requirements:
5+ years of experience in data science, analytics, or a quantitative role with a focus on product, growth, or experimentation
Deep experimentation expertise: A/B testing, experiment design, power analysis, handling skewed data, interpreting results beyond p-values (see the power-analysis sketch after this list)
Strong SQL skills and experience designing data models for high-volume event data; experience with dbt or similar transformation tools
Proficiency in Python and data science libraries (pandas, scipy, statsmodels, etc.)
Ability to translate ambiguous questions into structured analysis and communicate findings clearly to both technical and non-technical stakeholders
Bias toward action: you ship insights that influence decisions, not just dashboards
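
As a flavor of the power-analysis work mentioned above, here is a minimal sketch using statsmodels to size an A/B test on a completion-rate metric. The baseline rate and minimum detectable effect are illustrative assumptions.

    # Minimal power-analysis sketch with statsmodels: how many users per arm
    # would an A/B test on a completion-rate metric need? Baseline rate and
    # minimum detectable effect below are assumed for illustration.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline = 0.40          # assumed current task-completion rate
    mde = 0.02               # smallest lift worth detecting (2 points)

    # Cohen's h for the two proportions, then solve for per-arm sample size
    # at the conventional alpha = 0.05 and 80% power.
    effect = proportion_effectsize(baseline + mde, baseline)
    n_per_arm = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
    )
    print(f"~{n_per_arm:,.0f} users per arm")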
Nice to have:
Experience with LLM or AI agent evaluation—understanding of prompt-response patterns, agent evaluation frameworks, or model quality measurement
Background in high-growth SaaS or product-led growth (PLG) companies with large-scale event data
Experience with modern data stack (BigQuery, dbt, Fivetran, Segment, Hex)
Familiarity with experimentation platforms (LaunchDarkly, Statsig, Eppo, or similar)
Understanding of developer tools or software engineering workflows
Experience building agent or LLM evaluation frameworks from scratch
Experience with causal inference methods such as difference-in-differences, synthetic control, or CUPED (a minimal CUPED sketch follows this list)
Familiarity with real-time data systems or operational analytics for monitoring agent performance
Experience working with trace data, logging systems, or observability tooling
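
For reference, CUPED (mentioned above) reduces metric variance using a pre-experiment covariate, which shrinks the sample sizes experiments need. A minimal sketch on synthetic data, with all numbers assumed for illustration:

    # Minimal CUPED sketch: reduce metric variance with a pre-experiment
    # covariate before comparing arms. Arrays are synthetic; in practice
    # y would be the in-experiment metric and x the same metric measured
    # pre-experiment for the same users.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(10, 3, 5_000)        # pre-period metric (covariate)
    y = x + rng.normal(1, 2, 5_000)     # in-period metric, correlated with x

    # theta = cov(x, y) / var(x); the adjusted metric keeps the same mean
    # as y but strips out variance explained by the pre-period covariate.
    theta = np.cov(x, y)[0, 1] / np.var(x)
    y_cuped = y - theta * (x - x.mean())

    print(f"var(y)       = {y.var():.2f}")
    print(f"var(y_cuped) = {y_cuped.var():.2f}")  # substantially smaller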