Machine Learning Eval Engineer

Reducto

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

150,000 - 300,000 USD / Year

Job Description:

As an ML Eval Engineer, you’ll play a key role in building the evaluation systems and benchmarks that make Reducto’s models better over time. You’ll collaborate closely with our ML, platform, and GTM teams to identify model weaknesses, design strong benchmarks, and create metrics and tooling that surface new failure modes as we scale. This is a high-impact role where you’ll help define how model quality is measured at Reducto and shape the systems we use to improve it.

Job Responsibility:

  • Design, build, and maintain evaluation benchmarks that reveal where our models perform well and where they fail
  • Develop metrics, heuristics, and workflows to automatically identify new failure modes across large and messy real-world datasets
  • Partner closely with other ML engineers to turn evaluation insights into model improvements and better training priorities
  • Work hands-on with unstructured enterprise data, including PDFs, spreadsheets, and other difficult document formats, to uncover edge cases and hard examples
  • Build lightweight internal and user-facing tools, including simple interfaces in Python frameworks like Flask, to help teams inspect results, analyze model behavior, and communicate evaluation outcomes
  • Collaborate with customers and internal teams to understand real-world data needs and create bespoke benchmarks that highlight Reducto’s strengths

Requirements:

  • Hold yourself to a high bar for quality and precision
  • Enjoy solving complex problems and building from first principles
  • Have strong Python skills and can independently build clean, reliable technical solutions
  • Are comfortable working with data infrastructure such as AWS S3 and OLAP or analytics systems like Tinybird
  • Love getting your hands dirty with unstructured data and chasing down difficult failure cases
  • Operate well in fast-changing, high-growth environments
  • Collaborate effectively across technical and non-technical teams
  • Take full ownership from strategy through execution

Nice to have:

  • Bonus points for product and frontend experience
  • Have experience at an early-stage or high-growth startup
  • Have some background in product thinking and can build simple, polished user-facing interfaces
  • Are comfortable working directly with customers to understand their workflows and data needs
  • Have experience in AI/ML, data infrastructure, enterprise software, or document understanding systems
  • Care deeply about combining technical excellence with business impact

What we offer:

  • Unlimited PTO
  • Lunch: Receive a free lunch to eat with your teammates daily at the office
  • Reimbursed Transportation: Provide us with your receipts and we’ll take care of the costs
  • Insurance: Generous health insurance covering medical, dental, and vision
  • Health and Wellness Budget: We provide up to $150/mo reimbursement for health and wellness spending, such as gym memberships, fitness classes, or similar
  • Parental Leave: Work with us to build a leave schedule that works for you and your family

Additional Information:

Job Posted:
March 25, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Machine Learning Eval Engineer

Principal AI Engineer

We are looking for a Principal AI Engineer to lead the design and deployment of ...
Location:
United States
Salary:
200,000 - 300,000 USD / Year
Apollo.io
Expiration Date:
Until further notice
Requirements:
  • 10+ years of software engineering experience
  • at least 3 years in applied LLM or agentic AI systems (2023–present)
  • proven success in deploying LLM-powered products used by real users at scale
  • deep backend & systems engineering expertise with Python, distributed systems, and scalable APIs
  • familiarity with LangChain, LlamaIndex, or similar orchestration frameworks
  • experience with RAG pipelines, vector DBs, embedding models, and semantic search tuning
  • experience managing performance across cloud providers (e.g., AWS Bedrock, OpenAI, Anthropic, etc.)
  • demonstrated experience building multi-step agents, planning workflows, chaining reasoning steps, and integrating APIs with agent memory/state
  • comfort with advanced prompting strategies, few-shot and chain-of-thought reasoning, and embedding retrieval setups
  • strong understanding of AI system evaluation, human ratings, A/B experimentation, and feedback loop pipelines
Job Responsibility:
  • Architect and lead the development of multi-agent systems capable of long-horizon planning, reasoning, and API orchestration
  • build reusable agentic components that integrate deeply into sales and marketing processes
  • own and evolve our in-house platform for scalable, low-latency, and cost-efficient LLM and agent deployments
  • lead design of interfaces powered by natural language understanding and retrieval-augmented generation (RAG)
  • build embedding-based, intent-aware search and personalization systems tuned to business user needs
  • drive innovation in personalized outreach generation using context-aware generation pipelines
  • tune inference pipelines, caching layers, and model selection logic for high-scale, cost-aware performance
  • define and drive robust offline and online testing methodologies (A/B, sandboxing, human evals) across agents and LLM flows
  • architect human-in-the-loop systems and telemetry to improve accuracy, UX, and explainability over time
What we offer:
  • equity
  • company bonus or sales commissions/bonuses
  • 401(k) plan
  • at least 10 paid holidays per year
  • flex PTO
  • parental leave
  • employee assistance program
  • wellbeing benefits
  • global travel coverage
  • life/AD&D/STD/LTD insurance
Employment Type:
Full-time

Lead AI Engineer

As a Lead AI Engineer in our Artificial Intelligence Group, you will be working ...
Location:
Portugal; United Kingdom
Salary:
Not provided
OutSystems
Expiration Date:
Until further notice
Requirements:
  • Ph.D. or MSc degree in Computer Science, Electrical Engineering, Artificial Intelligence, Machine Learning, or related technical fields is highly valued
  • 7+ years of work experience in Machine Learning and AI Engineering roles, preferably building with LLMs and/or Agentic Systems, from development to production
  • Hands-on experience with LLM frameworks (e.g. langchain, langgraph, llamaindex, langsmith), AI platforms (e.g. Databricks, mlflow) and containerization solutions (e.g. Docker) is valued
  • Strong software engineering fundamentals and proficiency in Python are a must
  • Be creative, ambitious and curious. Be resourceful and innovative
Job Responsibility:
  • Stay up to date with the state-of-the-art in Artificial Intelligence and Machine Learning, and apply it to the problem of Enterprise Vibe Coding
  • Make use of techniques such as In-Context Learning (Prompt Engineering & Retrieval Augmented Generation) and Tool Calling to guide Agents to perform as intended
  • Work in an “eval-driven development” methodology, applying evaluation techniques such as model-based critiques, semantic accuracy benchmarks, and tool-use success rates as the basis for developing robust, reliable, and production-ready LLM Agents
  • Work across the entire lifecycle of product development, from ideation, through prototyping and designing to implementation and evaluation
  • Collaborate with cross-functional teams to ship and iterate on production-grade AI products
  • Own the end-to-end development of AI features of a customer-facing product, ensuring software quality, observability and scalability
  • Be a thought leader in the space, advancing the field by putting the most forward-looking ideas into practice in a large-scale project
What we offer:
  • A company that is always growing, changing, and innovating
  • Real career opportunities
  • Work colleagues that are as smart, hard-working, and driven as you
  • Disrupting the status quo is in our DNA
  • We ask “why” a lot
  • Inclusive culture of diversity
Employment Type:
Full-time

Senior AI Engineer

As a Senior AI Engineer in our Artificial Intelligence Group, you will be workin...
Location:
Portugal; United Kingdom
Salary:
Not provided
OutSystems
Expiration Date:
Until further notice
Requirements:
  • Ph.D. or MSc degree in Computer Science, Electrical Engineering, Artificial Intelligence, Machine Learning, or related technical fields is highly valued
  • 5+ years of work experience in Machine Learning and AI Engineering roles, preferably building with LLMs and/or Agentic Systems, from development to production
  • Hands-on experience with LLM frameworks (e.g. langchain, langgraph, llamaindex, langsmith), AI platforms (e.g. Databricks, mlflow) and containerization solutions (e.g. Docker) is valued
  • Strong software engineering fundamentals and proficiency in Python are a must
  • Be creative, ambitious and curious. Be resourceful and innovative.
Job Responsibility:
  • Stay up to date with the state-of-the-art in Artificial Intelligence and Machine Learning, and apply it to the problem of Enterprise Vibe Coding
  • Make use of techniques such as In-Context Learning (Prompt Engineering & Retrieval Augmented Generation) and Tool Calling to guide Agents to perform as intended
  • Work in an “eval-driven development” methodology, applying evaluation techniques such as model-based critiques, semantic accuracy benchmarks, and tool-use success rates as the basis for developing robust, reliable, and production-ready LLM Agents
  • Work across the entire lifecycle of product development, from ideation, through prototyping and designing to implementation and evaluation
  • Collaborate with cross-functional teams to ship and iterate on production-grade AI products
  • Own the end-to-end development of AI features of a customer-facing product, ensuring software quality, observability and scalability
  • Be a thought leader in the space, advancing the field by putting the most forward-looking ideas into practice in a large-scale project.
What we offer:
  • A company that is always growing, changing, and innovating
  • Real career opportunities
  • Work colleagues that are as smart, hard-working, and driven as you
  • Disrupting the status quo is in our DNA
  • We ask “why” a lot
  • Inclusive culture of diversity
Employment Type:
Full-time

Applied Research - Evals & Data

Prime Intellect builds the infrastructure that frontier AI labs build internally...
Location:
United States, San Francisco
Salary:
Not provided
Prime Intellect
Expiration Date:
Until further notice
Requirements:
  • Strong background in machine learning engineering, with experience in post-training, RL, or large-scale model alignment
  • Experience with applied data workflows and evaluation frameworks for large models or agents (e.g., SWE-Bench, HELM, EvalFlow, internal eval pipelines)
  • Deep expertise in distributed training/inference frameworks (e.g., vLLM, sglang, Ray, Accelerate)
  • Experience deploying containerized systems at scale (Docker, Kubernetes, Terraform)
  • Track record of research contributions (publications, open-source contributions, benchmarks) in ML/RL
  • Passion for advancing the state-of-the-art in reasoning, measurement, and building practical, agentic AI systems
Job Responsibility:
  • Advancing Agent Capabilities: Designing and iterating on next-generation AI agents that tackle real workloads—workflow automation, reasoning-intensive tasks, and decision-making at scale
  • Building Robust Infrastructure: Developing the distributed systems, evaluation pipelines, and coordination frameworks that enable these agents to operate reliably, efficiently, and at massive scale
  • Bridge Between Customers & Research: Translating customer needs and insights from applied data into clear technical requirements that guide product and research priorities
  • Prototype in the Field: Rapidly designing and deploying agents, evals, and harnesses alongside customers to validate solutions
  • Customer-Facing Engineering: Work side-by-side with customers to deeply understand workflows, data sources, and bottlenecks
  • Post-training & Reinforcement Learning: Design and implement novel RL and post-training methods (RLHF, RLVR, GRPO, etc.) to align large models with domain-specific tasks
  • Agent Development & Infrastructure: Rapidly prototype and iterate on AI agents for automation, workflow orchestration, and decision-making
What we offer:
  • Competitive Compensation + equity incentives
  • Flexible Work (remote or San Francisco)
  • Visa Sponsorship & relocation support
  • Professional Development budget
  • Team Off-sites & conference attendance
Employment Type:
Full-time

Tech Lead Manager Machine Learning Research Scientist LLM Evals

As the Tech Lead Manager of the LLM Evals Research team, you will lead a talente...
Location:
United States, San Francisco; Seattle; New York
Salary:
280,000 - 380,000 USD / Year
Scale
Expiration Date:
Until further notice
Requirements:
  • 5+ years of hands-on experience in large language model, NLP, and Transformer modeling, in the setting of both research and engineering development
  • Experience and a track record of landing major research impact in a fast-paced environment
  • Experience supporting and leading a team of research scientists and research engineers
  • Excellent written and verbal communication skills
  • Published research in areas of machine learning at major conferences (NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, etc.) and/or journals
  • Previous experience in a customer facing role
Job Responsibility:
  • Lead a team of highly effective research scientists and research engineers on LLM evals
  • Conduct research on the effectiveness and limitations of existing LLM evaluation techniques
  • Design and develop novel evaluation benchmarks for large language models, covering areas such as instruction following, factuality, robustness, and fairness
  • Communicate, collaborate, and build relationships with clients and peer teams to facilitate cross-functional projects
  • Collaborate with internal teams and external partners to refine metrics and create standardized evaluation protocols
  • Implement scalable and reproducible evaluation pipelines using modern ML frameworks
  • Publish research findings in top-tier AI conferences and contribute to open-source benchmarking initiatives
  • Remain up-to-date on ongoing research in the team, help work through technical challenges, and be involved in design decisions
  • Remain deeply involved in the research community, both understanding trends, and setting them
  • Thrive in a high-energy, fast-paced startup environment and are ready to dedicate the time and effort needed to drive impactful results
What we offer:
  • Comprehensive health, dental and vision coverage
  • retirement benefits
  • a learning and development stipend
  • generous PTO
Employment Type:
Full-time

Senior AI Software Developer

The Senior AI Engineer owns end-to-end delivery of AI features—from design to pr...
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or master’s degree in computer science, engineering, data science, machine learning, artificial intelligence, or closely related quantitative discipline
  • Typically, 7-10 years’ experience
  • LLMs & Agents: Prompt engineering, function/tool calling, orchestration frameworks, RAG
  • ML/DS: Evaluation metrics (precision/recall, BLEU/ROUGE where relevant), error analysis
  • Data/RAG: Embeddings, similarity (cosine/IP), chunking, rerankers, vector DB operations
  • Backend: Python (FastAPI/Flask), microservices patterns
  • MLOps/Infra: Docker, Kubernetes, CI/CD, artifact management, GPU scheduling
  • Observability: Metrics/logging/tracing, dashboards, automated evaluation pipelines
  • Frameworks: PyTorch/TensorFlow, Hugging Face, LangChain/LlamaIndex
  • Data: Pandas, SQL/NoSQL, Parquet/Arrow, Kafka/queues
Job Responsibility:
  • Translate high-level designs into clear component contracts, APIs, and service boundaries
  • Implement LLM integrations, RAG pipelines, agents, tool/function calling, and prompt strategies
  • Own feature delivery for sprints/releases
  • maintain high code quality and documentation
  • Fine-tune models when needed
  • design evaluation harnesses and metrics
  • Build A/B testing setups
  • track accuracy, latency, robustness, and task success rates
  • Conduct error analysis
  • iterate using feedback efficacy loops and prompt refinement
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion

Engineering Manager, AI Quality

At Harvey, we’re transforming how legal and professional services operate — not ...
Location:
United States, San Francisco
Salary:
260,000 - 330,000 USD / Year
Harvey
Expiration Date:
Until further notice
Requirements:
  • Significant industry experience leading AI or other related quality efforts (search or ads ranking systems, recommender systems, etc) at industry-leading companies
  • Excellent software engineering skills
  • experience working in teams that built and operated production systems
  • Excellent communication skills, both written and verbal
  • Ability to roll up your sleeves and build team, processes, and technology from the ground up
  • A strong academic or professional background in Machine Learning, Generative AI, Information Retrieval or other related fields
Job Responsibility:
  • Establish an AI & Results Quality Program at Harvey: Establish offline and online eval processes and tools, and a culture of continuous iteration and experimentation
  • Build out core AI quality building blocks that can be reused across different teams and surface areas
  • Own search & retrieval quality, both for AI applications as well as other use cases
  • Work closely with product engineering to continuously improve the quality and capabilities of our AI & search products
  • Work closely with platform engineering to scale our capabilities
What we offer:
  • Offers Equity
  • Offers Bonus
  • Comprehensive health, dental and vision coverage
  • retirement benefits (401k match up to 4%)
  • flexible PTO
Employment Type:
Full-time

Principal Software Engineer, AI

The NL2KQL (Natural Language to Kusto Query Language) team builds AI powered cap...
Location:
United States, Bellevue
Salary:
139,900 - 274,800 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Candidates must be able to meet Microsoft, customer, and/or government security screening requirements for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Leads by example within the team by producing extensible and maintainable code
  • Drives identification of dependencies and the development of design documents. For areas of dependency and overlap with other teams or team members, drives coordination and communicates across teams and resolves conflicts between teams
  • Leads discussions for the architecture of products/solutions and creates proposals for architecture
  • Innovation through experimentation: Initiate and guide experiments to evaluate new technologies and determine best-fit solutions
  • Support Data Scientists and research members in the team by offering a rigorous engineering environment in which to innovate, fail fast, and deliver quickly
  • Develop scalable, high-quality solutions: Build software that is reliable, maintainable, and scalable to meet evolving business needs
  • Support and Develop Others: Mentor team members, encourage inclusive engineering practices, and contribute to building a diverse and talented workforce aligned with our mission
  • Embed operational excellence: Incorporate live site readiness, monitoring, and incident response into the development lifecycle
  • AI-first development: Employ AI throughout the development cycle, embracing the non-deterministic nature of AI products with evals and experimentation
Employment Type:
Full-time