Machine Learning Eval Engineer

United States, San Francisco · 150000.00 - 300000.00 USD / Year · Job Posted March 25, 2026

Job Description

As an ML Eval Engineer, you’ll play a key role in building the evaluation systems and benchmarks that make Reducto’s models better over time. You’ll collaborate closely with our ML, platform, and GTM teams to identify model weaknesses, design strong benchmarks, and create metrics and tooling that surface new failure modes as we scale. This is a high-impact role where you’ll help define how model quality is measured at Reducto and shape the systems we use to improve it.

Job Responsibility

  • Design, build, and maintain evaluation benchmarks that reveal where our models perform well and where they fail
  • Develop metrics, heuristics, and workflows to automatically identify new failure modes across large and messy real-world datasets
  • Partner closely with other ML engineers to turn evaluation insights into model improvements and better training priorities
  • Work hands-on with unstructured enterprise data, including PDFs, spreadsheets, and other difficult document formats, to uncover edge cases and hard examples
  • Build lightweight internal and user-facing tools, including simple interfaces in Python frameworks like Flask, to help teams inspect results, analyze model behavior, and communicate evaluation outcomes
  • Collaborate with customers and internal teams to understand real-world data needs and create bespoke benchmarks that highlight Reducto’s strengths

Requirements

  • Hold yourself to a high bar for quality and precision
  • Enjoy solving complex problems and building from first principles
  • Have strong Python skills and can independently build clean, reliable technical solutions
  • Are comfortable working with data infrastructure such as AWS S3 and OLAP or analytics systems like Tinybird
  • Love getting your hands dirty with unstructured data and chasing down difficult failure cases
  • Operate well in fast-changing, high-growth environments
  • Collaborate effectively across technical and non-technical teams
  • Take full ownership from strategy through execution

Nice to have

  • Bonus points for product and frontend experience
  • Have experience at an early-stage or high-growth startup
  • Have some background in product thinking and can build simple, polished user-facing interfaces
  • Are comfortable working directly with customers to understand their workflows and data needs
  • Have experience in AI/ML, data infrastructure, enterprise software, or document understanding systems
  • Care deeply about combining technical excellence with business impact

What we offer

  • Unlimited PTO
  • Lunch: Receive a free lunch to eat with your teammates daily at the office
  • Reimbursed Transportation: Provide us with your receipts and we’ll take care of the costs
  • Insurance: Generous health insurance covering medical, dental, and vision
  • Health and Wellness Budget: We provide up to $150/mo reimbursement for health and wellness spending, such as gym memberships, fitness classes, or similar
  • Parental Leave: Work with us to build a leave schedule that works for you and your family

Similar Jobs for Machine Learning Eval Engineer

8 matching positions

Senior Machine Learning Engineer

LMArena is seeking a Senior Machine Learning Engineer to help scale and strength...
Location
United States, Bay Area
Salary
Not provided
Arena Intelligence, Inc.
Expiration Date
Until further notice
Requirements
  • Strong programming skills with the ability to work across the stack in a typical recommendation system or LLM stack
  • Experience in deep learning, language models or reward model training
  • Experience working with LLMs for fine-tuning, prompt engineering, function calling, etc.
  • Self-motivated with a willingness to take ownership of tasks
  • A passion for shipping quality products
  • 4+ years of industry experience or relevant projects
  • Solid understanding of statistics, and various tools and methodologies for evaluating uncertainty in a way that is specific to the given product being shipped
Job Responsibility
  • Architect and build what will become our core modeling for data and evaluation products
  • Own the full stack data, model training, and eval pipelines
  • Help grow a culture of feedback and rapid product iteration as we build new features as a tight-knit team
  • Conduct research into state-of-the-art evaluation methods and contribute to the long-term vision for a centralized, scalable evaluation platform
What we offer
  • Comprehensive health and wellness benefits, including medical, dental, vision, and additional support programs
  • The opportunity to work on cutting-edge AI with a small, mission-driven team
  • A culture that values transparency, trust, and community impact
  • Fulltime

Lead Machine Learning Engineer

As a Lead Machine Learning Engineer, you will be the hands-on technical owner of...
Location
India, Mumbai
Salary
Not provided
myGwork - LGBTQ+ Business Community
Expiration Date
Until further notice
Requirements
  • Bachelor's, Master's, or PhD in Computer Science, Mathematics, Data Science, or a related field
  • 5+ years of experience in the ML Engineering and Data Science field, with a focus on LLM and GenAI technologies, particularly in data collection and unstructured data processing
  • 1+ years of experience in a technical lead position
  • Strong expertise in NLP and machine learning, with hands-on experience in classifiers, large language models (LLMs), Model Context Protocol (MCP), Agentic AI, and other advanced NLP techniques
  • Extensive experience with data pipeline and messaging technologies such as Apache Kafka, Airflow, and cloud data platforms (e.g., Snowflake)
  • Expert-level proficiency in Python, SQL, and other relevant programming languages and tools
  • Proficiency in Amazon Web Services (AWS) and Google Cloud Platform (GCP)
  • Strong understanding of cloud-native technologies and containerization (e.g., Kubernetes, Docker) with experience in managing these systems globally
  • Demonstrated ability to solve complex technical challenges and deliver scalable solutions
  • Excellent communication skills with a collaborative approach to working with global teams and stakeholders
Job Responsibility
  • Convert business goals into a clear AI/ML roadmap for data acquisition, extraction, enrichment, and measurable outcomes
  • Architect and ship scalable ML/NLP/LLM (RAG, embeddings, reranking, Agentic AI, MCP) services with high reliability and efficiency
  • Mentor engineers and data scientists through design/code reviews, setting technical standards and elevating craftsmanship
  • Build and integrate classifiers, transformers, LLMs, and evaluators that process and categorize unstructured data at scale
  • Design, operate, and optimize high-throughput collection pipelines with robust orchestration, messaging, storage, and SLAs
  • Partner with Product, Data Collection Engineering, Platform/SRE, and Security to turn ambiguous needs into phased, observable deliveries
  • Pilot and productionize advances in GenAI, Agentic AI, RAG, and MCP to improve quality, speed, and cost
  • Enforce data governance, privacy, and model transparency with least-privilege IAM, secrets management, and auditability
  • Apply Agile/Lean/Fast-Flow practices to reduce cycle time, raise quality, and remove toil via automation
  • Deliver cloud-native solutions on AWS and GCP using Docker/Kubernetes, autoscaling, and progressive delivery patterns
What we offer
  • Hybrid work environment (four days in-office each week in most locations)
  • A range of other benefits are also available to enhance flexibility as needs change
  • Tools and resources to engage meaningfully with your global colleagues
  • Fulltime

Staff Machine Learning Research Scientist, LLM Evals

As a Staff Machine Learning Research Scientist on the LLM Evals team, you will l...
Location
United States, San Francisco; Seattle; New York
Salary
280000.00 - 380000.00 USD / Year
Scale
Expiration Date
Until further notice
Requirements
  • 5+ years of hands-on experience in large language model, NLP, and Transformer modeling, in the setting of both research and engineering development
  • Experience and a track record of landing major research impact in a fast-paced environment
  • Experience tech leading a team of research scientists and research engineers
  • Excellent written and verbal communication skills
  • Published research in areas of machine learning at major conferences (NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, etc.) and/or journals
  • Previous experience in a customer facing role.
Job Responsibility
  • Drive research on the effectiveness and limitations of existing LLM evaluation techniques
  • Design and develop novel evaluation benchmarks for large language models, covering areas such as instruction following, factuality, robustness, and fairness
  • Communicate, collaborate, and build relationships with clients and peer teams to facilitate cross-functional projects
  • Collaborate with internal teams and external partners to refine metrics and create standardized evaluation protocols
  • Implement scalable and reproducible evaluation pipelines using modern ML frameworks
  • Publish research findings in top-tier AI conferences and contribute to open-source benchmarking initiatives
  • Mentor and guide research scientists and engineers, providing technical leadership across cross-functional projects
  • Stay deeply engaged with the ML research community, tracking emerging work and contributing to the advancement of LLM evaluation science
  • Thrive in a high-energy, fast-paced startup environment and are ready to dedicate the time and effort needed to drive impactful results.
What we offer
  • Comprehensive health, dental and vision coverage
  • Retirement benefits
  • A learning and development stipend
  • Generous PTO
  • Equity-based compensation
  • Commuter stipend (may be eligible)
  • Fulltime

Tech Lead Manager Machine Learning Research Scientist LLM Evals

As the Tech Lead Manager of the LLM Evals Research team, you will lead a talente...
Location
United States, San Francisco; Seattle; New York
Salary
280000.00 - 380000.00 USD / Year
Scale
Expiration Date
Until further notice
Requirements
  • 5+ years of hands-on experience in large language model, NLP, and Transformer modeling, in the setting of both research and engineering development
  • Experience and a track record of landing major research impact in a fast-paced environment
  • Experience supporting and leading a team of research scientists and research engineers
  • Excellent written and verbal communication skills
  • Published research in areas of machine learning at major conferences (NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, etc.) and/or journals
  • Previous experience in a customer facing role
Job Responsibility
  • Lead a team of highly effective research scientists and research engineers on LLM evals
  • Conduct research on the effectiveness and limitations of existing LLM evaluation techniques
  • Design and develop novel evaluation benchmarks for large language models, covering areas such as instruction following, factuality, robustness, and fairness
  • Communicate, collaborate, and build relationships with clients and peer teams to facilitate cross-functional projects
  • Collaborate with internal teams and external partners to refine metrics and create standardized evaluation protocols
  • Implement scalable and reproducible evaluation pipelines using modern ML frameworks
  • Publish research findings in top-tier AI conferences and contribute to open-source benchmarking initiatives
  • Remain up-to-date on ongoing research in the team, help work through technical challenges, and be involved in design decisions
  • Remain deeply involved in the research community, both understanding trends, and setting them
  • Thrive in a high-energy, fast-paced startup environment and are ready to dedicate the time and effort needed to drive impactful results
What we offer
  • Comprehensive health, dental and vision coverage
  • Retirement benefits
  • A learning and development stipend
  • Generous PTO
  • Fulltime

Senior Software Engineer, AI Eval

As a Senior Software Engineer on Sentry’s AI/ML team, you’ll be responsible for ...
Location
United States, San Francisco
Salary
240000.00 - 280000.00 USD / Year
Sentry
Expiration Date
Until further notice
Requirements
  • 5+ years of professional experience with a Bachelor’s degree in computer science, machine learning, or a related field
  • Experience building testing, evaluation, or data infrastructure for complex systems (AI/ML experience strongly preferred)
  • Comfort writing production-quality code (we use Python and TypeScript)
  • Experience working with structured and unstructured datasets, labeling workflows, or data quality pipelines
  • Familiarity with modern ML systems and evaluation techniques (e.g., offline metrics, online evaluation, regression testing for models or prompts)
Job Responsibility
  • Design and build robust evaluation frameworks to measure accuracy, reliability, regressions, and edge cases in AI systems
  • Create and curate high-quality datasets, golden test cases, and benchmarks grounded in real production data
  • Build automated test harnesses and metrics pipelines to continuously evaluate models, prompts, and agentic workflows
  • Partner closely with applied AI engineers and product leaders to define what “good” looks like and translate it into measurable criteria
  • Own the evaluation lifecycle for major AI initiatives, from early experimentation through production monitoring
What we offer
  • Incentive compensation
  • Equity grants
  • Paid time off
  • Group health insurance coverage
  • Fulltime

Head of Applied AI & Agent Factory - Managing Director

Location
United States, New York
Salary
250000.00 - 500000.00 USD / Year
Citi
Expiration Date
Until further notice
Requirements
  • 15+ years of progressive leadership experience, including senior roles building and scaling delivery organizations - consulting partnerships, Forward Deployed Engineering / solutions engineering teams, or comparable in-business technology delivery functions inside complex firms.
  • Builder-operator profile — has personally stood up and scaled delivery organizations from a small core to hundreds of practitioners, ideally inside or alongside large, regulated businesses.
  • Deep current expertise in modern AI - agentic systems, LLM-based applications, AI evaluation and certification, and the practical realities of getting AI agents adopted in production enterprise environments.
  • Strong commercial and storytelling skills - credible peer to business COOs and CEOs; able to co-design process redesigns and to sell the change rather than impose it.
  • Bias to action, ruthlessly outcome-focused - relentless on real adoption, usage, and business impact rather than activity metrics; comfortable killing projects that aren't landing.
  • Cross-functional influencer - proven ability to operate in a matrixed environment across product, platform, controls, and business stakeholders without direct authority over all of them.
  • Strong financial acumen - fluent in business cases, ROI realization, and the commercial drivers of a complex global firm.
  • Risk & Compliance awareness - understands controls, model risk, and regulatory expectations as they apply to AI agents in global banking.
Job Responsibility
  • Operate the Agent Factory: an industrialized capability for the design, build, evaluation, and certification of enterprise AI agents at Citi scale.
  • Define and continuously evolve the firm's standards, patterns, and reference implementations for enterprise agents in close partnership with the Head of Core AI Platform (who provides the agentic runtime, guardrails, and evals) and the Head of Responsible AI (who owns approval and controls).
  • Own the official enterprise agent catalog - the curated, certified set of agents that Citi formally sanctions for production use across the firm - including lifecycle management, versioning, and decommissioning.
  • Build and lead a cadre of Applied AI engineers embedded directly into business lines, partnering with business COOs and process owners to design, deliver, and scale AI agents inside real workflows.
  • Co-design agent-led process redesigns with business COOs - moving beyond point automations to the redesign of end-to-end business processes around AI agents.
  • Operate a deliberate 'delivery-to-platform' feedback loop: ensure reusable assets, patterns, and components emerging from business-specific work are fed back into the Core AI Platform and Enterprise AI Solutions portfolio rather than re-built.
  • Partner with the Head of Enterprise AI Solutions to graduate high-leverage, repeatedly built business solutions into productized, horizontal capabilities.
What we offer
  • Medical, dental & vision coverage
  • 401(k)
  • Life, accident, and disability insurance
  • Wellness programs
  • Paid time off packages, including planned time off (vacation), unplanned time off (sick leave), and paid holidays
  • Fulltime

Mid Level GenAI Engineers

We are currently seeking a Mid level GenAI Engineers to join our team in Bangalo...
Location
India, Bangalore
Salary
Not provided
NTT DATA
Expiration Date
Until further notice
Requirements
  • Bachelor's or Master's degree, or equivalent
  • 5+ years in ML engineering, including around 1+ years hands-on with LLMs, GenAI, and agentic frameworks
  • Experience shipping production AI systems on at least one hyperscaler (Azure, AWS, GCP)
  • Experience delivering end-to-end GenAI-based solutions
  • Strong Python experience building AI, ML, and GenAI solutions
  • Experience with agent orchestration using leading frameworks like LangGraph, LangChain, Semantic Kernel, CrewAI, and AutoGen
  • Strong experience with SQL queries and vector databases like Pinecone, Qdrant, and FAISS
  • Experience with hybrid search and re-rankers
  • Experience with evaluation, observability (LangSmith), and human-in-the-loop workflows
  • Strong experience using at least one of the leading hyperscaler services (Azure, AWS, GCP)
Job Responsibility
  • Build GenAI/agentic systems: chat copilots, workflow/graph agents, tool use, memory
  • Implement chunking, hybrid search, vector stores, re-ranking, feedback loops, and continuous data quality evaluation
  • Select, integrate, and fine-tune LLMs and multimodal models
  • Apply prompt engineering techniques to specific use cases and types
  • Experience working on solutions based on LLMs, NLP, deep learning (DL), machine learning (ML), object detection, classification, etc.
  • Should have a good understanding of DevOps
  • Should have a good understanding of LLM evaluation
  • Should have deployed a minimum of 2 models in production (MLOps)
  • Should have an understanding of guardrails, policy filters, PII redaction, runtime monitors, and agent observability
  • Unit testing of the GenAI solutions built and documentation of results
  • Fulltime

AI Engineer II

The AI Center of Excellence (AI CoE) brings together AI Engineers and Data Scien...
Location
India, Pune
Salary
Not provided
Rapid7
Expiration Date
Until further notice
Requirements
  • 2–5 years of experience in AI/ML engineering or software engineering with an AI focus
  • Foundational hands-on experience with LangChain or similar LLM orchestration frameworks
  • Familiarity with prompt engineering concepts and techniques
  • Basic understanding of RAG pipelines - what they are, how retrieval works, and where they're applied
  • Awareness of agentic AI patterns - tool-calling, agent loops, ReAct
  • Exposure to LLM evaluation - understanding what good vs. bad LLM output looks like and how to measure it
  • Working knowledge of AWS Bedrock and/or SageMaker for AI/ML workloads
  • Strong Python skills and a learning-first mindset
  • Working proficiency with pandas, NumPy, scikit-learn
  • Solid understanding of supervised and unsupervised ML, feature engineering, and model evaluation metrics
Job Responsibility
  • Contribute to building agentic AI workflows - tool-calling, basic agent loops, and LLM-driven automation under senior guidance
  • Assist in developing and maintaining RAG pipelines - document ingestion, chunking, embedding, and retrieval
  • Implement and iterate on prompt engineering - few-shot prompting, chain-of-thought, structured outputs
  • Work with LangChain / LangGraph for LLM orchestration and chaining tasks
  • Support LLM evaluation tasks - writing eval datasets, measuring output quality, running benchmarks
  • Contribute to observability and monitoring of LLM systems - latency, token usage, output quality dashboards
  • Deploy and test LLM-powered features on AWS Bedrock, Lambda, and SageMaker
  • Participate in prompt versioning and LLM CI/CD pipelines under guidance of senior engineers
  • Assist with guardrail implementation and output validation for production GenAI systems
  • Learn and apply agentic AI patterns - ReAct, tool-use APIs, and structured output parsing
  • Fulltime