Software Engineer, Applied Evals

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

230000.00 - 325000.00 USD / Year

Job Description:

Applied Evals defines what good looks like for safe, advanced AI systems. We turn complex, high-value workflows into clear, reproducible signals that guide model training and product quality. Our work bridges frontier customers and models, ensuring improvements show up where users experience them. We combine hands-on, unscalable efforts with systems that others can extend, creating a compounding loop of model improvement.

We’re hiring product-minded engineers to design and build evals and harnesses that capture real-world quality for advanced AI systems. You’ll own the loop from prototyping with users to building reliable pipelines and integrating signals into training stacks. This role sits at the center of model improvement. The systems you design will directly shape how models behave, accelerate their reliability, and raise the standard for what customers expect. You’ll collaborate closely with research and product teams and work across the stack, from backend pipelines to user-facing interfaces. The work includes evaluating multi-turn and tool-using systems, designing agent harnesses, and applying reinforcement learning and related methods in production settings.

Engineers who succeed in this role bring both a builder’s mindset and the judgment to create reusable systems that others can build on. Many thrive here by operating like founders or founding engineers, taking initiative, moving quickly, and creating structure where none exists.

Job Responsibility:

  • Define the core evaluation signals that drive model improvement at OpenAI, turning vague product gaps into crisp, defensible measures of quality
  • Design agents, harnesses, and eval pipelines that are reliable, reproducible, and extendable
  • Prototype solutions with real workflows and convert them into scalable feedback loops
  • Connect evaluation signals directly to research and training systems so product improvements show up in what users experience
  • Shape model interaction paradigms by partnering with engineering, research, and product teams on how models are deployed and measured
  • Build reusable systems and tools that enable contributions from across the company and steadily raise the quality bar

Requirements:

  • 4+ years of experience in software engineering with strong fundamentals and a track record of shipping production systems end-to-end
  • Experience building AI agents or applications, including designing evals and improving performance through prompting or scaffolding
  • Familiarity with evaluation methods for LLMs and experience with patterns like multi-agent workflows, tool use, or long context
  • Familiarity with deep learning concepts or prior exposure to training models
  • Ability to communicate clearly with technical and non-technical audiences at all levels
  • Motivation for high-impact collaboration with research and product teams, and the ability to thrive in ambiguity

What we offer:

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Software Engineer, Applied Evals

Senior Software Engineer - Studio - Java, AI

As a Senior Software Engineer, you’ll build the backend that powers AI features ...
Location:
United States, New York
Salary:
175000.00 - 240000.00 USD / Year
Clear Street
Expiration Date:
Until further notice
Requirements:
  • 7+ years of strong proficiency in enterprise Java
  • Experience designing and deploying AI/ML or LLM-backed systems in production
  • Familiarity with LLM tooling and patterns (e.g., tool calling, RAG pipelines and knowledge bases, evals, cost/latency tradeoffs, basic red-teaming)
  • Experience in supporting and running systems in a production environment
  • Comfortable working in a dynamic environment, partnering with cross-functional teams, and moving from prototype to reliable production
Job Responsibility:
  • Design, implement, and productionize reliable AI workflows to augment the Studio trading platform
  • Build tooling to monitor, tune, and evaluate models and workflows, as well as applicable guardrails to ensure outputs meet quality and regulatory requirements
  • Collaborate with technical and non-technical teams across the firm to identify high ROI AI opportunities
  • Build rapid prototypes and translate them into production-grade systems. Utilize the latest AI-powered development tools to iterate quickly
  • Create reusable libraries, SDKs and tooling to enable AI development throughout the firm
  • Stay current on the latest in applied AI. Read papers, evaluate new models, test out new tools
  • Participate in code review and architecture design, manage deployments, and support and contribute to the success of the overall Studio platform
What we offer:
  • Competitive compensation, benefits, and perks
  • Company equity
  • 401k matching
  • Gender neutral parental leave
  • Full medical, dental and vision insurance
  • Lunch stipends
  • Fully stocked kitchens
  • Happy hours
Employment Type:
Full-time

Senior Software Engineer

As a Senior Research Engineer at Microsoft, you will advance Microsoft’s mission...
Location:
India, Hyderabad
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, Mathematics, Statistics, Physics, or a related field and 4 or more years in applied ML or AI research and product engineering
  • Master’s degree and 3 or more years in applied ML or AI research and product engineering
  • PhD in a relevant field and 2 or more years with generative AI, LLMs, or related ML algorithms
  • Proficiency in Python and at least one deep learning framework such as PyTorch, JAX, or TensorFlow
  • Experience deploying fine-tuned LLMs or multimodal models in live production environments
  • Experience shipping and maintaining production AI systems
  • Ability to meet Microsoft, customer, and government security screening requirements
  • Microsoft Cloud Background Check upon hire or transfer and every two years thereafter
Job Responsibility:
  • Bringing State-of-the-Art Research to Products
  • Design and implement AI systems using foundation models, prompt engineering, retrieval-augmented generation, multi-agent architectures, and classic ML
  • Fine-tune large language models on domain-specific data and evaluate via offline and online methods such as A/B testing, telemetry, and shadow deployments
  • Build and harden prototypes into production-ready services using robust software engineering and MLOps practices
  • Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities
  • Research Translation: Continuously review emerging work; identify high-potential methods and adapt them to Microsoft problem spaces
  • End-to-End System Development
  • ML Design & Architecture: Own the end-to-end pipeline, from data prep through training, evaluation, deployment, and feedback loops
Employment Type:
Full-time

Principal AI Engineer

We are looking for a Principal AI Engineer to lead the design and deployment of ...
Location:
United States
Salary:
200000.00 - 300000.00 USD / Year
Apollo.io
Expiration Date:
Until further notice
Requirements:
  • 10+ years of software engineering experience
  • at least 3 years in applied LLM or agentic AI systems (2023–present)
  • proven success in deploying LLM-powered products used by real users at scale
  • deep backend & systems engineering expertise with Python, distributed systems, and scalable APIs
  • familiarity with LangChain, LlamaIndex, or similar orchestration frameworks
  • experience with RAG pipelines, vector DBs, embedding models, and semantic search tuning
  • experience managing performance across cloud providers (e.g., AWS Bedrock, OpenAI, Anthropic, etc.)
  • demonstrated experience building multi-step agents, planning workflows, chaining reasoning steps, and integrating APIs with agent memory/state
  • comfort with advanced prompting strategies, few-shot and chain-of-thought reasoning, and embedding retrieval setups
  • strong understanding of AI system evaluation, human ratings, A/B experimentation, and feedback loop pipelines
Job Responsibility:
  • Architect and lead the development of multi-agent systems capable of long-horizon planning, reasoning, and API orchestration
  • build reusable agentic components that integrate deeply into sales and marketing processes
  • own and evolve our in-house platform for scalable, low-latency, and cost-efficient LLM and agent deployments
  • lead design of interfaces powered by natural language understanding and retrieval-augmented generation (RAG)
  • build embedding-based, intent-aware search and personalization systems tuned to business user needs
  • drive innovation in personalized outreach generation using context-aware generation pipelines
  • tune inference pipelines, caching layers, and model selection logic for high-scale, cost-aware performance
  • define and drive robust offline and online testing methodologies (A/B, sandboxing, human evals) across agents and LLM flows
  • architect human-in-the-loop systems and telemetry to improve accuracy, UX, and explainability over time
What we offer:
  • equity
  • company bonus or sales commissions/bonuses
  • 401(k) plan
  • at least 10 paid holidays per year
  • flex PTO
  • parental leave
  • employee assistance program
  • wellbeing benefits
  • global travel coverage
  • life/AD&D/STD/LTD insurance
Employment Type:
Full-time

Senior Software Engineer, AI Evals

As a Senior Software Engineer on Sentry’s AI/ML team, you’ll be responsible for ...
Location:
United States, San Francisco
Salary:
240000.00 - 280000.00 USD / Year
Sentry
Expiration Date:
Until further notice
Requirements:
  • 5+ years of professional experience with a Bachelor’s degree in computer science, machine learning, or a related field
  • Experience building testing, evaluation, or data infrastructure for complex systems (AI/ML experience strongly preferred)
  • Comfort writing production-quality code (we use Python and TypeScript)
  • Experience working with structured and unstructured datasets, labeling workflows, or data quality pipelines
  • Familiarity with modern ML systems and evaluation techniques (e.g., offline metrics, online evaluation, regression testing for models or prompts)
Job Responsibility:
  • Design and build robust evaluation frameworks to measure accuracy, reliability, regressions, and edge cases in AI systems
  • Create and curate high-quality datasets, golden test cases, and benchmarks grounded in real production data
  • Build automated test harnesses and metrics pipelines to continuously evaluate models, prompts, and agentic workflows
  • Partner closely with applied AI engineers and product leaders to define what “good” looks like and translate it into measurable criteria
  • Own the evaluation lifecycle for major AI initiatives, from early experimentation through production monitoring
What we offer:
  • incentive compensation
  • equity grants
  • paid time off
  • group health insurance coverage
Employment Type:
Full-time

Principal Software Engineer - Teams AI Features

We are looking for a Principal Software Engineer to join our team to drive all a...
Location:
United States, Redmond
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 2+ years of experience on engineering tooling or eval development
  • 2+ years of experience working on services at scale
  • 1+ years of experience driving fundamentals for AI features within web apps
Job Responsibility:
  • Define the vision, strategy, and roadmap for how to evaluate AI features for good fundamentals at scale across Teams
  • Lead end-to-end science and technical design for evaluating LLM-powered agents on real-time and batch workloads: designing evaluation frameworks, metrics, and pipelines that capture planning quality, tool use, retrieval, safety, and end-user outcomes, and partnering with engineering for robust, low-latency deployment
  • Establish rigorous evaluation and reliability practices for LLM/agent systems: from offline benchmarks and scenario-based evals to online experiments and production monitoring, defining guardrails and policies that balance quality, cost, and latency at scale
  • Collaborate with PM, Engineering, and UX to translate evaluation insights into customer-visible improvements, shaping product requirements, de-risking launches, and iterating quickly based on telemetry, user feedback, and real-world failure modes
  • Collaborate and mentor across product, research, and engineering teams, sharing best practices on eval design, LLM-as-judge usage, and Responsible AI, and providing code reviews and guidance that raise the bar for the AI features
  • Provide technical leadership and mentorship within the applied science and engineering community, fostering inclusive, responsible-AI practices in agent evaluation, and influencing roadmap, platform investments, and cross-team evaluation strategy across Fabric
Employment Type:
Full-time

Member of Technical Staff - AI Engineer

Join our AI Engineering team to explore the boundaries of LLMs and AI, shaping T...
Location:
United Kingdom, London
Salary:
Not provided
Tessl
Expiration Date:
Until further notice
Requirements:
  • 4+ years of experience as a Software Engineer
  • Equally comfortable contributing to a mature codebase with strict CI criteria or hacking up a quick notebook to prove/disprove something
  • Proven experience collaborating with researchers and bridging between research-focused and engineering-focused teams
  • Experience with the applied use of data and statistics; you are likely to spot and avoid bad data when you see it
  • Deeply curious about AI and excited about its potential to transform software engineering
Job Responsibility:
  • Use a bit of jq and grep to quickly navigate a dataset, but recognise when it’s time to use a more robust approach and move the team to something like dbt or duckdb
  • Tune a prompt in our generation workflow, eval the results and write an experiment report on your findings. Leave the eval tooling better than when you found it
  • Rapidly prototype a new language integration for our code generation pipeline, then develop a plan for a scalable implementation
  • Factor out a piece of our pipeline to use FaaS, unlocking 1,000x larger evals to run in nearly constant time
  • Add support for a new model in our elegant model abstraction library, or rewrite it when new model capabilities prove our existing design wrong
  • Work with our platform team on the next generation of eval facilities, based on your understanding of what researchers need and where the platform is heading
What we offer:
  • 25 days holiday
  • health insurance, including dental and vision, which extends to partners and dependents
  • company-matched pension
  • commuting stipend for those who live outside London
  • cycle to work scheme
Employment Type:
Full-time

Software Engineer (Full-stack) - Python/React

Axios HQ is an AI-powered software that helps organizations of all sizes plan, w...
Location:
Salary:
Not provided
G2i Inc.
Expiration Date:
Until further notice
Requirements:
  • 5+ years of software engineering
  • 1–2 years building LLM/GenAI features
  • Strong in Python and TypeScript; comfortable full-stack
  • Excels at applying AI-aided practices in development workflows
  • Experience with LangChain (or similar) and vector DBs (FAISS/Pinecone/pgvector)
  • Solid REST API chops (FastAPI/Flask) and modern frontend (React + TS)
  • Familiar with AI evaluation (RAGAS/DeepEval/FMeval) and data pipelines
  • Experience building data pipelines
Job Responsibility:
  • Build and evolve our flagship Axios HQ product end-to-end
  • Ship clean, well-tested code and iterate quickly with product/design
  • Design, deploy, and operate LLM/agent applications (RAG, semantic chunking)
  • Integrate models via AWS Bedrock, Azure OpenAI, Vertex AI, etc.
  • Deliver production-ready UIs and APIs (React + TypeScript, FastAPI/Django REST Framework)
  • Apply LLMOps best practices (versioning, evals, monitoring, cost control)