
Software Engineer, Applied Evals

United States, San Francisco · 230000.00 - 325000.00 USD / Year · Job Posted February 21, 2026

Job Description

Applied Evals defines what good looks like for safe, advanced AI systems. We turn complex, high-value workflows into clear, reproducible signals that guide model training and product quality. Our work bridges frontier customers and models, ensuring improvements show up where users experience them. We combine hands-on, unscalable efforts with systems that others can extend, creating a compounding loop of model improvement.

We’re hiring product-minded engineers to design and build evals and harnesses that capture real-world quality for advanced AI systems. You’ll own the loop from prototyping with users to building reliable pipelines and integrating signals into training stacks. This role sits at the center of model improvement. The systems you design will directly shape how models behave, accelerate their reliability, and raise the standard for what customers expect.

You’ll collaborate closely with research and product teams and work across the stack, from backend pipelines to user-facing interfaces. The work includes evaluating multi-turn and tool-using systems, designing agent harnesses, and applying reinforcement learning and related methods in production settings. Engineers who succeed in this role bring both a builder’s mindset and the judgment to create reusable systems that others can build on. Many thrive here by operating like founders or founding engineers, taking initiative, moving quickly, and creating structure where none exists.

Job Responsibility

  • Define the core evaluation signals that drive model improvement at OpenAI, turning vague product gaps into crisp, defensible measures of quality
  • Design agents, harnesses, and eval pipelines that are reliable, reproducible, and extendable
  • Prototype solutions with real workflows and convert them into scalable feedback loops
  • Connect evaluation signals directly to research and training systems so product improvements show up in what users experience
  • Shape model interaction paradigms by partnering with engineering, research, and product teams on how models are deployed and measured
  • Build reusable systems and tools that enable contributions from across the company and steadily raise the quality bar

Requirements

  • 4+ years of experience in software engineering with strong fundamentals and a track record of shipping production systems end-to-end
  • Experience building AI agents or applications, including designing evals and improving performance through prompting or scaffolding
  • Familiarity with evaluation methods for LLMs and experience with patterns like multi-agent workflows, tool use, or long context
  • Familiarity with deep learning concepts or prior exposure to training models
  • Ability to communicate clearly with technical and non-technical audiences at all levels
  • Motivation to collaborate on high-impact work with research and product teams, and the ability to thrive in ambiguity

What we offer

  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends

Similar Jobs for Software Engineer, Applied Evals

8 matching positions

Staff Software Engineer – Applied AI

Lead the design and delivery of end-to-end AI applications, from discovery and p...
Location: United Arab Emirates, Dubai
Salary: Not provided
Company: Orbis Consultants
Expiration Date: Until further notice
Requirements
  • 7+ years’ experience building production-grade software
  • Strong backend capability (Python preferred, but stack-agnostic mindset)
  • Hands-on experience or strong interest in LLMs / GenAI (LangChain, vector DBs, model tooling, eval frameworks etc.)
  • Comfortable owning projects end-to-end and interacting directly with technical stakeholders
  • Startup mentality – high ownership, adaptable, and excited by ambiguity
Job Responsibility
  • Architecting and deploying custom AI solutions (automation, agents, evaluation frameworks, internal AI tooling)
  • Working directly with senior stakeholders (including CTO-level) on requirements and trade-offs
  • Leading technical direction across projects
  • Shaping engineering standards and culture as the team scales
What we offer
  • Front-row seat to real-world enterprise AI deployment
  • Exposure to a wide range of industries and use cases
  • Senior, high-calibre engineering environment
  • Opportunity to shape a new regional presence in Dubai
Employment type: Full-time

Senior Software Engineer, AI Evals

As a Senior Software Engineer on Sentry’s AI/ML team, you’ll be responsible for ...
Location: United States, San Francisco
Salary: 240000.00 - 280000.00 USD / Year
Company: Sentry
Expiration Date: Until further notice
Requirements
  • 5+ years of professional experience with a Bachelor’s degree in computer science, machine learning, or a related field
  • Experience building testing, evaluation, or data infrastructure for complex systems (AI/ML experience strongly preferred)
  • Comfort writing production-quality code (we use Python and TypeScript)
  • Experience working with structured and unstructured datasets, labeling workflows, or data quality pipelines
  • Familiarity with modern ML systems and evaluation techniques (e.g., offline metrics, online evaluation, regression testing for models or prompts)
Job Responsibility
  • Design and build robust evaluation frameworks to measure accuracy, reliability, regressions, and edge cases in AI systems
  • Create and curate high-quality datasets, golden test cases, and benchmarks grounded in real production data
  • Build automated test harnesses and metrics pipelines to continuously evaluate models, prompts, and agentic workflows
  • Partner closely with applied AI engineers and product leaders to define what “good” looks like and translate it into measurable criteria
  • Own the evaluation lifecycle for major AI initiatives, from early experimentation through production monitoring
What we offer
  • Offers Equity
  • incentive compensation
  • equity grants
  • paid time off
  • group health insurance coverage
Employment type: Full-time

Senior Software Engineer

The Content Services Verticals team is seeking a Senior Software Engineer to dri...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years experience with designing, building, and maintaining complex backend systems
  • 4+ years experience in developing and optimizing RESTful APIs and microservices. Proven ability to create and maintain efficient, reliable backend services.
Job Responsibility
  • Ship high-quality, well-tested, secure, and maintainable code
  • Develop and maintain robust, scalable, and efficient full-stack applications
  • Collaborate closely with cross-functional teams, including product owners, designers, and other engineers, to understand and address business requirements effectively
  • Participate in code reviews, providing constructive feedback and ensuring code quality and adherence to coding standards
  • Contribute ideas for continuous improvement of the tech stack, tools, and development processes
  • Ensure seamless integration of front-end and back-end components, focusing on optimal performance and user satisfaction
  • Work within a world-class engineering team comprising engineers, architects, scientists, and leadership
  • Contribute to a positive and innovative team culture
  • Work closely with the leadership and product owner to help address business needs while maintaining engineering standards and paying down technical debt
  • Experiment with and recommend new technologies that simplify or improve our stack
Employment type: Full-time

Software Engineer, AI

At Monarch, AI is the engine powering intelligent, personalized financial experi...
Location: United States
Salary: Not provided
Company: Monarch Money
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in software engineering, with at least 2 years focused on building and operating production ML/AI systems
  • proven track record of shipping LLM-powered features
  • deep, hands-on expertise in prompt engineering, RAG systems, and evaluation techniques
  • strong fundamentals in machine learning: embeddings, similarity search, classification, and probabilistic reasoning
  • demonstrated experience building and using AI evaluation tooling (e.g., golden sets, rubric scoring, LLM-as-judge)
  • excellent Python skills
  • history of building production-grade AI features and services
  • strong collaboration and communication skills with a sharp product sensibility
  • strategic mindset, comfortable making build-vs-buy decisions and designing features for long-term reliability
Job Responsibility
  • Apply AI to Real Financial Problems: Use GenAI and ML to help users make sense of their money, understand spending patterns, surface actionable insights, or automate tedious financial tasks
  • Choose the Right Tool for Each Problem: Navigate the AI toolkit thoughtfully: know when a well-crafted prompt suffices, when retrieval systems add value, and when custom models are worth the investment
  • Ship with Confidence: Leverage and enhance our sophisticated evaluation framework to ensure AI quality: design test datasets, implement new scorers, and use our Braintrust-based eval system to validate changes before they reach users
  • AI feature development, agent design and orchestration, ML model improvements, evaluation datasets and scorers, prompt engineering, and feature-level quality
What we offer
  • Work wherever you want, as we are a fully remote company
  • Competitive cash and equity compensation
  • Stipend to set up your ideal working environment
  • Competitive Benefit Plans for employees based on your location (e.g. in the US we offer: Medical, dental and vision benefits and the ability to contribute to a 401k plan)
  • Unlimited PTO
  • 3-day weekend every month! We take off the “First Friday” every month
Employment type: Full-time

Software Engineer (Full-stack) - Python/React

Axios HQ is an AI-powered software that helps organizations of all sizes plan, w...
Salary: Not provided
Company: G2i Inc.
Expiration Date: Until further notice
Requirements
  • 5+ years of software engineering
  • 1–2 years building LLM/GenAI features
  • Strong in Python and TypeScript; comfortable full-stack
  • Excels in applying AI-aided development practices in development workflows
  • Experience with LangChain (or similar) and vector DBs (FAISS/Pinecone/pgvector)
  • Solid REST API chops (FastAPI/Flask) and modern frontend (React + TS)
  • Familiar with AI evaluation (RAGAS/DeepEval/FMeval) and data pipelines
  • Experience building data pipelines
Job Responsibility
  • Build and evolve our flagship Axios HQ product end-to-end
  • Ship clean, well-tested code and iterate quickly with product/design
  • Design, deploy, and operate LLM/agent applications (RAG, semantic chunking)
  • Integrate models via AWS Bedrock, Azure OpenAI, Vertex AI, etc.
  • Deliver production-ready UIs and APIs (React + TypeScript, FastAPI/Django REST Framework)
  • Apply LLMOps best practices (versioning, evals, monitoring, cost control)

Principal Software Engineer - Teams AI Features

We are looking for a Principal Software Engineer to join our team to drive all a...
Location: United States, Redmond
Salary: 139900.00 - 274800.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 2+ years of experience on engineering tooling or eval development
  • 2+ years experience in working on services at scale
  • 1+ years experience in driving fundamentals for AI features within web apps
Job Responsibility
  • Define the vision, strategy, and roadmap for how to evaluate AI features for good fundamentals at scale across Teams
  • Lead end-to-end science and technical design for evaluating LLM-powered agents on real-time and batch workloads: designing evaluation frameworks, metrics, and pipelines that capture planning quality, tool use, retrieval, safety, and end-user outcomes, and partnering with engineering for robust, low-latency deployment
  • Establish rigorous evaluation and reliability practices for LLM/agent systems: from offline benchmarks and scenario-based evals to online experiments and production monitoring, defining guardrails and policies that balance quality, cost, and latency at scale
  • Collaborate with PM, Engineering, and UX to translate evaluation insights into customer-visible improvements, shaping product requirements, de-risking launches, and iterating quickly based on telemetry, user feedback, and real-world failure modes
  • Collaborate and mentor across product, research, and engineering teams, sharing best practices on eval design, LLM-as-judge usage, and Responsible AI, and providing code reviews and guidance that raise the bar for the AI features
  • Provide technical leadership and mentorship within the applied science and engineering community, fostering inclusive, responsible-AI practices in agent evaluation, and influencing roadmap, platform investments, and cross-team evaluation strategy across Fabric
Employment type: Full-time

Senior Software Engineer

As a Senior Research Engineer at Microsoft, you will advance Microsoft’s mission...
Location: India, Hyderabad
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, Mathematics, Statistics, Physics, or a related field and 4 or more years in applied ML or AI research and product engineering
  • Master’s degree and 3 or more years in applied ML or AI research and product engineering
  • PhD in a relevant field and 2 or more years with generative AI, LLMs, or related ML algorithms
  • Proficiency in Python and at least one deep learning framework such as PyTorch, JAX, or TensorFlow
  • Experience deploying fine-tuned LLMs or multimodal models in live production environments
  • Experience shipping and maintaining production AI systems
  • Ability to meet Microsoft, customer, and government security screening requirements
  • Microsoft Cloud Background Check upon hire or transfer and every two years thereafter
Job Responsibility
  • Bringing State-of-the-Art Research to Products
  • Design and implement AI systems using foundation models, prompt engineering, retrieval-augmented generation, multi-agent architectures, and classic ML
  • Fine-tune large language models on domain-specific data and evaluate via offline and online methods such as A/B testing, telemetry, and shadow deployments
  • Build and harden prototypes into production-ready services using robust software engineering and MLOps practices
  • Drive original research and thought leadership (whitepapers, internal notes, patents); convert insights into shipped capabilities
  • Research Translation: Continuously review emerging work; identify high-potential methods and adapt them to Microsoft problem spaces
  • End-to-End System Development
  • ML Design & Architecture: Own the end-to-end pipeline, from data prep through training, evaluation, deployment, and feedback loops
Employment type: Full-time

Principal Software Engineer - Copilot Security

Copilot Security is at the core of Microsoft’s mission to deliver trusted, human...
Location: United States, Redmond, WA
Salary: 163000.00 - 296400.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, Go, or Python OR equivalent experience
  • 8+ years in technical engineering roles building large-scale services
  • 8+ years hands-on experience designing and operating security-critical or AI-powered systems at scale, including agentic AI, secure orchestration, or advanced threat defenses
  • Proven ability to design, build, and ship agentic AI features or frameworks
  • Ability to clearly explain complex systems and security concepts to technical and non-technical stakeholders and influence cross-org roadmaps
  • Experience building production agent systems using frameworks such as LangGraph, Amazon Strands SDK, or similar platforms; familiarity with agentic design patterns including tool calling, multi-agent coordination, and secure delegation patterns
  • Hands-on experience with distributed training frameworks (Ray, Slurm, HPC), containerization and orchestration technologies (Docker, Kubernetes) for ML model deployment, and ML lifecycle management in production environments
  • Experience designing evaluation frameworks for LLM-based applications and implementing observability for agent systems using tools such as Phoenix, MLFlow, LangFuse, or custom eval harnesses; understanding of AI safety evaluation methodologies including adversarial testing and red-teaming
Job Responsibility
  • Develop and ship agentic AI-powered security features that protect users from threats such as prompt injection, adversarial manipulation, and abuse of agentic workflows
  • Design and implement secure orchestration frameworks that enable Copilot to safely delegate, coordinate, and execute actions across devices, services, and platforms
  • Invent and apply new intelligent agents that leverage information flow analysis and apply common sense and judgement guardrails for security and privacy
  • Collaborate with product, engineering, security, privacy, and AI teams to drive adoption of agentic security patterns and best practices across Copilot and MAI
  • Monitor key metrics for agentic AI security and innovation, using data-driven insights to improve defenses and enablement
  • Align with central Microsoft security and AI roadmaps, landing platform capabilities in Copilot and MAI consumer scenarios
  • Document secure agentic AI patterns, ensuring they address novel risks, support safe delegation, and enable responsible orchestration of actions
Employment type: Full-time