Member of Technical Staff, LLM Evaluation

Microsoft Corporation

Location:
United States, Mountain View

Contract Type:
Not provided

Salary:

139900.00 - 274800.00 USD / Year

Job Description:

As a Member of Technical Staff, LLM Evaluation, you will develop and implement cutting-edge methodologies to help us evaluate how well Copilot performs in real-world usage scenarios. Users turn to Copilot for all types of endeavors, making it critical that we ensure our AI systems effectively help them meet their needs. Our vision for meeting user needs is expansive and includes not only task completion, but also affective aspects of the experience. You will be responsible for developing new methods to evaluate LLMs, training classifiers, experimenting with data collection techniques, and implementing methodologies to provide real-time signals on Copilot performance. We're looking for outstanding individuals with experience in the social sciences, machine learning, and analysis of natural language. The right candidate is a creative problem solver who will work closely with user researchers and product leaders to build automated evaluation frameworks that help us drive improvements in Copilot.

Job Responsibility:

  • Leverage expertise to measure the performance of Copilot, identify failure modes and novel mitigation strategies, including data mining, prompt engineering, LLM as a judge, and classifier training
  • Solve problems creatively, navigating complexity with clarity, independently shaping direction, and delivering results even when the path isn’t obvious
  • Create and implement comprehensive evaluation frameworks across diverse scenarios, edge cases, and potential failure modes
  • Build automated testing systems, generalize solutions into repeatable frameworks, and write efficient code for model pipelines and intervention systems
  • Maintain a user-oriented perspective by understanding needs from user perspectives, validating approaches through user research, and serving as a trusted advisor on AI matters
  • Track advances in research, identify relevant state-of-the-art techniques, and adapt algorithms to drive innovation in production systems serving millions of users

Requirements:

  • Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 5+ years data-science experience
  • OR Master's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 7+ years data-science experience
  • OR Bachelor's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 10+ years data-science experience
  • OR equivalent experience
  • Experience prompting and working with large language models
  • Experience writing production-quality Python code
  • Demonstrated interest in Responsible AI

Nice to have:

  • Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 8+ years data-science experience
  • OR Master's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 10+ years data-science experience
  • OR Bachelor's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 12+ years data-science experience
  • OR equivalent experience

Additional Information:

Job Posted:
February 10, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for Member of Technical Staff, LLM Evaluation

Member of Technical Staff, Research

As a Member of Technical Staff on the Research team, you’ll push the boundaries ...
Location:
United States, San Mateo
Salary:
175000.00 - 240000.00 USD / Year
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Research background in Artificial Intelligence, Machine Learning, Physics, or similar field
  • Experience solving analytical problems using analytic and quantitative approaches
  • Experience communicating research to audiences with different backgrounds
  • Experience coding in C/C++, Python, or other similar languages
Job Responsibility:
  • Conduct foundational research to advance the capabilities, efficiency, and reliability of LLMs and multimodal systems
  • Design, implement, and evaluate novel model architectures, training methods, and optimization techniques
  • Collaborate with engineering teams to transition research prototypes into production-grade systems
  • Analyze empirical results, identify performance bottlenecks, and iterate quickly to improve model quality
  • Contribute to internal research strategy by identifying high-impact opportunities and emerging trends in AI
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package
Employment Type:
Fulltime

Member of Technical Staff – Model Training

At Inflection AI, our public benefit mission is to harness the power of AI to im...
Location:
United States, Palo Alto
Salary:
175000.00 - 350000.00 USD / Year
Inflection AI
Expiration Date:
Until further notice
Requirements:
  • Have hands-on experience training and fine-tuning large transformer models on multi-GPU / multi-node clusters
  • Are fluent in PyTorch and its ecosystem tools (Torchtune, FSDP, DeepSpeed) and enjoy digging into distributed-training internals, mixed precision, and memory-efficiency tricks
  • Have shipped or published work in RLHF, DPO, GRPO, or RLAIF and understand their practical trade-offs
  • Care deeply about training tools, pipelines, and reproducibility—you automate the boring parts so you can iterate on the fun parts
  • Balance research curiosity with product pragmatism—you know when to run an ablation and when to ship
  • Communicate crisply with both technical and non-technical teammates
  • Have a bachelor’s degree or equivalent in a field related to the position’s requirements
Job Responsibility:
  • Contribute to end-to-end post-training workflows—dataset curation, hyper-parameter search, evaluation, and rollout—using PyTorch, Torchtune, FSDP/DeepSpeed, and our internal orchestration stack
  • Prototype and compare alignment techniques (e.g., curriculum RL, multi-objective reward modeling, tool-use fine-tuning) and push the best ideas into production
  • Automate training at scale: build robust pipeline components, tools, scripts, and dashboards so experiments are reproducible and easy to trace
  • Define the metrics that matter; run A/B tests and iterate quickly to meet aggressive quality targets
  • Collaborate with inference, safety, and product teams to land improvements in customer-facing systems
What we offer:
  • Diverse medical, dental and vision options
  • 401k matching program
  • Unlimited paid time off
  • Parental leave and flexibility for all parents and caregivers
  • Support of country-specific visa needs for international employees living in the Bay Area
  • Competitive stock options

Senior Research Engineer, Model Evaluation

Evaluation is critical to making progress in scaling intelligence. As models con...
Location:
United States; Canada; United Kingdom, Toronto; New York; Seattle; San Francisco; London
Salary:
Not provided
Cohere
Expiration Date:
Until further notice
Requirements:
  • You enjoy pushing the limits of what LLMs are capable of, and you have built high-quality evaluation resources to measure those capabilities (datasets, simulators, environments, etc.)
  • You have a track record of developing new methods and/or data to evaluate LLMs, e.g. publications at top-tier conferences, popular benchmarks, etc.
  • You have deep experience building with and around LLMs, and you have built tools for analyzing and understanding their performance
  • You have strong software engineering skills
Job Responsibility:
  • Develop evaluation benchmarks, datasets, and environments for measuring the bleeding edge of model capabilities
  • Conduct research to push the state of the art in LLM evaluation methods, including training LLM judges, improving evaluation efficiency, and scalably building high-quality datasets
  • Build scalable tools for investigating and understanding evaluation results that are used by all members of technical staff at Cohere, as well as leadership and our CEO
  • Learn from and work with the best researchers and engineers in the field
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment Type:
Fulltime

Member of Technical Staff, Data Analysis and Evaluation

As a Member of Technical Staff in Data Analysis and Evaluation, you will play a ...
Salary:
Not provided
Cohere
Expiration Date:
Until further notice
Requirements:
  • Extremely strong software engineering skills
  • Strong expertise in designing and conducting data collection tasks, including working with human annotators
  • Strong statistical skills and experience evaluating scientific experiments related to data collection and model performance
  • Experience analysing datasets with respect to their quality, biases, and suitability for training ML models
  • Hands-on experience training large language models (LLMs) on distributed training infrastructures
  • Familiarity with evaluating and improving the generalisability and robustness of ML systems
  • Proficiency in programming languages such as Python and ML frameworks (e.g., PyTorch, TensorFlow, JAX)
  • Excellent communication skills to collaborate effectively with cross-functional teams and present findings
  • One or more papers at top-tier venues (such as NeurIPS, ICML, ICLR, AIStats, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)
Job Responsibility:
  • Design and oversee data collection tasks, including supporting human annotators and ensuring data quality
  • Develop and apply statistical methods to evaluate the quality and reliability of datasets
  • Analyse and assess the generalisability and robustness of ML systems across diverse use cases
  • Collaborate with teams to improve dataset quality and model performance
  • Train and fine-tune large language models (LLMs) on distributed training infrastructures
  • Conduct experiments to evaluate model performance and identify areas for improvement
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment Type:
Fulltime

Member of Technical Staff - Post Training, Applied

This is a rare chance to sit at the intersection of frontier foundation models a...
Location:
United States, San Francisco; Boston
Salary:
Not provided
Liquid AI
Expiration Date:
Until further notice
Requirements:
  • Hands-on experience with data generation and evaluation for LLM post-training
  • Experience training or fine-tuning models using SFT, preference alignment, and/or RL
  • Strong intuition for data quality and evaluation design
  • Familiarity with alignment or RL techniques beyond basic supervised fine-tuning
Job Responsibility:
  • Act as the technical owner for enterprise customer post-training engagements
  • Translate customer requirements into concrete post-training specifications and workflows
  • Design and execute data generation, filtering, and quality assessment processes
  • Run supervised fine-tuning, preference alignment, and reinforcement learning workflows
  • Design task-specific evaluations, interpret results, and feed learnings back into core post-training pipelines
What we offer:
  • Competitive base salary with equity in a unicorn-stage company
  • We pay 100% of medical, dental, and vision premiums for employees and dependents
  • 401(k) matching up to 4% of base pay
  • Unlimited PTO plus company-wide Refill Days throughout the year
Employment Type:
Fulltime

Member of Technical Staff, Applied AI Engineer

We’re hiring an Applied AI Engineer to join a fast‑moving, high‑ownership team bu...
Location:
United States, Mountain View
Salary:
100600.00 - 199000.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field
  • OR Master's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 1+ year(s) data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results) or consulting experience
  • OR Bachelor's Degree in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or related field AND 2+ years data-science experience (e.g., managing structured and unstructured data, applying statistical techniques and reporting results)
  • OR equivalent experience.
  • 2+ years shipping production-level code, models, or data analysis.
  • 1+ years using AI-assisted coding and analysis techniques.
  • Experience working on small teams and mid-stage startup environments.
  • Experience working on AI products.
  • PhD in engineering, applied math, statistics, or related analytical field.
  • 4+ years shipping production-level code, models, or data analysis.
  • Deep experience building from zero-to-one.
  • Hands-on work hillclimbing AI evaluations.
Job Responsibility:
  • LLM Feature & Agent Development
  • Design and ship LLM‑powered assistant features, including conversational flows, agentic behaviors, retrieval pipelines, and multimodal interactions.
  • Build prompt architectures, system instructions, and orchestration logic that ensure reliability, grounding, and personality consistency.
  • Prototype new capabilities rapidly and iterate based on user signals and evaluation data.
  • Evaluation, Hillclimbing & Quality Systems
  • Build and maintain evaluation frameworks for correctness, safety, grounding, and UX quality.
  • Run hillclimbing loops across prompts, models, and tool‑use strategies to continuously improve assistant performance.
  • Analyze failure modes, design mitigations, and drive systematic improvements across the stack.
  • LLM Tooling & Internal Infrastructure
  • Develop internal tools for prompt experimentation, model comparison telemetry and debugging automated eval pipelines
Employment Type:
Fulltime

Member of Technical Staff, Senior/Staff MLE

This is not a typical “Applied Scientist” or “ML Engineer” role. As a Member of ...
Location:
United States; Canada, San Francisco; New York; Toronto; Montreal
Salary:
Not provided
Cohere
Expiration Date:
Until further notice
Requirements:
  • Strong ML fundamentals and the ability to frame complex, ambiguous problems as ML solutions
  • Fluency with Python and core ML/LLM frameworks
  • Experience working with large-scale datasets and distributed training or inference pipelines
  • Understanding of LLM architectures, tuning techniques (CPT, post-training), and evaluation methodologies
  • Demonstrated ability to meaningfully shape LLM performance
  • Experience engaging directly with customers or stakeholders to design and deliver ML-powered solutions
  • A track record of technical leadership at a team level
  • A broad view of the ML research landscape and a desire to push the state of the art
  • Bias toward action, high ownership, and comfort with ambiguity
  • Humility and strong collaboration instincts
Job Responsibility:
  • Lead the design and delivery of custom LLM solutions for enterprise customers
  • Translate ambiguous business problems into well-framed ML problems with clear success criteria and evaluation methodologies
  • Build custom models using Cohere’s foundation model stack, CPT recipes, post-training pipelines (including RLVR), and data assets
  • Develop SOTA modeling techniques that directly enhance model performance for customer use-cases
  • Contribute improvements back to the foundation-model stack — including new capabilities, tuning strategies, and evaluation frameworks
  • Work closely with enterprise customers to identify high-value opportunities where LLMs can unlock transformative impact
  • Provide technical leadership across discovery, scoping, modeling, deployment, agent workflows, and post-deployment iteration
  • Establish evaluation frameworks and success metrics for custom modeling engagements
  • Mentor engineers across distributed teams
  • Drive clarity in ambiguous situations, build alignment, and raise engineering and modeling quality across the organization
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment Type:
Fulltime

Member of Technical Staff, MLE

This is not a typical “Applied Scientist” or “ML Engineer” role. As a Member of ...
Location:
Singapore
Salary:
Not provided
Cohere
Expiration Date:
Until further notice
Requirements:
  • Strong ML fundamentals and the ability to frame complex, ambiguous problems as ML solutions
  • Fluency with Python and core ML/LLM frameworks
  • Experience working with (or the ability to learn) large-scale datasets and distributed training or inference pipelines
  • Understanding of LLM architectures, tuning techniques (CPT, post-training), and evaluation methodologies
  • Demonstrated ability to meaningfully shape LLM performance
  • A broad view of the ML research landscape and a desire to push the state of the art
  • Bias toward action, high ownership, and comfort with ambiguity
  • Humility and strong collaboration instincts
  • A deep conviction that AI should meaningfully empower people and organizations
Job Responsibility:
  • Contribute to the design and delivery of custom LLM solutions for enterprise customers
  • Translate ambiguous business problems into well-framed ML problems with clear success criteria and evaluation methodologies
  • Build custom models using Cohere’s foundation model stack, CPT recipes, post-training pipelines (including RLVR), and data assets
  • Develop SOTA modeling techniques that directly enhance model performance for customer use-cases
  • Contribute improvements back to the foundation-model stack — including new capabilities, tuning strategies, and evaluation frameworks
  • Work as part of Cohere’s customer-facing MLE team to identify high-value opportunities where LLMs can unlock transformative impact for our enterprise customers
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment Type:
Fulltime