You will safeguard the quality of our AI and GenAI features by evaluating model outputs, creating “golden” datasets, and guiding continuous improvements in collaboration with data scientists and engineers. You will also guide the team as it builds the robust methodology and framework that will drive evaluation of hundreds of AI agents.
Job Responsibilities:
Evaluation Frameworks – Develop reusable, automated evaluation pipelines using frameworks such as Ragas, and integrate LLM-as-a-judge methods for scalable assessments (a judge sketch follows this list)
Golden Datasets – Build and maintain high-quality benchmark datasets in collaboration with subject matter experts
AI Output Validation – Evaluate results across text, documents, audio, and video, using both automated metrics and human-in-the-loop judgment
Metric Evaluation – Implement and track metrics such as precision, recall, F1 score, relevance scoring, and hallucination penalties (a metrics sketch follows this list)
RAG & Embeddings – Design and evaluate retrieval-augmented generation (RAG) pipelines, vector embedding similarity, and semantic search quality (a similarity sketch follows this list)
Error & Bias Analysis – Investigate recurring errors, biases, and inconsistencies in model outputs, and propose solutions
Framework & Tooling Development – Build tools that enable large-scale model evaluation across hundreds of AI agents
Cross-Functional Collaboration – Partner with ML engineers, product managers, and QA peers to integrate evaluation frameworks into product pipelines
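To make the LLM-as-a-judge idea concrete, here is a minimal sketch. The `call_llm` function, the rubric wording, and the 1–5 scale are illustrative assumptions, not this team's actual stack; swap in whichever model client you use.

```python
# Sketch of an LLM-as-a-judge scorer. `call_llm` is a hypothetical
# placeholder for a real model client; wire up your provider's SDK.
import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Return JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("hypothetical: connect your model client here")

def judge(question: str, reference: str, candidate: str) -> dict:
    """Ask a judge model to grade a candidate answer on a 1-5 scale."""
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return json.loads(raw)  # e.g. {"score": 4, "reason": "..."}
```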
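The metric tracking described above can be as simple as scoring binarized model judgments against golden labels. This sketch uses scikit-learn's precision_recall_fscore_support; the labels are purely hypothetical examples.

```python
# Minimal sketch: scoring model outputs against a golden dataset.
# The golden labels and model judgments below are hypothetical.
from sklearn.metrics import precision_recall_fscore_support

# 1 = output judged correct against the golden answer, 0 = incorrect.
golden_labels = [1, 1, 0, 1, 0, 1]   # ground truth from the golden dataset
model_labels  = [1, 0, 0, 1, 1, 1]   # binarized judgments of model outputs

precision, recall, f1, _ = precision_recall_fscore_support(
    golden_labels, model_labels, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```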
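And for embedding similarity and retrieval quality, a bare-bones NumPy sketch: the vectors are random stand-ins for real embeddings, and index 42 is an arbitrary "known relevant" document; a real pipeline would embed text with an actual model and vector store.

```python
# Toy sketch of embedding-similarity scoring for retrieval evaluation.
# Vectors are random stand-ins; real pipelines embed text with a model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
query = rng.normal(size=384)            # pretend query embedding
corpus = rng.normal(size=(100, 384))    # pretend document embeddings

# Rank documents by similarity and check whether the known-relevant
# document (index 42, purely illustrative) lands in the top k.
scores = [cosine_similarity(query, doc) for doc in corpus]
top_k = np.argsort(scores)[::-1][:5]
print("recall@5 hit:", 42 in top_k)
```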
Requirements:
4+ years of experience as a Software Development Engineer in AI/ML systems
Strong coding skills in Python (evaluation pipelines, data processing, metrics computation)
Hands-on experience with evaluation frameworks (Ragas or equivalent)
Knowledge of vector embeddings, similarity search, and RAG evaluation
Familiarity with evaluation metrics (precision, recall, F1, relevance, hallucination detection; a toy hallucination check follows this list)
Understanding of LLM-as-a-judge evaluation approaches
Strong analytical and problem-solving skills, with the ability to combine human judgment with automated evaluations
Bachelor’s or Master’s degree in Computer Science, Data Science, or related field
Strong English written and verbal communication skills
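As a crude illustration of the hallucination-detection requirement above, here is a toy support-overlap check. The word-length filter and threshold are arbitrary; real evaluation would typically use an entailment model or an LLM judge rather than this heuristic.

```python
# Toy hallucination heuristic: flag answer sentences whose content words
# are poorly supported by the retrieved context. Thresholds are arbitrary.
def unsupported_sentences(answer: str, context: str, min_overlap: float = 0.5):
    context_words = set(context.lower().split())
    flagged = []
    for sentence in answer.split("."):
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence.strip())
    return flagged  # sentences that may be hallucinated

print(unsupported_sentences(
    "The report was filed in 2021. It won a Pulitzer.",
    "The annual report was filed in 2021 by the finance team."))
```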
Nice to have:
Experience in data quality, annotation workflows, dataset curation, or golden set preparation