We’re hiring a Senior AI Systems Engineer to support our emerging product, Night Shift, an AI research assistant that amplifies the impact of investigators by automating the tedious, repetitive steps involved in working a case. This role sits within the Machine Learning team, and you will work closely with partners in Engineering (Backend, Frontend, and Design) in a fast-paced environment. You will be one of the earliest technical contributors to our system architecture for agentic AI and will own our AI evaluation framework. The outcome we’re after is clear and ambitious: measurably faster, more accurate leads for every officer and every shift.
Job Responsibilities:
Immerse yourself in the current system design and agent/tooling landscape, and understand the core customer use cases and data flows
Support the team by shipping a few quick wins (e.g., refining tool APIs, prompt engineering, fixing bugs)
Stand up the foundational eval and observability scaffolding (datasets, metrics, KPIs, reporting)
Propose a technical architecture and implementation plan for an agent evaluation framework
Deliver the MVP evaluation harness to produce initial metrics, enable debugging, and support regression testing (see the harness sketch after this list)
Take on a system feature and demonstrate measurable improvement against your MVP evaluation suite
Productionize the evaluation and observability platform and make it the source of truth for quality and safety (e.g., online/offline tracing, alerting, dashboards, evaluations, and a PR-gated regression suite)
Own the roadmap for evolving the agent evaluation platform
Lead deeper R&D threads (e.g., lightweight fine-tuned projection layers, specialized embeddings, multimodal understanding) that can improve system performance on core metrics
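To make the shape of that MVP evaluation harness concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration, not a detail of Night Shift's actual system: the run_agent stub, the cases.jsonl dataset, the exact-match metric, and the 0.8 threshold are all placeholders.

```python
"""Minimal offline eval-harness sketch (illustrative only).

Runs an agent over a small JSONL dataset, scores each answer with a toy
exact-match metric, and exits non-zero when the mean score falls below a
threshold -- the basic shape of a PR-gated regression check.
"""
import json
import sys
from pathlib import Path
from typing import Callable


def exact_match(predicted: str, expected: str) -> float:
    # Toy metric: 1.0 when normalized strings agree, else 0.0.
    return float(predicted.strip().lower() == expected.strip().lower())


def run_suite(agent: Callable[[str], str], dataset: Path, threshold: float = 0.8) -> int:
    scores = []
    with dataset.open() as f:
        for line in f:
            case = json.loads(line)  # one case per line: {"input": ..., "expected": ...}
            prediction = agent(case["input"])
            score = exact_match(prediction, case["expected"])
            scores.append(score)
            if score < 1.0:
                # Surface failures for debugging, one of the harness's jobs.
                print(f"FAIL input={case['input']!r} got={prediction!r}")
    if not scores:
        return 1
    mean = sum(scores) / len(scores)
    print(f"mean score: {mean:.3f} over {len(scores)} cases")
    return 0 if mean >= threshold else 1  # non-zero exit blocks the PR


if __name__ == "__main__":
    # Stand-in for the real agent under test.
    def run_agent(query: str) -> str:
        return "stub answer"

    sys.exit(run_suite(run_agent, Path("cases.jsonl")))
```

In a CI setup, the non-zero exit code is what turns this from a report into a gate: the same suite runs on every pull request, and a regression below the threshold fails the build.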
Requirements:
Familiarity with agentic systems: hands-on experience with LLM agents, including LLM API use (e.g., LangChain/LangGraph, vLLM, OpenAI/Gemini/Anthropic APIs); see the tool-calling sketch after this list
Agent design: tool use (e.g., via MCP), retrieval, memory, grounding/attribution for claims, and guardrails
Architectural patterns: planning and hand-off for multi-agent systems, and context management
Experience with LLM evaluations at scale: you’ve built offline/online eval harnesses and are familiar with the methodologies and metrics used to measure search, retrieval, and recommendation performance (see the metrics sketch below)
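As one deliberately minimal illustration of the LLM API and tool-use experience described above, the sketch below uses the OpenAI Python SDK's chat-completions tool-calling interface. The lookup_case_file tool, its schema, and the model name are hypothetical stand-ins chosen for this example, not part of Night Shift.

```python
"""Minimal single-turn tool-use loop (OpenAI Python SDK, illustrative only)."""
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def lookup_case_file(case_id: str) -> str:
    # Hypothetical tool; a real agent would query a case-management system.
    return json.dumps({"case_id": case_id, "status": "open"})


tools = [{
    "type": "function",
    "function": {
        "name": "lookup_case_file",
        "description": "Fetch summary data for a case by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"case_id": {"type": "string"}},
            "required": ["case_id"],
        },
    },
}]

messages = [{"role": "user", "content": "What is the status of case 42?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
message = response.choices[0].message

# If the model chose to call the tool, execute it and send the result back.
if message.tool_calls:
    messages.append(message)
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        result = lookup_case_file(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)
else:
    print(message.content)
```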
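On the measurement side, here is a short sketch of two standard retrieval metrics, recall@k and mean reciprocal rank (MRR), computed over ranked result lists. The data shapes and example values are illustrative assumptions, not a prescribed interface.

```python
"""Standard retrieval metrics sketch: recall@k and MRR (illustrative)."""
from typing import Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    # Fraction of the relevant documents that appear in the top-k results.
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    # Average of 1/rank of the first relevant result per query (0 if none found).
    total = 0.0
    for ranked_ids, relevant_ids in runs:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0


if __name__ == "__main__":
    runs = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d9"], {"d5"})]
    print(recall_at_k(runs[0][0], runs[0][1], k=2))  # 1.0: the one relevant doc is in the top 2
    print(mean_reciprocal_rank(runs))                # (1/2 + 0) / 2 = 0.25
```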