Job Description
Infer is building the operating system for insurance agencies. We make AI agents (including voice agents) that handle the work agencies have always done by hand: qualifying inbound leads, helping producers during live calls, auditing calls afterward, running renewals, and bringing churned customers back. Our long bet is that AI eventually sells insurance directly. Agencies are the wedge because that is where the work, the data, and the customer relationships actually live. Get good there, and the rest follows.

We are a YC company and have raised from Stellaris Venture Partners and others. The founders are Vaibhav, Urvin, and Suneel. Vaibhav was an architect and AI researcher (at Purdue) and is now a licensed insurance agent. Urvin worked at BCG and is a surfer with six-pack abs. Suneel is an IITian and a philomath.

About the role

We're hiring an Applied AI Engineer to own the system that tells us whether our voice agents are getting better, and to keep them getting better on their own.

Voice quality is the product. If an agent stutters, hallucinates a quote, or misses a disclosure, we lose trust, deals, and sometimes compliance footing. The system that catches all of that before customers do is the most important infrastructure we will build this year.

Today we run thousands of conversations a day with real prospects. We need a harness that scores every change end to end, a benchmark suite that runs against any new model the day it drops, a red-team pipeline that probes our agents for failure modes, and self-improvement loops that feed production failures back into the eval set.

This is an evals and infrastructure role with deep LLM work. You will touch audio, but the center of gravity is the harness and the loops around it. Think of the harness as CI for voice conversations: it runs synthetic and real calls through our stack and scores agent behavior at every layer (STT, LLM, tools, TTS, full call outcomes), so we catch regressions before customers do.
New models are coming out every few weeks, so the question is not just whether ours is good today, but whether we can tell within a week if a new open-source release should replace it.

What success looks like

Day 30
You understand how our agents work across prompts, tools, evals, telephony, and customer systems.
You have shipped a v1 of evals with at least one end-to-end metric the team trusts.
You are sitting in on customer call reviews and tagging failure modes by hand to learn where the real problems live.
You have one new model (open or closed) benchmarked against our production stack with numbers we can defend.

Day 60
The eval system runs on every update and blocks merges that regress on a known set of cases.
We have a first red-team suite covering at least three classes of failure modes (jailbreaks, hallucinated quotes, compliance), running on a schedule.
Hard-case mining from production calls is automated, so the eval set grows without anyone triaging every example by hand.
At least one open-source model (Qwen, DeepSeek, or similar) is benchmarked against our production stack with a defensible recommendation on whether to switch.

Day 90
We can swap in any new LLM and have a numbers-backed answer on whether to ship it within a week.
DSPy or GEPA-style prompt optimization is running over at least one production voice flow, and you have shown measurable lift.
Self-improvement v1 is live for at least one failure pattern. The same problem does not get solved twice because the system feeds the fix back into the platform.
You are spotting failure patterns across customer accounts and turning them into product fixes the rest of the team builds on.