This list contains only the countries for which job offers have been published in the selected language (e.g., in the French version, only job offers written in French are displayed, and in the English version, only those in English).
As an AI Engineer: Voice Designer, you’ll own the back-end implementation and linguistic optimization of the Text-to-Speech (TTS) layer for our next-generation AI voice agents. You’ll work squarely within our Speech Team—a high-impact R&D and engineering group focused on speech recognition, enhancement, and synthesis. You will bridge the gap between core speech science and product engineering, ensuring our voice agents sound human, context-aware, and trustworthy. You’ll also help create the systems that manage voice personas, tone, and conversational fillers, eventually exposing these as tweakable parameters to our customer-facing UI.
Job Responsibility
Own the back-end implementation and linguistic optimization of the Text-to-Speech (TTS) layer for our next-generation AI voice agents
Own the integration and optimization of multiple TTS vendor APIs while leading research and prototyping for open-source or in-house TTS architectures
Apply expertise in phonetics and sociolinguistics to ensure TTS input is formatted for maximum naturalness, including SSML orchestration and pronunciation handling
Craft context-specific utterances to optimize turn handling and build caller trust during agentic thought processes
Design and manage LLM and TTS prompts and parameters to define and refine agent personalities across different industry verticals
Architect the logic to expose voice attributes (speed, pitch, tone, style) to the product UI, allowing customers to customize their agent’s voice profile
Partner with ASR and Audio AI engineers to ensure end-to-end voice quality and minimize latency in the ASR to LLM to TTS pipeline
Requirements
Strong Python programming skills and experience with deep learning frameworks (e.g. PyTorch)
3+ years of experience in Speech Synthesis (TTS) or Voice Design, including hands-on work with frameworks like NVIDIA NeMo, ESPnet, or Coqui, and hands-on experience with major TTS APIs such as ElevenLabs, Rime, and Cartesia
Degree in Computational Linguistics, Computer Science, or AI/ML with a deep understanding of phonetics, prosody, and syntax
Proven experience crafting and evaluating LLM prompts (system, few-shot) and managing structured prompt templates
Experience building production-grade APIs and integrating multi-vendor services in a cloud environment (GCP preferred)
Knowledge of speech quality metrics (MOS, intelligibility, latency) and the ability to design rigorous A/B tests for voice personas