Explore the cutting-edge field of Multimodal Speech Engineering, a discipline at the intersection of artificial intelligence, human-computer interaction, and robotics. Multimodal Speech Engineer jobs are central to creating the next generation of intelligent systems that communicate naturally with humans. Professionals in this domain develop sophisticated AI that doesn't just process spoken words but understands and generates speech within a rich context of visual cues, environmental sounds, and physical expression. Their work is fundamental to building lifelike digital assistants, advanced robotics, immersive entertainment, and accessible technologies.

Typically, a Multimodal Speech Engineer focuses on designing and implementing complex AI models that integrate multiple data streams. Common responsibilities include architecting and training neural networks that fuse audio (speech recognition and synthesis), visual data (from cameras or visual context), and sometimes other sensory inputs such as spatial audio or motion data. They build systems where speech generation is dynamically influenced by what the AI "sees" and "hears" in its environment, enabling appropriate, context-aware responses. A key aspect of the role involves synchronizing generated speech with non-verbal elements, such as realistic lip movements on an avatar or expressive gestures in a robot, to create coherent and believable interactions. Engineers in this field also spend significant time constructing large-scale multimodal data pipelines for training, continuously iterating on models to improve their naturalness, emotional resonance, and reliability.

The typical skill set for these roles is highly interdisciplinary. A strong foundation in deep learning, with specific expertise in speech processing (ASR, TTS) and computer vision, is essential.
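To make the fusion idea concrete, here is a minimal sketch of the kind of model described above: audio and visual feature streams are projected into a shared embedding space, concatenated into one token sequence, and mixed by a transformer encoder. This is an illustrative toy in PyTorch, not any particular production architecture; the dimensions (80-dim audio features, 512-dim visual features) are arbitrary placeholder choices.

```python
import torch
import torch.nn as nn

class SimpleFusionModel(nn.Module):
    """Toy multimodal fusion sketch: project each modality into a shared
    embedding space, concatenate the tokens, and let a transformer encoder
    exchange information across modalities via self-attention."""

    def __init__(self, audio_dim=80, visual_dim=512, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, audio, visual):
        # audio: (batch, T_audio, audio_dim), e.g. log-mel frames
        # visual: (batch, T_visual, visual_dim), e.g. per-frame CNN features
        tokens = torch.cat(
            [self.audio_proj(audio), self.visual_proj(visual)], dim=1
        )
        # output: (batch, T_audio + T_visual, d_model)
        return self.encoder(tokens)

model = SimpleFusionModel()
fused = model(torch.randn(2, 100, 80), torch.randn(2, 25, 512))
print(fused.shape)  # torch.Size([2, 125, 256])
```

In a real system the fused representation would feed downstream heads (speech synthesis, gesture prediction, response generation); this sketch only shows the early-fusion step.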
Proficiency in frameworks like PyTorch or TensorFlow and experience with multimodal fusion techniques (e.g., cross-modal attention, transformer architectures) are standard requirements. Software engineering best practices are crucial for deploying real-time, low-latency systems. Furthermore, successful candidates often possess a creative problem-solving mindset, as they tackle open-ended challenges in making interactions feel intuitive and engaging. An understanding of conversational AI principles, linguistics, or human-robot interaction can be a significant advantage.

The demand for Multimodal Speech Engineer jobs is rapidly growing within industries focused on AI companions, social robotics, automotive voice interfaces, virtual reality, and next-generation customer service platforms. It is a career for those passionate about dissolving the barrier between humans and machines, crafting interactions that are not just functional but truly natural and empathetic. If you are driven to build the future of communication where AI understands tone, context, and unspoken cues, exploring opportunities in Multimodal Speech Engineering is your next step.
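As a quick illustration of the cross-modal attention technique mentioned above, the PyTorch snippet below lets an audio sequence attend over a visual sequence: audio frames supply the queries, visual frames supply the keys and values, so each audio step pulls in the visual context most relevant to it. The sequence lengths and embedding size are arbitrary example values.

```python
import torch
import torch.nn as nn

# Cross-modal attention: audio queries attend over visual keys/values.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

audio = torch.randn(2, 100, 256)   # (batch, audio frames, embed_dim)
visual = torch.randn(2, 25, 256)   # (batch, video frames, embed_dim)

# Each audio frame gathers a weighted summary of the visual stream.
fused, weights = attn(query=audio, key=visual, value=visual)

print(fused.shape)    # torch.Size([2, 100, 256])  audio enriched with visual context
print(weights.shape)  # torch.Size([2, 100, 25])   attention over visual frames
```

In transformer terms this is the same building block used in encoder-decoder attention; stacking such layers in both directions (audio-to-visual and visual-to-audio) is one common way fusion architectures are composed.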