Staff MLOps Engineer Job at Inworld AI (Vancouver)

Staff MLOps Engineer

At Inworld, we’re building the AI framework behind the next generation of real-t...

Location

United States, Mountain View

Salary:

180000.00 - 280000.00 USD / Year

Inworld AI

Expiration Date

Until further notice

Requirements

7+ years of software engineering experience, with 5+ years of infrastructure-as-code experience
Proficiency in managing Kubernetes clusters and applications, including creating Helm charts/Kustomize manifests for new applications
Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
Proficient in at least one backend programming/scripting language such as Golang, Python, or Bash
Knowledge of SLURM or similar job schedulers for distributed training
Experience with data pipeline and workflow management tools
Desire to work at a fast-growing Series A startup, comfortable with uncertainty, owning and scaling new products, and embracing an experimental and iterative development process
In-office location: Mountain View, CA, United States. You must be available for hybrid work

Job Responsibility

Build and scale MLOps systems to streamline the end-to-end ML model lifecycle on the Inworld AI platform, from training to deployment
Design and implement robust model training, evaluation, and release pipelines
Collaborate cross-functionally with ML and backend teams to design, deploy, and maintain scalable, secure infrastructure for Inworld’s AI Engine and Studio
Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
Identify and implement opportunities to enhance engineering speed and efficiency
Provide technical leadership in ML engineering best practices, raise the technical bar, and mentor junior engineers in MLOps principles

What we offer

equity and benefits

Fulltime

Sr Staff ML Engineer - Production & MLOps Focus - GenAI Security Platform

Join our team building a cutting-edge multi-tenanted GenAI Security Platform tha...

Location

India, Bengaluru

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

4+ years of ML engineering experience with hands-on LLM/NLP work
Practical experience building LLM-based applications (agents, multi-turn systems, evaluators)
Understanding of model fine-tuning, embedding optimization, and prompt engineering
Experience with LLM APIs (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI)
Knowledge of LLM orchestration frameworks (LangChain, LlamaIndex, Pydantic AI, custom solutions)
Familiarity with model architectures and when to fine-tune vs prompt engineer
Strong experience deploying ML models to production at scale
Experience with model serving frameworks (vLLM preferred; TensorRT-LLM, Ray Serve, or similar a plus)
Kubernetes and Docker proficiency for ML workload orchestration

Job Responsibility

Build and deploy LLM-based agents and multi-step evaluation workflows
Fine-tune models, optimize embeddings, and manage model weights and artifacts
Deploy and scale ML services on Kubernetes with proper monitoring and resource management
Implement experiment tracking, model versioning, and deployment automation
Develop observability dashboards for ML metrics, costs, latency, and quality
Optimize LLM API usage through caching, batching, and intelligent routing strategies
Manage vector database infrastructure and semantic search systems
Create CI/CD pipelines for ML artifacts and automated testing frameworks
Collaborate with ML researchers to productionize prototypes and scale experiments

Fulltime

AI Staff Engineer

At Multiverse, we believe technology should empower everyone to achieve their po...

Location

United Kingdom, London

Salary:

Not provided

Multiverse

Expiration Date

Until further notice

Requirements

Proven Staff-Level experience: 6-7 years in software engineering with strong understanding of Applied AI fundamentals
Experience working in cross-functional product teams
LLM Expertise: Experienced in working with large language models (e.g., GPT, Claude, Gemini Pro) for production use cases, including prompt engineering, evaluation, and safety & inclusivity considerations
Strong Engineering Skills: Proficient in Python and TypeScript, with experience building APIs, microservices, and cloud-native applications
Experience with AI Tools: Familiarity with emerging AI tooling platforms such as Cursor and Gemini is highly desirable
Cloud & MLOps: Practical experience deploying AI solutions on AWS, with a strong grasp of version control, observability, and evaluation pipelines
Data Skills: Skilled at working with structured and unstructured data, applying preprocessing and feature engineering techniques
User Focus: You can translate complex AI capabilities into product experiences that feel effortless and intuitive
Collaborative Approach: You work best in creative and cross-functional teams and thrive when building together
Growth Mindset: You’re curious, open to feedback, and excited to share what you learn while contributing to an inclusive, high-performing culture

Job Responsibility

Design, Architect & Deliver AI Solutions: Partner with Product, Design, and Data teams to shape and deliver AI-powered features that generate real impact for our learners and value for our customers, and that align with Multiverse’s mission
Establish LLM Best Practices: Define the technical standards and governance for leveraging Large Language Models (LLMs), including design, fine-tuning, and integration for high-impact production use cases such as content generation, semantic search, summarisation, and personalised learning experiences
Build & Integrate Models: Develop, fine-tune, and embed machine learning models into production systems using tools like Cursor and Gemini, ensuring they are fast, scalable, and dependable
Own the End-to-End Lifecycle: Take responsibility for the journey from raw data through experimentation, deployment to users, and continuous iteration
Measure What Matters: Track the performance, accuracy, and adoption of AI features, and use those insights to drive constant improvement
Mentor and Scale Expertise: Mentor and coach engineers across teams, sharing your deep expertise to make AI approachable and set the direction for best practices, significantly elevating the team's capabilities
Lead in MLOps & Cloud Infrastructure: Build robust pipelines for deployment and monitoring using AWS cloud services and modern MLOps best practices
Champion Innovation: Keep us ahead of the curve by exploring new AI tools, including Cursor and Gemini, and applying them to create exceptional user experiences
Drive Organisational Adoption: Champion new technologies and approaches (including AI-assisted tools), driving their successful adoption across multiple teams while balancing experimentation with pragmatic delivery
Cross-Team Influence: Act as a key technical advisor and connector across product, design, and engineering, ensuring alignment on strategic initiatives

What we offer

Time off - 27 days holiday, plus 5 additional days off: 1 life event day, 2 volunteer days, 2 company-wide wellbeing days (M-Powered Weekend) and 8 bank holidays per year
Health & Wellness - private medical insurance with Bupa, a medical cashback scheme, life insurance, gym membership & wellness resources through Wellhub, and access to Spill, all-in-one mental health support
Hybrid work offering - for most roles we collaborate in the office three days per week with the exception of Coaches and Instructors who collaborate in the office once a month
Work-from-anywhere scheme - you'll have the opportunity to work from anywhere, up to 10 days per year
Space to connect: Beyond the desk, we make time for weekly catch-ups, seasonal celebrations, and have a kitchen that’s always stocked!

Fulltime

LLM - Senior Staff Engineer - Python + Machine Learning

AquSag is seeking a hands-on Machine Learning Senior Staff Engineer to lead cros...

Location

Salary:

40.00 - 60.00 USD / Hour

AquSag Technologies

Expiration Date

Until further notice

Requirements

9+ years of strong background in Machine Learning, NLP, and modern deep learning architectures (Transformers, LLMs)
Hands-on experience with frameworks such as PyTorch, TensorFlow, Hugging Face, or DeepSpeed
Hands-on experience in Docker for Production deployment
Proven experience managing teams delivering ML/LLM models in production environments
Knowledge of distributed training, GPU/TPU optimization, and cloud platforms (AWS, GCP, Azure)
Familiarity with MLOps tools like MLflow, Kubeflow, or Vertex AI for scalable ML pipelines
Excellent leadership, communication, and cross-functional collaboration skills
Bachelor’s or Master’s in Computer Science, Engineering, or related field (PhD preferred)
Overlap of 6 hours with PST time zone is mandatory
Commitments Required: 8 hours per day with overlap of 6 hours with PST

Job Responsibility

Lead and mentor a cross-functional team of ML engineers, data scientists, and MLOps professionals
Oversee the full lifecycle of LLM and ML projects — from data collection to training, evaluation, and deployment
Collaborate with Research, Product, and Infrastructure teams to define goals, milestones, and success metrics
Provide technical direction on large-scale model training, fine-tuning, and distributed systems design
Implement best practices in MLOps, model governance, experiment tracking, and CI/CD for ML
Manage compute resources and budgets, and ensure compliance with data security and responsible AI standards
Communicate progress, risks, and results to stakeholders and executives effectively

Fulltime

Staff, Software Engineer - Backend

Walmart's Enterprise Business Services (EBS) is a powerhouse of seven exceptiona...

Location

United States, Bentonville

Salary:

110000.00 - 220000.00 USD / Year

Walmart

Expiration Date

Until further notice

Requirements

Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 4 years' experience in software engineering or related area
6 years' experience in software engineering or related area
Python guru with a proven track record of writing high-performing, production-quality code
Hands-on experience designing and building Python-based web services in a production setting (FastAPI experience preferred)
Deep familiarity with version control using Git in collaborative team environments
Comfortable working with Linux environments and containerization technologies such as Docker
4+ years of industry experience with demonstrated ownership and delivery of software products
Hands-on experience developing or deploying GenAI-based applications
Experience working with or integrating open-source and/or commercial GenAI libraries/frameworks such as Hugging Face Transformers, LangChain, OpenAI API, or similar
Ability to productionize and evaluate GenAI models

Job Responsibility

Design and develop platform features enabling advanced semantic routing for GenAI-powered services
Build and maintain evaluation pipelines for semantic router data
Collaborate with applied researchers and data scientists to continuously improve semantic routing algorithms
Develop and implement agent-to-agent (A2A) communication protocols
Contribute to the design and development of platform features using microservices (FastAPI) and event-driven architecture (Kafka, SSE, WebSocket)—all in Python
Uphold engineering and operational excellence standards
Support operational excellence for semantic routing and agent communication systems
Stay current with GenAI and multi-agent system best practices
Be an active member of a dynamic team
Support production operations by participating in on-call rotations

What we offer

Medical coverage
Vision coverage
Dental coverage
401(k) match
Stock purchase plan
Paid maternity and parental leave
PTO
Short-term disability
Long-term disability
Company discounts

Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Chevy Chase; New York City; Palo Alto

Salary:

115000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Chevy Chase; New York City; Palo Alto

Salary:

115000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Palo Alto

Salary:

90000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime
