CrawlJobs Logo

Staff MLOps Engineer

Canada, Vancouver 190000.00 - 240000.00 CAD / Year · Job Posted December 09, 2025
Apply Position
Job Link Share

Job Description

At Inworld, we’re building the AI framework behind the next generation of real-time, immersive applications. As a Staff MLOps Engineer, you’ll design, build and scale the infrastructure that powers intelligent AI agents across massive consumer experiences while ensuring performance, reliability, and speed at every level.

Job Responsibility

  • Build and scale MLOps systems to streamline the end-to-end ML model lifecycle on the Inworld AI platform, from training to deployment
  • Design and implement robust model training, evaluation, and release pipelines
  • Collaborate cross-functionally with ML and backend teams to design, deploy, and maintain scalable secure infrastructure for Inworld’s AI Engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Provide technical leadership in ML engineering best practices, raise the technical bar, and mentor junior engineers in MLOps principles

Requirements

  • 7+ years of software engineering experience
  • 5+ years of infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Helm charts/Kustomize manifests for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting languages such as Golang, Python, and Bash
  • Knowledge of SLURM or similar job schedulers for distributed training
  • Experience with data pipeline and workflow management tools
  • Desire to work at a fast-growing Series A startup, comfortable with uncertainty, owning and scaling new products, and embracing an experimental and iterative development process

Nice to have

  • Familiarity with open source LLM and open source serving solution (e.g. vLLM or llama.cpp, kserve, etc)
  • Experience with bare metal GPUs

What we offer

  • equity
  • benefits

Looking for more opportunities?

Search for other job offers that match your skills and interests.

Similar Jobs for

Staff MLOps Engineer

8 matching positions

Staff MLOps Engineer

At Inworld, we’re building the AI framework behind the next generation of real-t...
Location
Location
United States , Mountain View
Salary
Salary:
180000.00 - 280000.00 USD / Year
inworld.ai Logo
Inworld AI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 7+ years of software engineering experience, with 5+ years of infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Helm charts/Kustomize manifests for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting languages such as Golang, Python, and Bash
  • Knowledge of SLURM or similar job schedulers for distributed training
  • Experience with data pipeline and workflow management tools
  • Desire to work at a fast-growing Series A startup, comfortable with uncertainty, owning and scaling new products, and embracing an experimental and iterative development process
  • In-office location: Mountain View, CA, United States. You must be available for hybrid work
Job Responsibility
Job Responsibility
  • Build and scale MLOps systems to streamline the end-to-end ML model lifecycle on the Inworld AI platform, from training to deployment
  • Design and implement robust model training, evaluation, and release pipelines
  • Collaborate cross-functionally with ML and backend teams to design, deploy, and maintain scalable secure infrastructure for Inworld’s AI Engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Provide technical leadership in ML engineering best practices, raise the technical bar, and mentor junior engineers in MLOps principles
What we offer
What we offer
  • equity and benefits
  • Fulltime
Read More
Arrow Right

Sr Staff ML Engineer - Production & MLOps Focus - GenAI Security Platform

Join our team building a cutting-edge multi-tenanted GenAI Security Platform tha...
Location
Location
India , Bengaluru
Salary
Salary:
Not provided
paloaltonetworks.com Logo
Palo Alto Networks
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 4+ years of ML engineering experience with hands-on LLM/NLP work
  • Practical experience building LLM-based applications (agents, multi-turn systems, evaluators)
  • Understanding of model fine-tuning, embedding optimization, and prompt engineering
  • Experience with LLM APIs (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI)
  • Knowledge of LLM orchestration frameworks ( LangChain, LlamaIndex, Pydantic AI, custom solutions)
  • Familiarity with model architectures and when to fine-tune vs prompt engineer
  • Strong experience deploying ML models to production at scale
  • Experience with Model serving frameworks (vLLM preferred
  • TensorRT-LLM, Ray Serve, or similar a plus)
  • Kubernetes and Docker proficiency for ML workload orchestration
Job Responsibility
Job Responsibility
  • Build and deploy LLM-based agents and multi-step evaluation workflows
  • Fine-tune models, optimize embeddings, and manage model weights and artifacts
  • Deploy and scale ML services on Kubernetes with proper monitoring and resource management
  • Implement experiment tracking, model versioning, and deployment automation
  • Develop observability dashboards for ML metrics, costs, latency, and quality
  • Optimize LLM API usage through caching, batching, and intelligent routing strategies
  • Manage vector database infrastructure and semantic search systems
  • Create CI/CD pipelines for ML artifacts and automated testing frameworks
  • Collaborate with ML researchers to productionize prototypes and scale experiments
  • Fulltime
Read More
Arrow Right

AI Staff Engineer

At Multiverse, we believe technology should empower everyone to achieve their po...
Location
Location
United Kingdom , London
Salary
Salary:
Not provided
multiverse.io Logo
Multiverse
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Proven Staff-Level experience: 6-7 years in software engineering with strong understanding of Applied AI fundamentals
  • Experience working in cross-functional product teams
  • LLM Expertise: Experienced in working with large language models (e.g., GPT, Claude, Gemini Pro) for production use cases, including prompt engineering, evaluation, and safety & inclusivity considerations
  • Strong Engineering Skills: Proficient in Python and TypeScript, with experience building APIs, microservices, and cloud-native applications
  • Experience with AI Tools: Familiarity with emerging AI tooling platforms such as Cursor and Gemini is highly desirable
  • Cloud & MLOps: Practical experience deploying AI solutions on AWS, with a strong grasp of version control, observability, and evaluation pipelines
  • Data Skills: Skilled at working with structured and unstructured data, applying preprocessing and feature engineering techniques
  • User Focus: You can translate complex AI capabilities into product experiences that feel effortless and intuitive
  • Collaborative Approach: You work best in creative and cross-functional teams and thrive when building together
  • Growth Mindset: You’re curious, open to feedback, and excited to share what you learn while contributing to an inclusive, high-performing culture
Job Responsibility
Job Responsibility
  • Design, Architect & Deliver AI Solutions: Partner with Product, Design, and Data teams to shape and deliver AI-powered features that generate real impact to our learners, value for our customers, and align with Multiverse’s mission
  • Establish LLM Best Practices: Define the technical standards and governance for leveraging Large Language Models (LLMs), including design, fine-tuning, and integration for high-impact production use cases such as content generation, semantic search, summarisation, and personalised learning experiences
  • Build & Integrate Models: Develop, fine-tune, and embed machine learning models into production systems using tools like Cursor and Gemini, ensuring they are fast, scalable, and dependable
  • Own the End-to-End Lifecycle: Take responsibility for the journey from raw data through experimentation, deployment to users, and continuous iteration
  • Measure What Matters: Track the performance, accuracy, and adoption of AI features, and use those insights to drive constant improvement
  • Mentor and Scale Expertise: Mentor and coach engineers across teams, sharing your deep expertise to make AI approachable and set the direction for best practices, significantly elevating the team's capabilities
  • Lead in MLOps & Cloud Infrastructure: Build robust pipelines for deployment, and monitoring using AWS cloud services and modern MLOps best practices
  • Champion Innovation: Keep us ahead of the curve by exploring new AI tools, including Cursor and Gemini, and applying them to create exceptional user experiences
  • Drive Organisational Adoption: Champion new technologies and approaches (including AI-assisted tools), driving their successful adoption across multiple teams while balancing experimentation with pragmatic delivery
  • Cross-Team Influence: Act as a key technical advisor and connector across product, design, and engineering, ensuring alignment on strategic initiatives
What we offer
What we offer
  • Time off - 27 days holiday, plus 5 additional days off: 1 life event day, 2 volunteer days, 2 company-wide wellbeing days (M-Powered Weekend) and 8 bank holidays per year
  • Health & Wellness- private medical Insurance with Bupa, a medical cashback scheme, life insurance, gym membership & wellness resources through Wellhub and access to Spill - all in one mental health support
  • Hybrid work offering - for most roles we collaborate in the office three days per week with the exception of Coaches and Instructors who collaborate in the office once a month
  • Work-from-anywhere scheme - you'll have the opportunity to work from anywhere, up to 10 days per year
  • Space to connect: Beyond the desk, we make time for weekly catch-ups, seasonal celebrations, and have a kitchen that’s always stocked!
  • Fulltime
Read More
Arrow Right

LLM - Senior Staff Engineer - Python + Machine Learning

AquSag is seeking a hands-on Machine Learning Senior Staff Engineer to lead cros...
Location
Location
Salary
Salary:
40.00 - 60.00 USD / Hour
aqusag.com Logo
AquSag Technologies
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 9+ yrs of strong background in Machine Learning, NLP, and modern deep learning architectures (Transformers, LLMs)
  • Hands-on experience with frameworks such as PyTorch, TensorFlow, Hugging Face, or DeepSpeed
  • Hands-on experience in Docker for Production deployment
  • Proven experience managing teams delivering ML/LLM models in production environments
  • Knowledge of distributed training, GPU/TPU optimization, and cloud platforms (AWS, GCP, Azure)
  • Familiarity with MLOps tools like MLflow, Kubeflow, or Vertex AI for scalable ML pipelines
  • Excellent leadership, communication, and cross-functional collaboration skills
  • Bachelor’s or Master’s in Computer Science, Engineering, or related field (PhD preferred)
  • Overlap of 6 hours with PST time zone is mandatory
  • Commitments Required: 8 hours per day with overlap of 6 hours with PST
Job Responsibility
Job Responsibility
  • Lead and mentor a cross-functional team of ML engineers, data scientists, and MLOps professionals
  • Oversee the full lifecycle of LLM and ML projects — from data collection to training, evaluation, and deployment
  • Collaborate with Research, Product, and Infrastructure teams to define goals, milestones, and success metrics
  • Provide technical direction on large-scale model training, fine-tuning, and distributed systems design
  • Implement best practices in MLOps, model governance, experiment tracking, and CI/CD for ML
  • Manage compute resources, budgets, and ensure compliance with data security and responsible AI standards
  • Communicate progress, risks, and results to stakeholders and executives effectively
  • Fulltime
Read More
Arrow Right

Staff, Software Engineer - Backend

Walmart's Enterprise Business Services (EBS) is a powerhouse of seven exceptiona...
Location
Location
United States , Bentonville
Salary
Salary:
110000.00 - 220000.00 USD / Year
walmart.com Logo
Walmart
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 4 years' experience in software engineering or related area
  • 6 years' experience in software engineering or related area
  • Python guru with a proven track record of writing high-performing, production-quality code
  • Hands-on experience designing and building Python-based web services in a production setting (FastAPI experience preferred)
  • Deep familiarity with version control using Git in collaborative team environments
  • Comfortable working with Linux environments and containerization technologies such as Docker
  • 4+ years of industry experience with demonstrated ownership and delivery of software products
  • Hands-on experience developing or deploying GenAI-based applications
  • Experience working with or integrating open-source and/or commercial GenAI libraries/frameworks such as Hugging Face Transformers, LangChain, OpenAI API, or similar
  • Ability to productionize and evaluate GenAI models
Job Responsibility
Job Responsibility
  • Design and develop platform features enabling advanced semantic routing for GenAI-powered services
  • Build and maintain evaluation pipelines for semantic router data
  • Collaborate with applied researchers and data scientists to continuously improve semantic routing algorithms
  • Develop and implement agent-to-agent (A2A) communication protocols
  • Contribute to the design and development of platform features using microservices (FastAPI) and event-driven architecture (Kafka, SSE, WebSocket)—all in Python
  • Uphold engineering and operational excellence standards
  • Support operational excellence for semantic routing and agent communication systems
  • Stay current with GenAI and multi-agent system best practices
  • Be an active member of a dynamic team
  • Support production operations by participating in on-call rotations
What we offer
What we offer
  • Medical coverage
  • Vision coverage
  • Dental coverage
  • 401(k) match
  • Stock purchase plan
  • Paid maternity and parental leave
  • PTO
  • Short-term disability
  • Long-term disability
  • Company discounts
  • Fulltime
Read More
Arrow Right

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
Location
United States , Chevy Chase; New York City; Palo Alto
Salary
Salary:
115000.00 USD / Year
geico.com Logo
Geico
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python
  • strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation
  • a 401K savings plan vested from day one that offers a 6% match
  • performance and recognition-based incentives
  • and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility- We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime
Read More
Arrow Right

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
Location
United States , Chevy Chase; New York City; Palo Alto
Salary
Salary:
115000.00 USD / Year
geico.com Logo
Geico
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python
  • strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation
  • a 401K savings plan vested from day one that offers a 6% match
  • performance and recognition-based incentives
  • and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility- We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime
Read More
Arrow Right

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
Location
United States , Palo Alto
Salary
Salary:
90000.00 USD / Year
geico.com Logo
Geico
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python
  • strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation
  • a 401K savings plan vested from day one that offers a 6% match
  • performance and recognition-based incentives
  • and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility- We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime
Read More
Arrow Right