Staff MLOps Engineer

United States, Mountain View 180000.00 - 280000.00 USD / Year · Job Posted December 09, 2025

Job Description

At Inworld, we’re building the AI framework behind the next generation of real-time, immersive applications. As a Staff MLOps Engineer, you’ll design, build and scale the infrastructure that powers intelligent AI agents across massive consumer experiences while ensuring performance, reliability, and speed at every level.

Job Responsibility

  • Build and scale MLOps systems to streamline the end-to-end ML model lifecycle on the Inworld AI platform, from training to deployment
  • Design and implement robust model training, evaluation, and release pipelines
  • Collaborate cross-functionally with ML and backend teams to design, deploy, and maintain scalable, secure infrastructure for Inworld’s AI Engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Provide technical leadership in ML engineering best practices, raise the technical bar, and mentor junior engineers in MLOps principles

Requirements

  • 7+ years of software engineering experience, with 5+ years of infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Helm charts/Kustomize manifests for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting language, such as Golang, Python, or Bash
  • Knowledge of SLURM or similar job schedulers for distributed training
  • Experience with data pipeline and workflow management tools
  • Desire to work at a fast-growing Series A startup, comfortable with uncertainty, owning and scaling new products, and embracing an experimental and iterative development process
  • In-office location: Mountain View, CA, United States. You must be available for hybrid work

Nice to have

  • Familiarity with open-source LLMs and open-source serving solutions (e.g., vLLM, llama.cpp, KServe) is a plus
  • Experience with bare metal GPUs (optional)

What we offer

  • Equity
  • Benefits

Similar Jobs for Staff MLOps Engineer

8 matching positions

Staff MLOps Engineer

At Inworld, we’re building the AI framework behind the next generation of real-t...
Location
Canada, Vancouver
Salary
190000.00 - 240000.00 CAD / Year
Inworld AI
Expiration Date
Until further notice
Requirements
  • 7+ years of software engineering experience
  • 5+ years of infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Helm charts/Kustomize manifests for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting language, such as Golang, Python, or Bash
  • Knowledge of SLURM or similar job schedulers for distributed training
  • Experience with data pipeline and workflow management tools
  • Desire to work at a fast-growing Series A startup, comfortable with uncertainty, owning and scaling new products, and embracing an experimental and iterative development process
Job Responsibility
  • Build and scale MLOps systems to streamline the end-to-end ML model lifecycle on the Inworld AI platform, from training to deployment
  • Design and implement robust model training, evaluation, and release pipelines
  • Collaborate cross-functionally with ML and backend teams to design, deploy, and maintain scalable, secure infrastructure for Inworld’s AI Engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Provide technical leadership in ML engineering best practices, raise the technical bar, and mentor junior engineers in MLOps principles
What we offer
  • equity
  • benefits
  • Fulltime

Sr Staff ML Engineer - Production & MLOps Focus - GenAI Security Platform

Join our team building a cutting-edge multi-tenanted GenAI Security Platform tha...
Location
India, Bengaluru
Salary
Not provided
Palo Alto Networks
Expiration Date
Until further notice
Requirements
  • 4+ years of ML engineering experience with hands-on LLM/NLP work
  • Practical experience building LLM-based applications (agents, multi-turn systems, evaluators)
  • Understanding of model fine-tuning, embedding optimization, and prompt engineering
  • Experience with LLM APIs (OpenAI, Anthropic, AWS Bedrock, Azure OpenAI)
  • Knowledge of LLM orchestration frameworks (LangChain, LlamaIndex, Pydantic AI, custom solutions)
  • Familiarity with model architectures and when to fine-tune vs prompt engineer
  • Strong experience deploying ML models to production at scale
  • Experience with model serving frameworks (vLLM preferred; TensorRT-LLM, Ray Serve, or similar a plus)
  • Kubernetes and Docker proficiency for ML workload orchestration
Job Responsibility
  • Build and deploy LLM-based agents and multi-step evaluation workflows
  • Fine-tune models, optimize embeddings, and manage model weights and artifacts
  • Deploy and scale ML services on Kubernetes with proper monitoring and resource management
  • Implement experiment tracking, model versioning, and deployment automation
  • Develop observability dashboards for ML metrics, costs, latency, and quality
  • Optimize LLM API usage through caching, batching, and intelligent routing strategies
  • Manage vector database infrastructure and semantic search systems
  • Create CI/CD pipelines for ML artifacts and automated testing frameworks
  • Collaborate with ML researchers to productionize prototypes and scale experiments
  • Fulltime

AI Staff Engineer

At Multiverse, we believe technology should empower everyone to achieve their po...
Location
United Kingdom, London
Salary
Not provided
Multiverse
Expiration Date
Until further notice
Requirements
  • Proven Staff-Level experience: 6-7 years in software engineering with strong understanding of Applied AI fundamentals
  • Experience working in cross-functional product teams
  • LLM Expertise: Experienced in working with large language models (e.g., GPT, Claude, Gemini Pro) for production use cases, including prompt engineering, evaluation, and safety & inclusivity considerations
  • Strong Engineering Skills: Proficient in Python and TypeScript, with experience building APIs, microservices, and cloud-native applications
  • Experience with AI Tools: Familiarity with emerging AI tooling platforms such as Cursor and Gemini is highly desirable
  • Cloud & MLOps: Practical experience deploying AI solutions on AWS, with a strong grasp of version control, observability, and evaluation pipelines
  • Data Skills: Skilled at working with structured and unstructured data, applying preprocessing and feature engineering techniques
  • User Focus: You can translate complex AI capabilities into product experiences that feel effortless and intuitive
  • Collaborative Approach: You work best in creative and cross-functional teams and thrive when building together
  • Growth Mindset: You’re curious, open to feedback, and excited to share what you learn while contributing to an inclusive, high-performing culture
Job Responsibility
  • Design, Architect & Deliver AI Solutions: Partner with Product, Design, and Data teams to shape and deliver AI-powered features that generate real impact to our learners, value for our customers, and align with Multiverse’s mission
  • Establish LLM Best Practices: Define the technical standards and governance for leveraging Large Language Models (LLMs), including design, fine-tuning, and integration for high-impact production use cases such as content generation, semantic search, summarisation, and personalised learning experiences
  • Build & Integrate Models: Develop, fine-tune, and embed machine learning models into production systems using tools like Cursor and Gemini, ensuring they are fast, scalable, and dependable
  • Own the End-to-End Lifecycle: Take responsibility for the journey from raw data through experimentation, deployment to users, and continuous iteration
  • Measure What Matters: Track the performance, accuracy, and adoption of AI features, and use those insights to drive constant improvement
  • Mentor and Scale Expertise: Mentor and coach engineers across teams, sharing your deep expertise to make AI approachable and set the direction for best practices, significantly elevating the team's capabilities
  • Lead in MLOps & Cloud Infrastructure: Build robust pipelines for deployment and monitoring using AWS cloud services and modern MLOps best practices
  • Champion Innovation: Keep us ahead of the curve by exploring new AI tools, including Cursor and Gemini, and applying them to create exceptional user experiences
  • Drive Organisational Adoption: Champion new technologies and approaches (including AI-assisted tools), driving their successful adoption across multiple teams while balancing experimentation with pragmatic delivery
  • Cross-Team Influence: Act as a key technical advisor and connector across product, design, and engineering, ensuring alignment on strategic initiatives
What we offer
  • Time off - 27 days holiday, plus 5 additional days off: 1 life event day, 2 volunteer days, 2 company-wide wellbeing days (M-Powered Weekend) and 8 bank holidays per year
  • Health & Wellness - private medical insurance with Bupa, a medical cashback scheme, life insurance, gym membership & wellness resources through Wellhub, and access to Spill (all-in-one mental health support)
  • Hybrid work offering - for most roles we collaborate in the office three days per week with the exception of Coaches and Instructors who collaborate in the office once a month
  • Work-from-anywhere scheme - you'll have the opportunity to work from anywhere, up to 10 days per year
  • Space to connect: Beyond the desk, we make time for weekly catch-ups, seasonal celebrations, and have a kitchen that’s always stocked!
  • Fulltime

LLM - Senior Staff Engineer - Python + Machine Learning

AquSag is seeking a hands-on Machine Learning Senior Staff Engineer to lead cros...
Location
Salary
40.00 - 60.00 USD / Hour
AquSag Technologies
Expiration Date
Until further notice
Requirements
  • 9+ years of experience with a strong background in Machine Learning, NLP, and modern deep learning architectures (Transformers, LLMs)
  • Hands-on experience with frameworks such as PyTorch, TensorFlow, Hugging Face, or DeepSpeed
  • Hands-on experience in Docker for Production deployment
  • Proven experience managing teams delivering ML/LLM models in production environments
  • Knowledge of distributed training, GPU/TPU optimization, and cloud platforms (AWS, GCP, Azure)
  • Familiarity with MLOps tools like MLflow, Kubeflow, or Vertex AI for scalable ML pipelines
  • Excellent leadership, communication, and cross-functional collaboration skills
  • Bachelor’s or Master’s in Computer Science, Engineering, or related field (PhD preferred)
  • Overlap of 6 hours with PST time zone is mandatory
  • Commitments Required: 8 hours per day with overlap of 6 hours with PST
Job Responsibility
  • Lead and mentor a cross-functional team of ML engineers, data scientists, and MLOps professionals
  • Oversee the full lifecycle of LLM and ML projects — from data collection to training, evaluation, and deployment
  • Collaborate with Research, Product, and Infrastructure teams to define goals, milestones, and success metrics
  • Provide technical direction on large-scale model training, fine-tuning, and distributed systems design
  • Implement best practices in MLOps, model governance, experiment tracking, and CI/CD for ML
  • Manage compute resources, budgets, and ensure compliance with data security and responsible AI standards
  • Communicate progress, risks, and results to stakeholders and executives effectively
  • Fulltime

Staff, Software Engineer - Backend

Walmart's Enterprise Business Services (EBS) is a powerhouse of seven exceptiona...
Location
United States, Bentonville
Salary
110000.00 - 220000.00 USD / Year
Walmart
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 4 years' experience in software engineering or related area, or 6 years' experience in software engineering or related area
  • Python guru with a proven track record of writing high-performing, production-quality code
  • Hands-on experience designing and building Python-based web services in a production setting (FastAPI experience preferred)
  • Deep familiarity with version control using Git in collaborative team environments
  • Comfortable working with Linux environments and containerization technologies such as Docker
  • 4+ years of industry experience with demonstrated ownership and delivery of software products
  • Hands-on experience developing or deploying GenAI-based applications
  • Experience working with or integrating open-source and/or commercial GenAI libraries/frameworks such as Hugging Face Transformers, LangChain, OpenAI API, or similar
  • Ability to productionize and evaluate GenAI models
Job Responsibility
  • Design and develop platform features enabling advanced semantic routing for GenAI-powered services
  • Build and maintain evaluation pipelines for semantic router data
  • Collaborate with applied researchers and data scientists to continuously improve semantic routing algorithms
  • Develop and implement agent-to-agent (A2A) communication protocols
  • Contribute to the design and development of platform features using microservices (FastAPI) and event-driven architecture (Kafka, SSE, WebSocket)—all in Python
  • Uphold engineering and operational excellence standards
  • Support operational excellence for semantic routing and agent communication systems
  • Stay current with GenAI and multi-agent system best practices
  • Be an active member of a dynamic team
  • Support production operations by participating in on-call rotations
What we offer
  • Medical coverage
  • Vision coverage
  • Dental coverage
  • 401(k) match
  • Stock purchase plan
  • Paid maternity and parental leave
  • PTO
  • Short-term disability
  • Long-term disability
  • Company discounts
  • Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
United States, Chevy Chase; New York City; Palo Alto
Salary
115000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Staff Software Engineer - AI/ML Platform

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
United States, Chevy Chase; New York City; Palo Alto
Salary
115000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
United States, Palo Alto
Salary
90000.00 USD / Year
Geico
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python; strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation; a 401K savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime