Staff ML Infrastructure Engineer

180000.00 - 240000.00 USD / Year · Job Posted February 21, 2026

Job Description

Playlab seeks a Staff Machine Learning Engineer to join our growing Engineering team. As a Staff ML Infrastructure Engineer, you'll design the systems that keep AI accessible as we grow: balancing cutting-edge capabilities with cost efficiency, powering research into what works in educational AI, and building toward a future where sophisticated AI can run anywhere in the world.

Job Responsibility

  • Design, build, and maintain production ML infrastructure that balances performance, cost, and reliability
  • Own data quality and research dataset creation - ensure data is properly scrubbed, documented, and useful for research partners
  • Stay on top of ML infrastructure technologies and techniques - from model serving to cost optimization to observability tools
  • Work cross-functionally with ML engineers, backend engineers, and product to ensure infrastructure supports real needs
  • Balance innovation with operational excellence - experiment with new approaches while maintaining system reliability and data quality
  • Mentor engineers on ML operations, cost optimization, and production ML best practices

Requirements

  • 7+ years building production ML/data systems, with experience in ML operations and infrastructure
  • Strong experience with model serving, orchestration, and optimization in production environments
  • Proficient in Python and data pipeline technologies (Airflow, ETL tools, etc.)
  • Experience with cloud infrastructure (AWS preferred) and containerization (Kubernetes, Docker)
  • Experience with cost optimization strategies for LLM-based systems
  • Thrive in high-agency, high-collaboration cultures
  • Strong communication skills that make remote-first work effective

Nice to have

  • Experience in education or building in edtech
  • Experience with educational technology or mission-driven organizations
  • Experience with designing creative platforms
  • Experience with LiteLLM or similar model routing frameworks
  • Background in privacy-preserving ML or PII handling
  • Experience building research data infrastructure
  • Contributions to open source ML infrastructure projects

Similar Jobs for Staff ML Infrastructure Engineer

8 matching positions

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...
Location: Canada; United States
Salary: 195000.00 - 285000.00 USD / Year
Apollo.io
Expiration Date: Until further notice
Requirements
  • 5+ years of hands-on software or infrastructure engineering experience
  • 2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
  • Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
  • Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
  • Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, NewRelic) and performance tuning
  • Strong grounding in networking, security, and reliability principles
  • Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale
Job Responsibility
  • Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
  • Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
  • Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
  • Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
  • Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
  • Run effective 1:1s, career development conversations, and quarterly performance reviews
  • Support recruiting efforts to attract top engineering talent across time zones
What we offer
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
  • Fulltime

Staff Platform Engineer

Join our dynamic team as a Compute Platform Engineer and play a pivotal role in ...
Location: Canada, Vancouver
Salary: 190000.00 - 240000.00 CAD / Year
Inworld AI
Expiration Date: Until further notice
Requirements
  • 7 years of experience in software engineering
  • 5 years of experience with infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Kustomize manifests/Helm charts for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting language such as Golang, Python, or Bash
Job Responsibility
  • Work closely with backend and ML engineering teams to design, deploy, and maintain reliable, high-performance, and secure cloud infrastructure for our AI engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Conduct root cause analysis to identify critical issues and develop automated solutions to prevent recurrence
  • Develop and share best practices to improve automation and efficiency across our engineering teams
What we offer
  • bonus
  • equity
  • benefits
  • Fulltime

Member of Technical Staff, Cloud Infrastructure

As a Software Engineer on our Cloud Infrastructure team, you'll be at the forefr...
Location: United States, New York, NY; San Mateo, CA; Redwood City, CA
Salary: 175000.00 - 220000.00 USD / Year
Fireworks AI
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 5+ years of experience designing and building backend infrastructure in cloud environments (e.g., AWS, GCP, Azure)
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, TensorFlow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Strong software development skills in languages like Python or C++
  • Deep understanding of distributed systems fundamentals: scheduling, orchestration, storage, networking, and compute optimization
Job Responsibility
  • Architect and build scalable, resilient, and high-performance backend infrastructure to support distributed training, inference, and data processing pipelines
  • Lead technical design discussions, mentor other engineers, and establish best practices for building and operating large-scale ML infrastructure
  • Design and implement core backend services (e.g., job schedulers, resource managers, autoscalers, model serving layers) with a focus on efficiency and low latency
  • Drive infrastructure optimization initiatives, including compute cost reduction, storage lifecycle management, and network performance tuning
  • Collaborate cross-functionally with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions
  • Continuously evaluate and integrate cloud-native and open-source technologies (e.g., Kubernetes, Ray, Kubeflow, MLFlow) to enhance our platform’s capabilities and reliability
  • Own end-to-end systems from design to deployment and observability, with a strong emphasis on reliability, fault tolerance, and operational excellence
What we offer
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package
  • Fulltime

Staff Platform Engineer

Join our dynamic team as a Compute Platform Engineer and play a pivotal role in ...
Location: United States, Mountain View, California
Salary: 180000.00 - 280000.00 USD / Year
Inworld AI
Expiration Date: Until further notice
Requirements
  • 7 years of experience in software engineering
  • 5 years of experience with infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Kustomize manifests/Helm charts for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting language such as Golang, Python, or Bash
  • Candidates must be based in the SF Bay Area or willing to relocate (you will be working on-site in our South Bay office a few days a week)
Job Responsibility
  • Work closely with backend and ML engineering teams to design, deploy, and maintain reliable, high-performance, and secure cloud infrastructure for our AI engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Conduct root cause analysis to identify critical issues and develop automated solutions to prevent recurrence
  • Develop and share best practices to improve automation and efficiency across our engineering teams
What we offer
  • equity and benefits
  • Fulltime

Staff Backend Engineer

Kalepa is looking for a Staff Backend Engineer to work on its AI Copilot platfor...
Salary: 145000.00 - 185000.00 USD / Year
Kalepa
Expiration Date: Until further notice
Requirements
  • 8+ years of relevant software engineering experience
  • Excellent development skills including design, debugging and problem solving
  • Bachelor's or master's degree in computer science or a related field
  • Experience with Python3 or other OO languages (Java, C++, C#, etc.)
  • Experience with AWS (EC2, Lambda, etc.) and serverless technologies
  • Experience with relational databases, preference for PostgreSQL
  • Experience working on distributed systems creating scalable, fault-tolerant infrastructure
  • Experience building data-driven microservices leveraging RESTful APIs
  • Experience with tools such as Docker, Git, GitHub, Flask, NumPy, Pandas
Job Responsibility
  • Work on advanced systems including NLP, firmographic data, entity resolution
  • Solve problems at the intersection of large and performant data pipelines, distributed systems, machine learning models, and robust infrastructure
  • Collaborate with a global team of full-stack, data, ML, and DevOps engineers
  • Build scalable and reliable backend solutions
What we offer
  • Competitive salary (based on experience level)
  • Significant equity options package
  • 20 days of PTO a year
  • Global team offsites
  • Healthy living/gym stipend
  • Mobile phone bill stipend
  • Continuing education credits
  • Fulltime

Staff Software Engineer, Backend

The Staff Engineer will work closely with AI/ML engineers, product managers, app...
Location: United States, NYC
Salary: 160000.00 - 190000.00 USD / Year
Conductor
Expiration Date: Until further notice
Requirements
  • Completed studies in Computer Science, Mathematics, Engineering, or a related field, or equivalent professional experience
  • 8+ years of experience in software development, with experience in product-driven companies
  • Strong expertise in system design, distributed computing, and scalable architecture patterns for handling large datasets and high-throughput applications
  • Proficiency in multiple programming languages with strong Python coding skills. Experience with Java is highly valued
  • Strong database experience including both SQL and NoSQL systems, with knowledge of data modeling and optimization techniques
  • Experience with AI/ML technologies including LLMs, vector databases (e.g., Milvus), embeddings, and ML frameworks
  • Knowledge of MLOps practices, model deployment, and AI system integration in production environments
  • Experience working across the full software development lifecycle including CI/CD, monitoring, testing, and production deployment
  • Proven track record of technical leadership, mentoring engineers, and driving engineering excellence within teams
  • Up to date with rapidly evolving technologies, with a demonstrated ability to evaluate and adopt new tools and frameworks
Job Responsibility
  • Lead the technical architecture, design, and implementation of large-scale distributed systems and data platforms to support customer needs and business growth
  • Oversee the planning, execution, and successful delivery of complex engineering projects, ensuring adherence to engineering best practices and quality standards
  • Design and build scalable, high-performance backend systems and APIs that handle millions of requests and large datasets efficiently
  • Architect robust data processing pipelines and ETL workflows using modern cloud technologies and distributed computing frameworks
  • Drive technical decision-making across the engineering organization, evaluating trade-offs and establishing engineering standards and practices
  • Lead cross-functional collaboration with product, AI/ML engineering, data engineering, and infrastructure teams to deliver comprehensive solutions
  • Build and maintain CI/CD pipelines, monitoring systems, and deployment automation to ensure reliable software delivery
  • Implement AI/ML capabilities including LLM integration, vector databases, and intelligent content processing workflows
  • Mentor senior and junior engineers, fostering technical excellence and knowledge sharing within the engineering organization
What we offer
  • 100% covered employee medical plan
  • dental & vision plans
  • 401(k) with employer contribution
  • an unlimited vacation policy
  • 10 sick days
  • short-term disability
  • long-term disability
  • generous paid parental leave
  • employee assistance program
  • flexible savings accounts
  • Fulltime

Staff Software Engineer

As a Staff Forward Deployed Engineer (FDE) at Invisible, you'll lead high-impact...
Location: United States, Austin; New York; San Francisco Bay Area; Washington DC–Baltimore
Salary: 213000.00 - 300000.00 USD / Year
Invisible Technologies
Expiration Date: Until further notice
Requirements
  • 8+ years of software engineering experience, including significant time spent building data, ML, or backend systems
  • Deep proficiency in Python with hands-on experience using Hugging Face, LangChain, OpenAI, Pinecone, and related ecosystems
  • Skilled in full-stack and API-based deployment patterns, including Docker, FastAPI, Kubernetes, and cloud environments (GCP, AWS)
  • Experienced with workflow orchestration libraries, pub/sub systems (Kafka), and schema governance
  • Expertise in data governance and operations, including Unity Catalog and policy management, cluster/job orchestration, data contracts and quality enforcement, Delta/ETL pipelines, and replay processes
  • Strong product and system design instincts — you understand business needs and how to translate them into technical architecture
  • Experience building usable systems from messy data and ambiguous requirements
  • Excellent communication and client-facing skills; you’ve led conversations with technical and non-technical stakeholders alike
  • Proven experience owning projects from scoping through deployment in ambiguous, high-stakes environments
Job Responsibility
  • Partner with delivery and executive stakeholders to scope, design, and lead implementation of AI-driven solutions
  • Identify transformational opportunities in messy, ambiguous workflows and turn them into repeatable systems
  • Lead architecture design and trade-off discussions across performance, scalability, cost, and reliability
  • Own projects from first discovery call through full deployment — including client-facing delivery, internal coordination, and post-launch iteration
  • Build shared infrastructure, reusable components, and internal playbooks to level up the team
  • Coach and mentor mid-level engineers and help shape the culture of forward-deployed AI engineering at Invisible
What we offer
  • bonus
  • equity
  • benefits
  • Fulltime

Staff MLOps Engineer

At Inworld, we’re building the AI framework behind the next generation of real-t...
Location: Canada, Vancouver
Salary: 190000.00 - 240000.00 CAD / Year
Inworld AI
Expiration Date: Until further notice
Requirements
  • 7+ years of software engineering experience
  • 5+ years of infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Helm charts/Kustomize manifests for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting language such as Golang, Python, or Bash
  • Knowledge of SLURM or similar job schedulers for distributed training
  • Experience with data pipeline and workflow management tools
  • Desire to work at a fast-growing Series A startup, comfortable with uncertainty, owning and scaling new products, and embracing an experimental and iterative development process
Job Responsibility
  • Build and scale MLOps systems to streamline the end-to-end ML model lifecycle on the Inworld AI platform, from training to deployment
  • Design and implement robust model training, evaluation, and release pipelines
  • Collaborate cross-functionally with ML and backend teams to design, deploy, and maintain scalable, secure infrastructure for Inworld’s AI Engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Provide technical leadership in ML engineering best practices, raise the technical bar, and mentor junior engineers in MLOps principles
What we offer
  • equity
  • benefits
  • Fulltime