Senior ML Infrastructure Engineer

United States, San Francisco · 150000.00 - 210000.00 USD / Year · Job Posted January 16, 2026

Job Description

Parametric is building robots to reliably automate physical labor in the real world. As a Senior ML Infrastructure Engineer, you'll build the systems that power our entire autonomy stack. You'll design the infrastructure that enables our ML team to move fast, from data ingestion and model training to evaluation and deployment. Your work will directly determine how quickly we can iterate on models and ship improvements to robots in the field. This is an early-stage role where you'll define our ML infrastructure from the ground up. You'll work closely with research and systems engineers to build tooling that scales as we grow.

Job Responsibility

  • Design and implement robust ML infrastructure for dataset management, model training and evaluation, and deployment
  • Collaborate with ML engineers to gather requirements and develop plans
  • Build and operate cloud infrastructure (e.g., AWS, GCP) for machine learning workloads, in both experiments and production
  • Automate model evaluation, selection, and deployment

Requirements

  • Three or more years (or equivalent) working in DevOps, ML infrastructure, or platform engineering roles
  • Experience designing and implementing production-grade AI infrastructure
  • Deep understanding of the ML lifecycle: data pipelines, distributed training, model evaluation, and deployment
  • Strong proficiency with cloud platforms (AWS, GCP, or Azure) and infrastructure-as-code tools
  • Experience building CI/CD pipelines with tools like GitHub Actions, Jenkins, or similar
  • Comfortable with Python, Bash, and infrastructure scripting

Similar Jobs for Senior ML Infrastructure Engineer

8 matching positions

Senior ML Infrastructure / ML DevOps Engineer

We are looking for a Senior ML Infrastructure / DevOps Engineer who loves Linux,...
Salary: Not provided
Pathway
Expiration Date: Until further notice
Requirements
  • Former or current Linux / systems / network administrator comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing)
  • 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads
  • Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services
  • Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments
  • Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch
  • Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI)
  • Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations
  • Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents)
  • Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management
  • Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer
Job Responsibility
  • Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management)
  • Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster‑tooling) and configuration management
  • Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback
  • Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services
  • Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch)
  • Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges
  • Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems
  • Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break
What we offer
  • Intellectually stimulating work environment
  • Be a pioneer: you get to work with real-time data processing & AI
  • Work in one of the hottest AI startups, with exciting career prospects
  • Team members are distributed across the world
  • Responsibility and the ability to make a significant contribution to the company’s success
  • Inclusive workplace culture
  • Full-time

Senior ML Infrastructure Engineer, Inference Platform

About the Team: The ML Inference Platform is part of the AV ML Infrastructure or...
Location: United States, Austin, Texas; Mountain View, California; Sunnyvale, California
Salary: 155420.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • 5+ years of industry experience, with focus on machine learning systems or high performance backend services
  • Expertise in Python, C++, or other relevant programming languages
  • Expertise in ML inference and model serving frameworks (Triton, Ray Serve, vLLM, etc.)
  • Strong communication skills and a proven ability to drive cross-functional initiatives
  • Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities
Job Responsibility
  • Design and implement core platform backend software components
  • Collaborate with ML engineers and researchers to understand critical workflows, translate them into platform requirements, and deliver incremental value
  • Lead technical decision-making on model serving strategies, orchestration, caching, model versioning, and auto-scaling mechanisms for highly optimized use of accelerators
  • Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization of inference services
  • Proactively research and integrate state-of-the-art model serving frameworks, hardware accelerators, and distributed computing techniques
  • Lead technical initiatives across GM’s ML ecosystem
  • Raise the engineering bar through technical leadership, establishing best practices
  • Contribute to open source projects
  • Represent GM in relevant communities
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
  • Full-time

Senior ML Infrastructure Engineer - Embodied AI

At General Motors, our product teams are redefining mobility. Through a human-ce...
Location: United States, Sunnyvale
Salary: 153200.00 - 234100.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • 3+ years of experience working on large-scale distributed systems, applications, or ML infrastructure
  • Experience designing robust services or frameworks with durable, well-designed APIs
  • Solid understanding of machine learning workflows and hands-on experience applying ML systems in production environments
  • Experience building reliable, high-performance, and cost-efficient systems on modern cloud infrastructure
  • Practical experience across the ML development lifecycle, including model training, deployment, and MLOps practices
  • Strong cross-functional collaboration skills across teams and organizations
  • Strong coding skills in Python or C++
  • Interest in autonomous driving and large-scale ML systems
  • BS, MS, or PhD in Computer Science, Mathematics, or equivalent practical experience
Job Responsibility
  • Design, implement, and deploy scalable platforms and tools supporting machine learning training and evaluation workflows across GM
  • Drive complex technical projects with strong ownership of implementation, code quality, and system reliability
  • Contribute to technical design discussions and architectural decisions while collaborating with senior engineers and technical leads
  • Work closely with partner teams to ensure platforms meet real-world ML development needs and maximize adoption
  • Identify technical improvements and help prioritize platform investments to improve performance, reliability, and developer productivity
  • Contribute to a strong engineering culture through high-quality code reviews, documentation, and operational excellence
  • Support onboarding and mentoring of junior engineers and interns
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
  • Full-time

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location: United States, New York
Salary: 190800.00 - 286800.00 USD / Year
Plaid
Expiration Date: Until further notice
Requirements
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
  • Full-time

Senior Software Engineer, ML Infrastructure

LMArena is seeking a Senior Software Engineer (Infrastructure) to lead the desig...
Location: United States, Bay Area
Salary: Not provided
Arena Intelligence, Inc.
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in software engineering, with a focus on infrastructure or large-scale data and ML systems
  • Deep expertise in distributed systems, stream processing, and scalable backend architecture
  • Proven ability to design and operate low-latency, high-throughput, and fault-tolerant systems
  • Strong foundation in systems design, performance tuning, and building reliable, fault-tolerant services
  • Comfortable in a dynamic, high-ownership, fast-growth environment
Job Responsibility
  • Architect and scale high-performance, real-time API and data systems
  • Design and implement low-latency pipelines to process and analyze large-scale event streams
  • Ensure reliability through robust data integrity, availability, and consistency mechanisms
  • Mentor and guide engineers on infrastructure best practices, architecture, and performance tuning
  • Collaborate cross-functionally with AI researchers, product leaders, and engineers to anticipate evolving infrastructure needs and deliver resilient, extensible systems
What we offer
  • Comprehensive health and wellness benefits, including medical, dental, vision, and additional support programs.
  • The opportunity to work on cutting-edge AI with a small, mission-driven team
  • A culture that values transparency, trust, and community impact
  • Full-time

Senior Software Engineer - ML Infrastructure

We are seeking a Senior Software Engineer to design and build the infrastructure...
Location: United States, Boston
Salary: 152000.00 - 224000.00 USD / Year
SimpliSafe
Expiration Date: Until further notice
Requirements
  • 5+ years of experience building software systems and infrastructure
  • 3+ years of experience deploying and supporting production solutions on AWS
  • Experience building and operating production applications on Kubernetes
  • Experience with AWS data services such as Athena, Glue and Kinesis
  • Familiarity with AWS services such as Lambda, DynamoDB, and IAM
  • Expertise in containers, infrastructure automation, and CI/CD tooling
Job Responsibility
  • Design, build, and maintain software systems and infrastructure that support the end-to-end ML lifecycle
  • Support the development and operation of production-grade machine learning solutions
  • Develop and operate microservices in a public cloud environment (AWS, Azure, or GCP)
  • Collaborate cross-functionally with ML and platform teams to deliver scalable solutions
  • Provide technical guidance and mentorship to engineers
  • Promote and practice high engineering standards, including unit, integration, and mock testing
  • Contribute to cloud infrastructure automation, CI/CD pipelines, and containerized deployments
  • Take ownership of projects with a proactive, “can-do” mindset
What we offer
  • A mission- and values-driven culture and a safe, inclusive environment where you can build, grow and thrive
  • A comprehensive total rewards package that supports your wellness and provides security for SimpliSafers and their families
  • Free SimpliSafe system and professional monitoring for your home
  • Employee Resource Groups (ERGs) that bring people together, give opportunities to network, mentor and develop, and advocate for change
  • Full-time

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location: United States, San Francisco
Salary: 180000.00 - 270000.00 USD / Year
Plaid
Expiration Date: Until further notice
Requirements
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
What we offer
  • medical, dental, vision, and 401(k)
  • Full-time

Senior Machine Learning Engineer - ML Training Infrastructure

We are seeking an experienced, technically oriented, impact-driven expe...
Location: United States, Mountain View
Salary: 170000.00 - 240000.00 USD / Year
General Motors
Expiration Date: Until further notice
Requirements
  • Bachelor's degree or higher in Computer Science or an equivalent major, or equivalent relevant experience
  • 3+ years professional software engineering experience
  • 2+ years specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models
  • Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar
  • Experience with distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
  • Willingness to travel to Sunnyvale, CA as needed
  • Comfortable working in highly ambiguous and dynamic environments
Job Responsibility
  • Design and develop a scalable, reliable, high-performance ML framework to support model training at scale
  • Analyze and optimize model training performance to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce cost
  • Raise the bar on system observability, debuggability, operational excellence, and user experience
  • Collaborate with cross-functional teams to integrate new features and technologies into the platform
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
  • Full-time