Senior ML Infrastructure Engineer Job at YC Work at a Startup (San Francisco)

Senior ML Infrastructure / ML DevOps Engineer

We are looking for a Senior ML Infrastructure / DevOps Engineer who loves Linux,...

Location

Salary:

Not provided

Pathway

Expiration Date

Until further notice

Requirements

Former or current Linux / systems / network administrator comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing)
5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads
Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services
Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments
Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch
Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI)
Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations
Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents)
Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management
Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer

Job Responsibility

Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management)
Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster tooling) and configuration management
Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback
Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services
Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch)
Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges
Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems
Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break

What we offer

Intellectually stimulating work environment
Be a pioneer: you get to work with real-time data processing and AI
Work in one of the hottest AI startups, with exciting career prospects
Team members are distributed across the world
Responsibility and the ability to make a significant contribution to the company’s success
Inclusive workplace culture

Fulltime

Senior ML Infrastructure Engineer, Inference Platform

About the Team: The ML Inference Platform is part of the AV ML Infrastructure or...

Location

United States, Austin, Texas; Mountain View, California; Sunnyvale, California

Salary:

155420.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

5+ years of industry experience, with focus on machine learning systems or high performance backend services
Expertise in Python, C++, or other relevant programming languages
Expertise in ML inference and model serving frameworks (Triton, Ray Serve, vLLM, etc.)
Strong communication skills and a proven ability to drive cross-functional initiatives
Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities

Job Responsibility

Design and implement core platform backend software components
Collaborate with ML engineers and researchers to understand critical workflows, translate them into platform requirements, and deliver incremental value
Lead technical decision-making on model serving strategies, orchestration, caching, model versioning, and auto-scaling mechanisms for highly optimized use of accelerators
Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization of inference services
Proactively research and integrate state-of-the-art model serving frameworks, hardware accelerators, and distributed computing techniques
Lead technical initiatives across GM’s ML ecosystem
Raise the engineering bar through technical leadership, establishing best practices
Contribute to open source projects and represent GM in relevant communities

What we offer

medical
dental
vision
Health Savings Account
Flexible Spending Accounts
retirement savings plan
sickness and accident benefits
life insurance
paid vacation & holidays
tuition assistance programs

Fulltime

Senior ML Infrastructure Engineer - Embodied AI

At General Motors, our product teams are redefining mobility. Through a human-ce...

Location

United States, Sunnyvale

Salary:

153200.00 - 234100.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

3+ years of experience working on large-scale distributed systems, applications, or ML infrastructure
Experience designing robust services or frameworks with durable, well-designed APIs
Solid understanding of machine learning workflows and hands-on experience applying ML systems in production environments
Experience building reliable, high-performance, and cost-efficient systems on modern cloud infrastructure
Practical experience across the ML development lifecycle, including model training, deployment, and MLOps practices
Strong cross-functional collaboration skills across teams and organizations
Strong coding skills in Python or C++
Interest in autonomous driving and large-scale ML systems
BS, MS, or PhD in Computer Science, Mathematics, or equivalent practical experience

Job Responsibility

Design, implement, and deploy scalable platforms and tools supporting machine learning training and evaluation workflows across GM
Drive complex technical projects with strong ownership of implementation, code quality, and system reliability
Contribute to technical design discussions and architectural decisions while collaborating with senior engineers and technical leads
Work closely with partner teams to ensure platforms meet real-world ML development needs and maximize adoption
Identify technical improvements and help prioritize platform investments to improve performance, reliability, and developer productivity
Contribute to a strong engineering culture through high-quality code reviews, documentation, and operational excellence
Support onboarding and mentoring of junior engineers and interns

What we offer

medical
dental
vision
Health Savings Account
Flexible Spending Accounts
retirement savings plan
sickness and accident benefits
life insurance
paid vacation & holidays
tuition assistance programs

Fulltime

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...

Location

United States, New York

Salary:

190800.00 - 286800.00 USD / Year

Plaid

Expiration Date

Until further notice

Requirements

5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
Proven experience delivering reliable and scalable infrastructure in production
Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
Strong communication skills and ability to collaborate across teams

Job Responsibility

Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
Contribute to technical strategy and architecture discussions within the team
Mentor and support other engineers through code reviews, design discussions, and technical guidance

Fulltime

Senior Software Engineer, ML Infrastructure

LMArena is seeking a Senior Software Engineer (Infrastructure) to lead the desig...

Location

United States, Bay Area

Salary:

Not provided

Arena Intelligence, Inc.

Expiration Date

Until further notice

Requirements

5+ years of experience in software engineering, with a focus on infrastructure or large-scale data and ML systems
Deep expertise in distributed systems, stream processing, and scalable backend architecture
Proven ability to design and operate low-latency, high-throughput, and fault-tolerant systems
Strong foundation in systems design, performance tuning, and building reliable, fault-tolerant services
Comfortable in a dynamic, high-ownership, fast-growth environment

Job Responsibility

Architect and scale high-performance, real-time API and data systems
Design and implement low-latency pipelines to process and analyze large-scale event streams
Ensure reliability through robust data integrity, availability, and consistency mechanisms
Mentor and guide engineers on infrastructure best practices, architecture, and performance tuning
Collaborate cross-functionally with AI researchers, product leaders, and engineers to anticipate evolving infrastructure needs and deliver resilient, extensible systems

What we offer

Comprehensive health and wellness benefits, including medical, dental, vision, and additional support programs.
The opportunity to work on cutting-edge AI with a small, mission-driven team
A culture that values transparency, trust, and community impact

Fulltime

Senior Software Engineer - ML Infrastructure

We are seeking a Senior Software Engineer to design and build the infrastructure...

Location

United States, Boston

Salary:

152000.00 - 224000.00 USD / Year

SimpliSafe

Expiration Date

Until further notice

Requirements

5+ years of experience building software systems and infrastructure
3+ years of experience deploying and supporting production solutions on AWS
Experience building and operating production applications on Kubernetes
Experience with AWS data services such as Athena, Glue and Kinesis
Familiarity with AWS services such as Lambda, DynamoDB, and IAM
Expertise in containers, infrastructure automation, and CI/CD tooling

Job Responsibility

Design, build, and maintain software systems and infrastructure that support the end-to-end ML lifecycle
Support the development and operation of production-grade machine learning solutions
Develop and operate microservices in a public cloud environment (AWS, Azure, or GCP)
Collaborate cross-functionally with ML and platform teams to deliver scalable solutions
Provide technical guidance and mentorship to engineers
Promote and practice high engineering standards, including unit, integration, and mock testing
Contribute to cloud infrastructure automation, CI/CD pipelines, and containerized deployments
Take ownership of projects with a proactive, “can-do” mindset

What we offer

A mission- and values-driven culture and a safe, inclusive environment where you can build, grow and thrive
A comprehensive total rewards package that supports your wellness and provides security for SimpliSafers and their families
Free SimpliSafe system and professional monitoring for your home
Employee Resource Groups (ERGs) that bring people together, give opportunities to network, mentor and develop, and advocate for change

Fulltime

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...

Location

United States, San Francisco

Salary:

180000.00 - 270000.00 USD / Year

Plaid

Expiration Date

Until further notice

Requirements

5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
Proven experience delivering reliable and scalable infrastructure in production
Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
Strong communication skills and ability to collaborate across teams

Job Responsibility

Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
Contribute to technical strategy and architecture discussions within the team
Mentor and support other engineers through code reviews, design discussions, and technical guidance

What we offer

medical, dental, vision, and 401(k)

Fulltime

Senior Machine Learning Engineer - ML Training Infrastructure

We are seeking an experienced, technically oriented, impact-driven expe...

Location

United States, Mountain View

Salary:

170000.00 - 240000.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

Bachelor's degree or higher in Computer Science or an equivalent major, or equivalent relevant experience
3+ years professional software engineering experience
2+ years specialized experience in AI/ML infrastructure, e.g., enabling distributed training for scaling large ML models
Strong programming skills in Python, with proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar
Experience with distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
Willingness to travel to Sunnyvale, CA as needed
Comfortable working in highly ambiguous and dynamic environments

Job Responsibility

Design and develop a scalable, reliable, high-performance ML framework to support model training at scale
Analyze and optimize model training performance to scale distributed training workflows, maximize resource utilization across heterogeneous hardware environments, and reduce cost
Raise the bar on system observability, debuggability, operational excellence, and user experience
Collaborate with cross-functional teams to integrate new features and technologies into the platform

What we offer

medical
dental
vision
Health Savings Account
Flexible Spending Accounts
retirement savings plan
sickness and accident benefits
life insurance
paid vacation & holidays
tuition assistance programs

Fulltime
