Senior ML Infrastructure Engineer, Inference Platform

United States, Austin, Texas · 155420.00 USD / Year · Job Posted March 03, 2026

Job Description

About the Team: The ML Inference Platform is part of the AV ML Infrastructure organization. Our team owns the cloud-agnostic, reliable, and cost-efficient platform that powers GM’s AI efforts. We’re proud to serve teams developing autonomous vehicles (L3/L4/L5), as well as other groups building AI-driven products for GM and its customers. We enable rapid innovation and feature development by optimizing for high-priority, ML-centric use cases. Our platform supports the serving of state-of-the-art (SOTA) machine learning models for experimental, online, and bulk inference, with a focus on performance, availability, concurrency, and scalability. We’re committed to maximizing GPU utilization across platforms (B200, H100, A100, and more) while maintaining reliability and cost efficiency.

About the Role: We are seeking a Senior ML Infrastructure Engineer to help build and scale robust platforms for ML inference workflows. In this role, you’ll work closely with ML engineers and researchers to ensure efficient model serving and inference in production, for workflows such as data mining, labeling, model distillation, evaluations, simulations, and more. This is a high-impact opportunity to influence the future of AI infrastructure at GM. You will play a key role in shaping the architecture, roadmap, and user experience of a robust ML inference service supporting real-time, batch, and experimental inference needs. The ideal candidate brings experience in designing distributed systems for ML, strong problem-solving skills, and a product mindset focused on platform usability and reliability.

Job Responsibility

  • Design and implement core platform backend software components
  • Collaborate with ML engineers and researchers to understand critical workflows, translate them into platform requirements, and deliver incremental value
  • Lead technical decision-making on model serving strategies, orchestration, caching, model versioning, and auto-scaling mechanisms for highly optimized use of accelerators
  • Drive the development of monitoring, observability, and metrics to ensure reliability, performance, and resource optimization of inference services
  • Proactively research and integrate state-of-the-art model serving frameworks, hardware accelerators, and distributed computing techniques
  • Lead technical initiatives across GM’s ML ecosystem
  • Raise the engineering bar through technical leadership and by establishing best practices
  • Contribute to open source projects and represent GM in relevant communities

Requirements

  • 5+ years of industry experience, with focus on machine learning systems or high performance backend services
  • Expertise in Python, C++, or other relevant programming languages
  • Expertise in ML inference and model serving frameworks (Triton, Ray Serve, vLLM, etc.)
  • Strong communication skills and a proven ability to drive cross-functional initiatives
  • Ability to thrive in a dynamic, multi-tasking environment with ever-evolving priorities

Nice to have

  • Deep expertise building zero-to-one ML infrastructure platforms
  • Experience working with or designing interfaces, APIs, and clients for ML workflows
  • Experience with the Ray framework and/or vLLM
  • Experience with distributed systems, and handling large-scale data processing
  • Familiarity with telemetry, and other feedback loops to inform product improvements
  • Familiarity with hardware acceleration (GPUs) and optimizations for inference workloads

What we offer

  • Medical
  • Dental
  • Vision
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
  • Tuition assistance programs
  • Employee assistance program
  • GM vehicle discounts
  • Relocation benefits


Similar Jobs for Senior ML Infrastructure Engineer, Inference Platform

8 matching positions

Senior ML Inference Engineer - Platform

The Model Deployment & Inference Solutions team in GM AV deploys machine learnin...
Location
United States, Austin; Mountain View
Salary
128700.00 - 261300.00 USD / Year
General Motors
Expiration Date
Until further notice
Requirements
  • BS, MS, or PhD in Computer Science or a related technical field
  • 3+ years of relevant industry experience
  • Strong fundamentals and excellent coding ability in Python
  • Experience building or operating production platform or infrastructure systems where reliability, observability, and extensibility matter
  • Experience with ML model deployment, inference integration, model optimization workflows, or model serving infrastructure, with at least one prior context where you owned the path from a trained model to a running inference workload
  • Experience using coding agents (Cursor, Claude Code, GitHub Copilot, or equivalent) as part of your engineering workflow
  • Experience designing clean, well-tested software with clear interfaces and good abstractions
  • Strong cross-team collaboration skills
Job Responsibility
  • Design, build, and operate the ML deployment platform that automates the path from trained model to on-vehicle inference
  • Drive cross-organization model deployments to the autonomous vehicle stack, partnering with model development teams to take high-value models from training to production on-vehicle
  • Build agentic tools that diagnose and fix deployment-blocking issues, automating workflows currently performed manually by engineers
  • Build the developer experience that ML model development teams use day to day: tooling, dashboards, automation, and observability
  • Drive shift-left validation that surfaces deployment risk (compile, runtime, parity, latency) early in the model development cycle
  • Build platform tools that integrate the work of our sister teams (kernels, compiler, reduced precision and parity) so their optimization wins land directly in the deployment workflow
  • Partner with the team's Performance pillar and model development teams across the AV organization
What we offer
  • Medical
  • Dental
  • Vision
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
  • Tuition assistance programs
  • Fulltime

Senior ML Platform Engineer, AI Platform

We are seeking a skilled and passionate ML Platform Engineer to join our team an...
Location
Singapore, Singapore
Salary
Not provided
Airwallex
Expiration Date
Until further notice
Requirements
  • 5+ years in backend software development
  • at least 2 years focused on AI/ML Platform or MLOps infrastructure
  • deep expertise in MLOps practices, including automated deployment pipelines, model optimization, and production lifecycle management
  • proven experience designing and implementing low-latency model serving solutions
  • proficiency in Python
  • skill in writing high-quality, maintainable code
  • experience in design and development of large-scale distributed, high concurrency, low-latency inference, high availability systems
  • excellent communication and mentoring abilities
  • a relevant degree in Computer Science, Mathematics or related fields
Job Responsibility
  • Platform Development: Design, build, and maintain the end-to-end MLOps platform using Kubernetes and Cloud Services
  • Infrastructure as Code (IaC): Use Terraform or similar tools to manage, provision, and scale all ML-related infrastructure securely and efficiently
  • Pipeline Automation: Implement and optimize CI/CD/CT (Continuous Integration, Delivery, Training) pipelines to automate model training, testing, packaging, and deployment using tools like Argo and Kubeflow Pipelines
  • Serving Infrastructure: Build highly available, low-latency, and high-throughput model serving infrastructure
  • Observability: Implement robust monitoring, alerting, and logging solutions to track infrastructure health, model performance, and data/model drift
  • Tooling & Support: Evaluate, integrate, and support ML tools such as Feature Stores and distributed model training pipelines
  • Security & Compliance: Ensure platform security, implement RBAC (Role-Based Access Control), and manage secrets for sensitive data and production environments
  • Collaboration: Work closely with Data Scientists and ML Engineers to understand their needs and provide technical guidance on best practices for scaling their models
  • Fulltime

Senior ML Platform Engineer

At WHOOP, we're on a mission to unlock human performance and healthspan. WHOOP e...
Location
United States, Boston
Salary
150000.00 - 210000.00 USD / Year
Whoop
Expiration Date
Until further notice
Requirements
  • Bachelor’s or Master’s Degree in Computer Science, Engineering, or a related field, or equivalent practical experience
  • 5+ years of experience in software engineering with a focus on ML infrastructure, cloud platforms, or MLOps
  • Strong programming skills in Python, with experience in building distributed systems and REST/gRPC APIs
  • Deep knowledge of cloud-native services and infrastructure-as-code (e.g., AWS CDK, Terraform, CloudFormation)
  • Hands-on experience with model deployment platforms such as AWS SageMaker, Vertex AI, or Kubernetes-based serving stacks
  • Proficiency in ML lifecycle tools (MLflow, Weights & Biases, BentoML) and containerization strategies (Docker, Kubernetes)
  • Understanding of data engineering and ingestion pipelines, with ability to interface with data lakes, feature stores, and streaming systems
  • Proven ability to work cross-functionally with Data Science, Data Platform, and Software Engineering teams, influencing decisions and driving alignment
  • Passion for AI and automation to solve real-world problems and improve operational workflows
Job Responsibility
  • Architect, build, own, and operate scalable ML infrastructure in cloud environments (e.g., AWS), optimizing for speed, observability, cost, and reproducibility
  • Create, support, and maintain core MLOps infrastructure (e.g., MLflow, feature store, experiment tracking, model registry), ensuring reliability, scalability, and long-term sustainability
  • Develop, evolve, and operate MLOps platforms and frameworks that standardize model deployment, versioning, drift detection, and lifecycle management at scale
  • Implement and continuously maintain end-to-end CI/CD pipelines for ML models using orchestration tools (e.g., Prefect, Airflow, Argo Workflows), ensuring robust testing, reproducibility, and traceability
  • Partner closely with Data Science, Sensor Intelligence, and Data Platform teams to operationalize and support model development, deployment, and monitoring workflows
  • Build, manage, and maintain both real-time and batch inference infrastructure, supporting diverse use cases from physiological analytics to personalized feedback loops for WHOOP members
  • Design, implement, and own automated observability tooling (e.g., for model latency, data drift, accuracy degradation), integrating metrics, logging, and alerting with existing platforms
  • Leverage AI-powered tools and automation to reduce operational overhead, enhance developer productivity, and accelerate model release cycles
  • Contribute to and maintain internal platform documentation, SDKs, and training materials, enabling self-service capabilities for model deployment and experimentation
  • Continuously evaluate and integrate emerging technologies and deployment strategies, influencing WHOOP’s roadmap for AI-driven platform efficiency, reliability, and scale
What we offer
  • equity
  • benefits
  • Fulltime

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location
United States, New York
Salary
190800.00 - 286800.00 USD / Year
Plaid
Expiration Date
Until further notice
Requirements
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
  • Fulltime

Senior Software Engineer, ML Platform

We’re looking for a software engineer to join Parafin’s Infrastructure team and ...
Location
United States, San Francisco
Salary
230000.00 - 265000.00 USD / Year
Parafin
Expiration Date
Until further notice
Requirements
  • 5+ years of software engineering experience, including experience on ML platform/MLOps systems (training, deployment, and/or feature pipelines)
  • Strong Python; solid software design and testing fundamentals
  • Proficiency with SQL; hands-on Spark/PySpark experience
  • Knowledge of ML fundamentals—probability & statistics, supervised vs. unsupervised learning, bias/variance & regularization, feature engineering, model evaluation metrics, validation strategies, and production concerns like drift, stability, and monitoring
  • Expertise with modern data/ML stacks—AWS, Databricks (workflows, lakehouse, MLflow/registry, Model Serving), and Airflow (or equivalent orchestration)
  • Experience building real-time systems (service design, caching, rate limiting, backpressure) and batch pipelines at scale
  • Practical knowledge of feature-store concepts (offline/online stores, backfills, point-in-time correctness), model registries, experiment tracking, and evaluation frameworks
  • Strong problem-solving skills and a proactive attitude toward ownership and platform health
Job Responsibility
  • Turn notebooks into software: decompose data scientist training/inference notebooks into reusable, tested components (libraries, pipelines, templates) with clear interfaces and documentation
  • Create developer-friendly ML abstractions: build SDKs, CLIs, and templates that make it simple to define features, train/evaluate models, and deploy to batch or real-time targets with minimal boilerplate
  • Build our real-time ML inference platform: stand up and scale low-latency model serving
  • Expand batch ML inference: improve scheduling, parallelism, cost controls, observability, and failure/rollback for large-scale batch scoring and post-processing
  • Own and expand the feature store: design offline/online feature definitions, high read/write throughput, and consistent offline/online semantics
What we offer
  • Equity grant
  • Medical, dental & vision insurance
  • Work from home flexibility
  • Unlimited PTO
  • Commuter benefits
  • Free lunches
  • Paid parental leave
  • 401(k)
  • Employee assistance program
  • Fulltime

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location
United States, San Francisco
Salary
180000.00 - 270000.00 USD / Year
Plaid
Expiration Date
Until further notice
Requirements
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
What we offer
  • medical, dental, vision, and 401(k)
  • Fulltime

Senior ML Engineer

We are seeking an experienced Senior ML to join our team and engage in a diverse...
Location
United Kingdom, Bath
Salary
Not provided
BMT
Expiration Date
Until further notice
Requirements
  • Be a UK sole national
  • Have held no other nationality at any time
  • Have continuously resided in the United Kingdom for the past five years
  • Be able to obtain and maintain full UK security clearance in accordance with government vetting standards
  • Provide satisfactory evidence of identity, nationality, and residency as part of the clearance process
  • Capability to design and implement end‑to‑end ML pipelines
  • Ability to select, train, and tune models (classical ML and deep learning) using frameworks such as PyTorch, TensorFlow, or scikit‑learn
  • Experience containerising and deploying models (e.g., Docker) and implementing CI/CD, monitoring, drift detection, and automated retraining on Azure/AWS/GCP as appropriate
  • Demonstrated capability to work with data engineers to ensure high‑quality datasets, versioning, lineage, and governance
  • Capable of pairing with data scientists and software engineers, reviewing code, and sharing best practices
Job Responsibility
  • Design, build, and deploy machine‑learning systems, applying robust software engineering practices and an in‑depth understanding of model behaviour, performance, and limitations
  • Select, prepare, and pipeline data for model training and inference. Implement, train, evaluate, and optimise machine‑learning models, continually improving them through iterative experimentation and additional data
  • Create scalable and automated ML pipelines, including feature extraction, model training, validation, packaging, deployment, and monitoring
  • Design and implement dashboards, diagnostics, and evaluation tooling to ensure transparency, performance tracking, and operational reliability across the ML lifecycle
  • Within defined delivery goals, refine prototype models into production‑ready components, contributing to development, optimisation, demonstration, and integration activities
  • Apply standardised engineering and evaluation methods, producing clear technical documentation and communicating design choices, performance outcomes, and limitations
  • Contribute to internal knowledge bases and participate in professional ML engineering communities
  • Ensure responsible handling of data throughout the ML lifecycle, including secure storage, access control, data lineage, versioning, and quality checks
  • Evaluate data integrity and suitability for ML workflows, and advise on transformations, feature representation, and schemas needed for efficient training and inference
  • Implement metadata standards, reproducible data pipelines, and automated validation procedures to maintain trustworthy data assets
What we offer
  • Private Medical (family coverage)
  • Enhanced Pension
  • 18 weeks enhanced maternity pay (after a qualifying period of 1 year)
  • Family friendly policies
  • Committed to an inclusive culture
  • Wellbeing Fund – an annual fund for personal hobbies or interests
  • 26 Days Annual Leave (plus bank holidays)
  • Holiday Trading
  • Retail Vouchers
  • Professional Subscriptions
  • Fulltime

Senior AI / ML Engineer

We are seeking an experienced Senior ML to join our team and engage in a diverse...
Location
United Kingdom, London
Salary
Not provided
BMT
Expiration Date
Until further notice
Requirements
  • Be a UK sole national
  • Have held no other nationality at any time
  • Have continuously resided in the United Kingdom for the past five years
  • Be able to obtain and maintain full UK security clearance in accordance with government vetting standards
  • Provide satisfactory evidence of identity, nationality, and residency as part of the clearance process
  • Ability to select, train, and tune models (classical ML and deep learning) using frameworks such as PyTorch, TensorFlow, or scikit-learn, and to perform robust validation and error analysis
  • Experience containerising and deploying models (e.g., Docker) and implementing CI/CD, monitoring, drift detection, and automated retraining on Azure/AWS/GCP as appropriate
  • Strong engineering skills in Python (typing, testing, packaging), with experience in version control (Git) and code review workflows
Job Responsibility
  • Designing, building, testing, and deploying machine-learning systems, applying robust software engineering practices and an in-depth understanding of model behaviour, performance, and limitations
  • Selecting and preparing data pipelines for model training and inference
  • Implementing, training, evaluating, and optimising machine-learning models, continually improving them through iterative experimentation and additional data
  • Creating scalable and automated ML pipelines, including feature extraction, model training, validation, packaging, deployment, and monitoring
  • Applying standardised engineering and evaluation methods, producing clear technical documentation and communicating design choices, performance outcomes, and limitations
  • Evaluating data integrity and suitability for ML workflows, and advising on transformations, feature representation, and schemas needed for efficient training and inference
  • Applying engineering-focused data modelling and system design techniques to create, modify, or maintain ML-relevant data structures, feature stores, and associated components
  • Supporting alignment of data structures, model interfaces, and infrastructure components to ensure efficient and scalable ML system operation
What we offer
  • Private Medical (family coverage)
  • Enhanced Pension
  • 18 weeks enhanced maternity pay (after a qualifying period of 1 year)
  • Family friendly policies
  • Committed to an inclusive culture
  • Wellbeing Fund – an annual fund for personal hobbies or interests
  • 26 Days Annual Leave (plus bank holidays)
  • Holiday Trading
  • Retail Vouchers
  • Professional Subscriptions
  • Fulltime