Staff Machine Learning Engineer - ML Training Infrastructure

Location: United States, Austin
Employment contract
Salary: 185000.00 - 335300.00 USD / Year
Job Posted June 14, 2026

Job Description

The Role: We are seeking an experienced, technically strong, impact-driven expert in ML Training Infrastructure with a demonstrated ability to lead through hands-on technical work. In this role, you will be responsible for defining the technical direction and driving the design and development of scalable, reliable, and high-performance AI/ML platform infrastructure that enables advanced AI research and model development at scale. As a Staff ML Engineer, you will operate as a technical leader across initiatives, partnering closely with machine learning engineers, research scientists, and platform teams to shape architecture, drive major technical decisions, and deliver state-of-the-art AI infrastructure that enables the future of intelligent driving technologies across General Motors vehicles.

Job Responsibility

  • Define and drive the architecture, design, and development of scalable, reliable, and high-performance ML frameworks and platform capabilities to support model training at scale
  • Lead model training performance analysis and optimization efforts across distributed training workflows, improving scalability, efficiency, and cost across heterogeneous hardware environments
  • Raise the bar on system observability, debuggability, operational excellence, and developer experience across the ML training stack
  • Own large, ambiguous, cross-functional technical initiatives from strategy through execution, including technical roadmap definition, tradeoff analysis, and delivery
  • Influence platform direction by identifying long-term infrastructure investments, setting engineering standards, and driving adoption of best practices across teams
  • Collaborate across organizational boundaries to align requirements, resolve technical disagreements, and integrate new capabilities into the platform ecosystem
  • Mentor engineers through design reviews, technical guidance, and hands-on partnership, while elevating engineering quality across the team

Requirements

  • Bachelor's degree or higher in Computer Science or a related field, or equivalent practical experience
  • 8+ years of professional software engineering experience
  • 5+ years of specialized experience in AI/ML infrastructure, such as enabling distributed training for large-scale ML models
  • Strong programming skills in Python, with deep proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar ML systems
  • Proven experience designing and operating distributed systems for ML training, including distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
  • Demonstrated track record of leading technically ambiguous, cross-team infrastructure initiatives and driving them to measurable impact
  • Strong architectural judgment and ability to make sound technical tradeoffs across performance, reliability, usability, and cost
  • Willingness to travel to Sunnyvale, CA as needed
  • Comfortable operating in highly ambiguous and dynamic environments

Nice to have

  • Deep expertise in PyTorch 2.x+ and distributed training frameworks
  • Experience designing and developing training platforms that support FSDP, pipeline parallelism, and other scalable solutions for training large foundational models
  • Experience profiling, analyzing, debugging, and optimizing training and data loading performance at scale
  • Strong record of technical leadership through architecture reviews, roadmap influence, and cross-team execution
  • Excellent communication skills, with the ability to build consensus, navigate controversial decisions, communicate risks clearly, and provide constructive technical feedback
  • Self-motivated, execution-oriented, and motivated by delivering broad organizational impact

What we offer

  • Medical
  • Dental
  • Vision
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
  • Tuition assistance programs
  • Employee assistance program
  • GM vehicle discounts
  • Company vehicle evaluation program

Similar Jobs for Staff Machine Learning Engineer - ML Training Infrastructure

8 matching positions

Staff Machine Learning Engineer

Applied AI is a horizontal AI team at Uber partnering with product and platform ...
Location: India, Bangalore
Salary: Not provided
Company: Uber
Expiration Date: Until further notice
Requirements
  • 10+ years of industry experience in machine learning or software engineering, with a proven record of delivering ML solutions to production
  • Strong knowledge of machine learning, deep learning, and exposure to generative AI techniques (e.g., transformers, LLMs, diffusion)
  • Experience designing and scaling ML systems or platforms, including training pipelines, serving infrastructure, and model lifecycle tooling
  • Fluency in ML frameworks (e.g., PyTorch, TensorFlow, JAX) and development in Python and/or scalable backend languages (e.g., Java, Go)
  • Excellent collaboration and communication skills with the ability to work across teams and functions
Job Responsibility
  • Design and implement ML-driven systems that power core Uber experiences, with a focus on scalability, reliability, and performance
  • Lead the technical execution of key projects involving classical ML, deep learning, and generative AI technologies (e.g., LLMs, multimodal models)
  • Collaborate closely with product, data science, and infrastructure teams to develop AI solutions from ideation through production deployment
  • Contribute to and influence the technical direction for Applied AI, particularly around system design, model architecture, and infrastructure decisions
  • Champion engineering best practices in ML development — including experimentation workflows, model versioning, evaluation, monitoring, and responsible AI
  • Provide mentorship to engineers on the team and across partner orgs to help raise the technical bar
Employment type: Full-time

Senior Staff Machine Learning Engineer – Moonshot AI

The Moonshot AI team sits within Uber AI Solutions where we are building an ente...
Location: United States, Sunnyvale
Salary: 267000.00 - 297000.00 USD / Year
Company: Uber
Expiration Date: Until further notice
Requirements
  • 10+ years of industry experience developing and shipping production machine learning models
  • Ph.D., MS, or Bachelor's degree in Computer Science, Machine Learning, or a closely related discipline
  • Proven track record of technical leadership on large-scale ML initiatives with measurable business impact
  • Deep expertise across multiple areas: Computer Vision, Natural Language Processing, Deep Learning, and Generative AI
  • Strong proficiency with modern ML frameworks (PyTorch, TensorFlow, JAX) and programming languages
  • Extensive experience with distributed training infrastructure, large-scale model development, and ML platform design
  • Demonstrated ability to collaborate with product, engineering, and data science leadership on technical roadmaps and strategic priorities
  • Excellent problem-solving abilities with deep ML methodology expertise
Job Responsibility
  • Shape the technical vision and roadmap for Moonshot AI's ML initiatives
  • Architect foundational ML platforms and systems for marketplace optimization and annotation automation
  • Drive end-to-end ML solutions from conception through production deployment
  • Lead GenAI innovation: design and implement cutting-edge systems using custom SLMs, computer vision, and LLMs
  • Advance AI research capabilities: establish research direction, design benchmarks, contribute to research and publications
  • Build industry-leading evaluation frameworks: architect LLM-as-Judge systems and automated quality assessment platforms
  • Provide technical leadership across Uber AI Solutions
  • Mentor and develop engineering talent
  • Enable cross-functional impact
What we offer
  • Bonus program
  • Equity award & other types of comp
  • 401(k) plan
  • Various benefits
Employment type: Full-time

Staff ML Infrastructure Engineer - Embodied AI

At General Motors, our product teams are redefining mobility. Through a human-ce...
Location: United States, Sunnyvale
Salary: 189300.00 - 290700.00 USD / Year
Company: General Motors
Expiration Date: Until further notice
Requirements
  • 5+ years of experience building large-scale distributed systems, applications, or advanced ML systems
  • Proven track record of designing robust frameworks with high-quality, durable APIs
  • Deep understanding of machine learning algorithms with hands-on application
  • Expertise in building reliable, high-performance, and cost-efficient systems on modern cloud infrastructure
  • End-to-end experience across the ML development lifecycle, including MLOps practices
  • Strong cross-functional collaboration skills across teams and organizations
  • Exceptional coding skills in Python or C++
  • Strong interest in autonomous driving and its transformative potential
  • BS, MS, or PhD in Computer Science, Mathematics, or equivalent practical experience
Job Responsibility
  • Lead the design, implementation, and deployment of scalable platforms and tools that drive machine learning model training and evaluation workflows across GM
  • Own complex technical projects end-to-end, making key architectural decisions and technical trade-offs
  • Take a holistic view of projects, considering their impact across multiple teams, and across a longer timeline
  • Proactively drive technical prioritization
  • Collaborate closely with partner teams to ensure maximum benefit from the systems we build
  • Help shape our team through technical interviewing with high, well-calibrated standards, and play an essential role in recruiting
  • Mentor and onboard junior engineers and interns, helping them grow their careers
What we offer
  • Medical
  • Dental
  • Vision
  • Health Savings Account
  • Flexible Spending Accounts
  • Retirement savings plan
  • Sickness and accident benefits
  • Life insurance
  • Paid vacation & holidays
  • Tuition assistance programs
Employment type: Full-time

Staff Machine Learning Engineer

Tonal is looking for a Staff Machine Learning Engineer to help expand Tonal’s in...
Location: United States, San Francisco; Toronto
Salary: 200000.00 - 235000.00 USD / Year
Company: Tonal
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in software engineering or applied ML, or 5+ years with a Master’s degree, or 3+ years with a PhD
  • Strong coding skills in Python
  • Experience with frameworks such as PyTorch, TensorFlow, or JAX
  • Experienced in ML training, evaluation, and deployment workflows such as Sagemaker, MLFlow, Databricks, or similar
  • Deep understanding of time series modeling, human motion, or sensor-based learning from devices such as force transducers, position encoders, IMUs, or cameras
  • Familiar with MLOps best practices and scalable model training pipelines
  • Strong communicator who can collaborate with scientists, product managers, and engineers
  • Track record of delivering performant ML systems from prototype to production
Job Responsibility
  • Design, implement, and optimize machine learning training pipelines and model serving infrastructure for real-time applications
  • Develop algorithms and ML models that enable personalized training, adaptive coaching, and performance prediction
  • Fine-tune and evaluate transformer-based or self-supervised learning models using Tonal’s multimodal dataset
  • Build data-driven systems that measure training effectiveness, effort, and progression beyond traditional weight-based metrics
  • Prototype, train, and deploy ML models that run efficiently at scale or on-device
  • Collaborate cross-functionally with Exercise Science, Product, and Software teams to deliver intelligent features that improve the member experience
  • Contribute to the development of automated tools for experimentation, model validation, and continuous retraining
  • Write high-quality, maintainable Python code and work closely with backend engineers to integrate models into Tonal’s production systems
  • Mentor teammates and help shape Tonal’s growing AI and ML best practices
What we offer
  • Offers Equity
  • Health insurance
  • Retirement savings benefits
  • Life insurance
  • Disability benefits
  • Flexible paid time off
  • Parental leave
  • Other additional benefits (location dependent)
Employment type: Full-time

Staff Machine Learning Engineer

We are seeking a highly experienced and strategic Staff Machine Learning Enginee...
Location: India, Bengaluru
Salary: Not provided
Company: Bazaarvoice
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in Machine Learning Engineering, Applied Machine Learning, or a related field, with a proven track record of building and maintaining production models
  • Expert proficiency with the AWS ecosystem for MLOps, including a deep understanding of how to architect solutions using key services like Amazon SageMaker, S3, AWS Step Functions, AWS CloudFormation, Amazon CloudWatch, Amazon Managed Streaming for Apache Kafka (MSK), and Amazon Bedrock
  • Deep expertise in building and deploying scalable solutions for NLP, including experience with challenges such as sarcasm detection, polysemy, and managing multilingual data
  • Experience with a variety of ML algorithms and models, including traditional supervised and unsupervised learning, deep learning, and modern Generative AI techniques (e.g., LLMs, RAG, Prompt Engineering)
  • Proficiency with ML frameworks and libraries such as PyTorch, TensorFlow, and scikit-learn, with an ability to adapt and tune open-source or pre-trained models
  • A strong understanding of core software engineering principles, including design patterns, data structures, testing, security, and version control
  • Experience with continuous integration (CI/CD) and regression testing
  • The ability to translate complex business problems into viable technical solutions and communicate findings to stakeholders in non-technical terms
Job Responsibility
  • Lead the design, development, and deployment of complex, production-grade ML systems and data pipelines, particularly for Natural Language Processing (NLP) and Generative AI applications
  • Serve as a domain expert in the application of AI to solve core business challenges, including sentiment analysis, content moderation, product recommendations, and personalized search
  • Drive innovation by identifying and addressing high-impact technical challenges and long-standing technical debt within our ML and data infrastructure
  • Provide technical mentorship to other engineers on the team and beyond, raising the bar for engineering excellence, maintainability, and best practices across the organization
  • Collaborate closely with Data Scientists, Product Managers, and other engineering teams to translate complex business requirements into robust, data-driven ML solutions
  • Implement and oversee MLOps practices, including automated CI/CD pipelines, model monitoring, and governance, to ensure our systems are reliable, reproducible, and performant at scale
  • Implement robust observability frameworks to proactively detect and diagnose issues like model drift, data quality anomalies, and performance degradation in production

Staff Machine Learning Engineer - AI Platform

You will join our Data Department to support the development of Phantom Intellig...
Salary: Not provided
Company: PhantomBuster
Expiration Date: Until further notice
Requirements
  • 5+ years of experience as a Data Scientist or Machine Learning Engineer
  • Experience working with LLMs (e.g., prompt engineering, fine-tuning, retrieval-augmented generation)
  • Experience working with Agents for Amazon Bedrock AgentCore or similar agent setups
  • Strong understanding of machine learning algorithms, statistical methods, and data preprocessing techniques
  • Experience with cloud platforms for model training and deployment, especially AWS
  • Proficiency in Python, including experience with libraries such as LangChain, Scikit-Learn, NumPy, Pandas, and PyTorch/TensorFlow
  • Proficiency in SQL and experience working with data warehouses (e.g., Snowflake, GCP)
  • Knowledge of MLOps best practices, including CI/CD pipelines, model monitoring, and versioning (e.g., MLflow, Airflow)
  • Experience deploying models to production and supporting them post-deployment
  • Fluency in English
Job Responsibility
  • Define and evolve our infrastructure to allow for better ML and AI capabilities, with a focus on LLM-based and agentic systems
  • Contribute to the development and expansion of our agentic AI framework powered by AWS Bedrock, enabling both internal tools and customer-facing features
  • Identify, source, and refine datasets to allow tuning models, powering retrieval pipelines, or expanding agentic workflows
  • Pre-process data by using techniques such as data cleaning, feature engineering, and transformation
  • Train, evaluate, and deploy both LLM-based systems and traditional machine learning models into production
  • Monitor, debug, and continuously improve deployed models and AI tools
  • Support machine learning usage throughout the company, including selecting the right modeling approach for the use case (LLM vs. traditional ML)
  • Support the integration and use of LLMs, including approaches such as fine-tuning, prompt tuning, and retrieval-augmented generation (RAG), to improve accuracy
What we offer
  • Fully remote working environment
  • €40/month for remote work
  • Flexible working time
  • Home office budget up to €1500
  • 100% of an Alan Blue subscription (French-based contracts)
  • Lunch vouchers: €8 per worked day, 50% covered by The Phantom Company (French-based contracts)
  • Partnership with MokaCare
  • €70 a month benefit for entertainment expenses
  • Book Allowance and Sharing Program
Employment type: Full-time

Staff Machine Learning Engineer - Ads

The Ads Machine Learning team at Uber is responsible for designing, building, an...
Location: United States, New York; San Francisco
Salary: 30.00 USD / Hour
Company: Uber
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree or equivalent experience in Computer Science, Computer Engineering, Data Science, Machine Learning, Statistics, or a related quantitative field
  • Demonstrated ownership of designing, deploying, and evolving large-scale machine learning systems powering ads ranking, auction, or pricing in production environments
  • Strong proficiency in Python for building production ML systems and defining model, feature, and training abstractions used across teams
  • Deep understanding of SQL with experience driving production decision-making, data validation, and system-level analysis
  • Strong grasp of big data and distributed system architectures, with experience designing data platforms and ETL pipelines that support Ads ML workloads
  • Hands-on experience building and operating batch data pipelines using Spark or comparable distributed compute frameworks, with accountability for data quality and correctness
  • Proven expertise in experimentation and evaluation, including A/B testing and offline metrics for ads auctions, ranking quality, and marketplace outcomes
  • Experience defining and operationalizing model- and serving-level metrics, and building observability for reliable online ML inference systems
  • Experience owning or influencing online model serving, including latency-aware inference, scalability, and reliability considerations
  • Strong grounding in statistical methods, with the ability to reason about bias, uncertainty, and tradeoffs in ads and marketplace systems
Job Responsibility
  • Lead the design and evolution of machine learning models that power ads ranking, pricing, and auction systems at scale
  • Own end-to-end ML systems, including training pipelines, feature infrastructure, and low-latency online inference for real-time and batch use cases
  • Apply advanced statistical and ML techniques to improve ads relevance, marketplace efficiency, and advertiser outcomes
  • Define experimentation strategies, success metrics, and evaluation frameworks, and drive iteration through rigorous offline and online testing
  • Establish model and system observability through metrics, dashboards, and reliability best practices
  • Translate ambiguous product goals into durable ML architectures in close partnership with Product and Engineering
  • Provide technical leadership through mentorship, design reviews, and raising engineering standards across the Ads ML org
  • Stay current on advances in machine learning and ads auction systems, and drive adoption where they deliver clear impact
Employment type: Full-time

Staff Machine Learning Engineer

As a Staff Machine Learning Engineer, you will be a driving force behind our AI ...
Location: United Kingdom, London
Salary: Not provided
Company: EDITED
Expiration Date: Until further notice
Requirements
  • Strong proficiency in Python and frameworks like PyTorch, TensorFlow, or Scikit-learn, with a deep understanding of NLP, deep learning, or reinforcement learning
  • Hands-on experience with modern AI orchestration tools such as LangChain and LangSmith
  • Proven experience with Docker, Kubernetes, and cloud infrastructure (AWS/GCP/Azure), with a focus on scaling models in production
  • Expert-level SQL/NoSQL skills and the ability to design high-performance pipelines for massive datasets
  • A Master’s or PhD in Computer Science or a related field, or equivalent experience leading research-heavy engineering projects
Job Responsibility
  • Design, develop, and deploy robust ML systems and multi-model AI agents that solve real-world retail challenges
  • Lead the entire lifecycle, including prototyping, deployment, monitoring, and maintenance using modern CI/CD and containerisation practices
  • Build high-performance data pipelines (ETL/ELT) for both training and real-time inference, ensuring our systems are scalable and reliable
  • Act as a technical lead for the team, mentoring junior engineers, setting engineering best practices, and shaping our long-term technical roadmap
  • Partner with Product Managers and Data Scientists to translate business ambitions into sophisticated technical requirements
  • Build models to solve specific problems for our customers and internal teams
  • Prioritise delivering a working solution that solves a business challenge
  • Use data and user feedback to refine your technical approach as the problem becomes clearer
What we offer
  • Flexible working policy
  • Hybrid approach in our central London office
  • Enhanced parental leave policy
  • 25 days annual leave + public holidays (and an extra day for every year at EDITED)
  • Work from anywhere policy
  • Season Ticket Loan & Cycle to Work schemes
  • Health Cash App
  • Access to an Employee Assistance Programme
  • Gifts for work anniversaries and big life events
  • Dog friendly office
Employment type: Full-time