Senior AI Infrastructure Engineer - Training Platform

United States, San Francisco · Employment contract · 216000.00 - 270000.00 USD / Year · Job Posted May 04, 2026

Job Description

As a Software Engineer on the Machine Learning Infrastructure team, you will build the "Operating System" for our large-scale GPU clusters. You will architect a high-performance training platform that handles the immense complexity of multi-thousand GPU workloads, ensuring every cycle is used efficiently. Your work directly determines the velocity at which our researchers can train and iterate on the world's most advanced models.

Job Responsibility

  • Architect and scale a multi-tenant orchestration layer that abstracts away the complexity of GPU clusters, ensuring high utilization and seamless job recovery
  • Design and implement scheduling primitives to optimize the lifecycle of training jobs
  • Develop deep observability and automated health-checking into the training stack to proactively identify and isolate hardware failures
  • Evaluate and integrate emerging technologies in the CNCF and AI ecosystem (e.g. Ray, Kueue), making data-driven build vs. buy decisions that balance velocity with long-term maintainability
  • Work closely with Finance and Procurement teams to drive our capacity planning process
  • Participate in our team's on call process to ensure the availability of our services
  • Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative and cross-functional environment

Requirements

  • 5+ years of experience in backend or infrastructure engineering, with at least 2 years focused on orchestrating ML workloads at scale (100+ GPU nodes)
  • Strong programming skills in one or more languages (e.g. Python, Go, Rust, C++)
  • Experience with complex compute management systems that cover queueing, quotas, preemption, and gang scheduling
  • Experience with distributed training infrastructure, such as EFA, InfiniBand, and topology-aware scheduling
  • Experience with distributed storage systems (e.g. Lustre, S3) as they relate to training throughput
  • Expert-level knowledge of Kubernetes internals (Custom Resources, Operators, Admission Controllers) and how they interact with device plugins for specialized hardware
  • Familiarity with cloud infrastructure (AWS, GCP) and infrastructure as code (e.g., Terraform)
  • Proven ability to solve complex problems and work independently in fast-moving environments

Nice to have

  • Experience with distributed training techniques such as DeepSpeed, FSDP, etc.
  • Experience with the NVIDIA software and hardware stack (CUDA, NCCL)
  • Experience with PyTorch
  • Familiarity with post-training algorithms such as GRPO, and with Reinforcement Learning

What we offer

  • Comprehensive health, dental and vision coverage
  • Retirement benefits
  • A learning and development stipend
  • Generous PTO
  • Commuter stipend (may be eligible)

Similar Jobs for Senior AI Infrastructure Engineer - Training Platform

8 matching positions

Senior ML Platform Engineer, AI Platform

We are seeking a skilled and passionate ML Platform Engineer to join our team an...
Location: Singapore, Singapore
Salary: Not provided
Company: Airwallex
Expiration Date: Until further notice
Requirements
  • 5+ years in backend software development
  • at least 2 years focused on AI/ML platform or MLOps infrastructure
  • deep expertise in MLOps practices, including automated deployment pipelines, model optimization, and production lifecycle management
  • proven experience designing and implementing low-latency model serving solutions
  • proficiency in Python
  • skill in writing high-quality, maintainable code
  • experience in design and development of large-scale distributed, high concurrency, low-latency inference, high availability systems
  • excellent communication and mentoring abilities
  • a relevant degree in Computer Science, Mathematics or related fields
Job Responsibility
  • Platform Development: Design, build, and maintain the end-to-end MLOps platform using Kubernetes and Cloud Services
  • Infrastructure as Code (IaC): Use Terraform or similar tools to manage, provision, and scale all ML-related infrastructure securely and efficiently
  • Pipeline Automation: Implement and optimize CI/CD/CT (Continuous Integration, Delivery, Training) pipelines to automate model training, testing, packaging, and deployment using tools like Argo and Kubeflow Pipelines
  • Serving Infrastructure: Build highly available, low-latency, and high-throughput model serving infrastructure
  • Observability: Implement robust monitoring, alerting, and logging solutions to track infrastructure health, model performance, and data/model drift
  • Tooling & Support: Evaluate, integrate, and support ML tools such as Feature Stores and distributed model training pipelines
  • Security & Compliance: Ensure platform security, implement RBAC (Role-Based Access Control), and manage secrets for sensitive data and production environments
  • Collaboration: Work closely with Data Scientists and ML Engineers to understand their needs and provide technical guidance on best practices for scaling their models
Employment type: Full-time

Senior Software Engineer, Managed AI - AI Platform

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'...
Location: United States, San Francisco, CA; Sunnyvale, CA
Salary: 172425.00 - 209000.00 USD / Year
Company: Crusoe
Expiration Date: Until further notice
Requirements
  • Advanced degree in Computer Science/Engineering
  • 4-5+ years of industry experience with demonstrated history of consistent success leading a varied portfolio of initiatives across your function
  • Experience with distributed systems, cloud services (compute, storage, networking, database), and delivering early-stage projects quickly
  • Experience with Generative AI (LLMs, Multimodal) and familiar with AI infrastructure (training, inference, ETL pipelines)
  • Proficient with container orchestration (e.g., Kubernetes), microservices, REST APIs, gRPC, and the full software development lifecycle including CI/CD
Job Responsibility
  • Lead the design and implementation of core AI services, including resilient, fault-tolerant queues for efficient task distribution; model catalogs for managing and versioning AI models; and scheduling mechanisms optimized for cost and performance
  • Architect and scale infrastructure to handle millions of API requests per second
  • Implement robust monitoring and alerting to ensure system health and 24/7 availability
  • Collaborate closely with product management, business strategy, and other engineering teams to define the AI platform roadmap
  • Influence the long-term vision and architectural decisions of the platform
  • Contribute to open-source AI frameworks and actively participate in the AI community
  • Prototype and rapidly iterate on emerging technologies and new features
What we offer
  • Restricted Stock Units
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment type: Full-time

Senior ML Engineer - AI Platform & Agents

We are building agentic AI into the core of our product and need someone who can...
Location: France, Bordeaux
Salary: Not provided
Company: PhantomBuster
Expiration Date: Until further notice
Requirements
  • 5+ years of experience as an ML Engineer, AI Engineer, or Software Engineer with a strong AI focus
  • Hands-on experience building AI agents using frameworks such as LangChain, Amazon Bedrock AgentCore, or similar
  • Strong understanding of LLM-based systems: prompt engineering, agent orchestration, tool use, and multi-agent workflows
  • Familiarity with MCP (Model Context Protocol) and experience integrating agents with external APIs or data sources
  • Experience working with Agents for Amazon Bedrock AgentCore or similar agent setups
  • Strong understanding of machine learning algorithms, statistical methods, and data preprocessing techniques
  • Experience with cloud platforms for model training and deployment, especially AWS
  • Proficiency in Python, including LangChain, and standard data libraries (Pandas, NumPy, etc.)
  • Fluency in English
Job Responsibility
  • Define and evolve our infrastructure to allow for better ML and AI capabilities, with a focus on LLM-based and agentic systems
  • Contribute to the development and expansion of our agentic AI framework powered by AWS Bedrock, enabling both internal tools and customer-facing features
  • Identify, source, and refine datasets to allow tuning models, powering retrieval pipelines, or expanding agentic workflows
  • Pre-process data by using techniques such as data cleaning, feature engineering, and transformation
  • Train, evaluate, and deploy both LLM-based systems and traditional machine learning models into production
  • Monitor, debug, and continuously improve deployed models and AI tools
  • Support machine learning usage throughout the company, including selecting the right modeling approach for the use case (LLM vs. traditional ML)
  • Support the integration and use of LLMs, including approaches such as fine-tuning, prompt tuning, and retrieval-augmented generation (RAG), to improve accuracy
What we offer
  • International team
  • Fun team building events
  • €40/month for remote work
  • Flexible working time
  • Home office budget up to €1500
  • 100% of an Alan Blue subscription
  • Lunch vouchers: €8 per worked day (50% covered by The Phantom Company)
  • Partnership with MokaCare
  • €70 a month benefit for entertainment expenses
  • Book Allowance and Sharing Program

Senior ML Engineer - AI Platform & Agents

Join PhantomBuster as a Senior ML Engineer to build agentic AI with AWS Bedrock,...
Location: France; Spain; Portugal
Salary: Not provided
Company: PhantomBuster
Expiration Date: Until further notice
Requirements
  • 5+ years of experience as an ML Engineer, AI Engineer, or Software Engineer with a strong AI focus
  • Hands-on experience building AI agents using frameworks such as LangChain, Amazon Bedrock AgentCore, or similar
  • Strong understanding of LLM-based systems: prompt engineering, agent orchestration, tool use, and multi-agent workflows
  • Familiarity with MCP (Model Context Protocol) and experience integrating agents with external APIs or data sources
  • Experience working with Agents for Amazon Bedrock AgentCore or similar agent setups
  • Strong understanding of machine learning algorithms, statistical methods, and data preprocessing techniques
  • Experience with cloud platforms for model training and deployment, especially AWS
  • Proficiency in Python, including LangChain, and standard data libraries (Pandas, NumPy, etc.)
  • Fluency in English
Job Responsibility
  • Define and evolve our infrastructure to allow for better ML and AI capabilities, with a focus on LLM-based and agentic systems
  • Contribute to the development and expansion of our agentic AI framework powered by AWS Bedrock, enabling both internal tools and customer-facing features
  • Identify, source, and refine datasets to allow tuning models, powering retrieval pipelines, or expanding agentic workflows
  • Pre-process data by using techniques such as data cleaning, feature engineering, and transformation
  • Train, evaluate, and deploy both LLM-based systems and traditional machine learning models into production
  • Monitor, debug, and continuously improve deployed models and AI tools
  • Support machine learning usage throughout the company, including selecting the right modeling approach for the use case (LLM vs. traditional ML)
  • Support the integration and use of LLMs, including approaches such as fine-tuning, prompt tuning, and retrieval-augmented generation (RAG), to improve accuracy
What we offer
  • Fully remote working environment (France, Spain, or Portugal)
  • Real ownership: you will define how agentic AI is built at PhantomBuster, not follow someone else's decisions
  • Freedom to research and adopt new technologies as the space evolves & to make an impact at a small, self-funded, and profitable tech startup by laying the foundation for machine learning and AI
  • Collaborative and open-minded culture based on rationality, humility, honesty, and long-term thinking
  • International team
  • Fun team building events
  • €40/month for remote work
  • Flexible working time
  • Home office budget up to €1500
  • 100% of an Alan Blue subscription (French-based contracts)
Employment type: Full-time

Senior Software Engineer, AI Platform and Enablement

We're building a next-generation AI-powered platform and web application for cre...
Location: United States, San Francisco
Salary: 180000.00 - 286000.00 USD / Year
Company: Descript
Expiration Date: Until further notice
Requirements
  • Experience in deploying and managing AI models in production
  • Experience with tools for large-volume data pipelines, such as Spark, Flume, and Dask
  • Familiarity with cloud platforms (AWS, Google Cloud, Azure) and container technologies (Docker, Kubernetes)
  • Knowledge of DevOps and MLOps best practices
  • Strong problem-solving abilities and excellent communication skills
Job Responsibility
  • Build, maintain, and standardize third-party model integrations, including consulting for other engineering teams with AI model integration needs
  • Design, implement, and maintain our AI infrastructure supporting our machine learning life cycle, including data ingestion pipelines, training developer experience and infrastructure, evaluation frameworks, and deployments / GPU infrastructure
  • Collaborate with Product Managers, Research Engineers, and AI Researchers to understand their infrastructure needs and ensure our AI systems are robust, scalable, and efficient
  • Optimize and scale our models and algorithms for efficient inference
  • Deploy, monitor, and manage AI models in production
What we offer
  • Generous healthcare package
  • 401k matching program
  • Catered lunches
  • Flexible vacation time
Employment type: Full-time

Senior ML Infrastructure Engineer - Embodied AI

At General Motors, our product teams are redefining mobility. Through a human-ce...
Location: United States, Sunnyvale
Salary: 153200.00 - 234100.00 USD / Year
Company: General Motors
Expiration Date: Until further notice
Requirements
  • 3+ years of experience working on large-scale distributed systems, applications, or ML infrastructure
  • Experience designing robust services or frameworks with durable, well-designed APIs
  • Solid understanding of machine learning workflows and hands-on experience applying ML systems in production environments
  • Experience building reliable, high-performance, and cost-efficient systems on modern cloud infrastructure
  • Practical experience across the ML development lifecycle, including model training, deployment, and MLOps practices
  • Strong cross-functional collaboration skills across teams and organizations
  • Strong coding skills in Python or C++
  • Interest in autonomous driving and large-scale ML systems
  • BS, MS, or PhD in Computer Science, Mathematics, or equivalent practical experience
Job Responsibility
  • Design, implement, and deploy scalable platforms and tools supporting machine learning training and evaluation workflows across GM
  • Drive complex technical projects with strong ownership of implementation, code quality, and system reliability
  • Contribute to technical design discussions and architectural decisions while collaborating with senior engineers and technical leads
  • Work closely with partner teams to ensure platforms meet real-world ML development needs and maximize adoption
  • Identify technical improvements and help prioritize platform investments to improve performance, reliability, and developer productivity
  • Contribute to a strong engineering culture through high-quality code reviews, documentation, and operational excellence
  • Support onboarding and mentoring of junior engineers and interns
What we offer
  • medical
  • dental
  • vision
  • Health Savings Account
  • Flexible Spending Accounts
  • retirement savings plan
  • sickness and accident benefits
  • life insurance
  • paid vacation & holidays
  • tuition assistance programs
Employment type: Full-time

Senior Machine Learning Engineer, AI Platform

The AI Platform team is responsible for building the foundational infrastructure...
Location: United States; Canada
Salary: 139000.00 - 218000.00 USD / Year
Company: Mozilla
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree with 4–6 years of relevant industry experience, or Master’s degree with significant hands-on experience building and operating production ML systems, or work experience equivalent
  • Strong experience developing in Python for machine learning systems, backend services, or distributed data processing
  • Proven experience deploying and operating ML workloads in cloud environments, including production-grade infrastructure
  • Solid understanding of model serving architectures, inference pipelines, and performance tradeoffs (latency, throughput, cost, scaling strategies)
  • Hands-on experience working with GPU-based workloads and accelerated computing in production settings
  • Experience designing CI/CD pipelines and development workflows that support reliable ML system deployment
  • Ability to independently scope and drive technical initiatives while balancing product and operational priorities
  • Strong problem-solving skills and the ability to debug performance and reliability issues in distributed systems
  • Clear and effective communication skills, with experience collaborating across engineering, product, and infrastructure teams
Job Responsibility
  • Design, build, and operate core AI platform components used to train, deploy, and serve machine learning models in production environments
  • Own model serving and inference workflows end-to-end, driving improvements in reliability, scalability, performance, and operational excellence
  • Lead efforts to optimize inference systems for throughput, latency, and cost efficiency across CPU and GPU workloads
  • Design and manage GPU-based inference and training workloads, including performance tuning, capacity planning, and resource utilization optimization
  • Own and improve critical parts of the model lifecycle, including packaging, versioning, testing strategies, validation, and deployment automation
  • Implement and evolve observability practices (metrics, logging, tracing, alerting) to improve visibility and operational resilience of ML services and pipelines
  • Partner closely with product, infrastructure, security, and data teams to design scalable platform capabilities that enable AI-powered features
  • Contribute to technical design discussions, propose architectural improvements, and mentor junior engineers through code reviews and knowledge sharing
  • Participate in and help improve operational processes, including incident response, on-call rotations, and post-incident reviews
What we offer
  • Generous performance-based bonus plans
  • Rich medical, dental, and vision coverage
  • Generous retirement contributions with 100% immediate vesting
  • Quarterly all-company wellness days
  • Country specific holidays plus a day off for your birthday
  • One-time home office stipend
  • Annual professional development budget
  • Quarterly well-being stipend
  • Considerable paid parental leave
  • Employee referral bonus program
Employment type: Full-time

Senior Machine Learning Engineer, ML Training Platform

Location: United States
Salary: 216700.00 - 303400.00 USD / Year
Company: Reddit
Expiration Date: Until further notice
Requirements
  • 5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems
  • Deep Kubernetes Expertise: You know K8s beyond just 'deploying pods.' You understand CRDs, Controllers and the Operator pattern
  • Jupyter Ecosystem Knowledge: Experience customizing JupyterHub, JupyterLab extensions, or building similar interactive computing platforms
  • Strong Coding Skills: Proficiency in Python (for the ML ecosystem) and Go (for Kubernetes controllers/infrastructure tooling)
  • GPU Experience: Hands-on practice with CUDA environments, GPU virtualization/containerization, and doing it all within Kubernetes
  • Cloud Provider Experience: Familiarity with both managed ML offerings (Vertex AI, Sagemaker, etc) and building custom ML components in AWS and/or GCP
  • Experience working with distributed training frameworks, including Ray and Kubernetes
  • Comfortable with distributed systems, big data (Petabyte scale) and data-intensive systems
  • Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle
  • Strong organizational & communication skills
Job Responsibility
  • Lead the building, testing, and maintenance of ML training infrastructure at Reddit
  • Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows
  • Evolve the MLE experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows
  • Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance
  • GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully
  • Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the 'Idea-to-Prototype' loop, and standardize software environments (Docker images, Python dependency management)
What we offer
  • Comprehensive Healthcare Benefits and Income Replacement Programs
  • 401k Match
  • Family Planning Support
  • Gender-Affirming Care
  • Mental Health & Coaching Benefits
  • Flexible Vacation & Reddit Global Days off
  • Generous paid Parental Leave
  • Paid Volunteer time off
Employment type: Full-time