
ML Infra Engineer

United States, San Francisco · Job Posted February 21, 2026

Job Description

In this role you will help scale and optimize our training systems and core model code. You’ll own critical infrastructure for large-scale training, from managing GPU/TPU compute and job orchestration to building reusable and efficient JAX training pipelines. You’ll work closely with researchers and model engineers to translate ideas into experiments—and those experiments into production training runs. This is a hands-on, high-leverage role at the intersection of ML, software engineering, and scalable infrastructure.

Job Responsibility

  • Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging
  • Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction
  • Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization
  • Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments
  • Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost
  • Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale
  • Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics

Requirements

  • Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms
  • Hands-on large-scale training experience in JAX (preferred) or PyTorch
  • Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines
  • Experience managing training workloads on cloud platforms and cluster schedulers (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS)
  • Ability to debug and optimize performance bottlenecks across the training stack
  • Strong cross-functional communication and ownership mindset

Nice to have

  • Deep ML systems background (e.g., training compilers, runtime optimization, custom kernels)
  • Experience operating close to hardware (GPU/TPU performance tuning)
  • Background in robotics, multimodal models, or large-scale foundation models
  • Experience designing abstractions that balance researcher flexibility with system reliability


Similar Jobs for ML Infra Engineer

8 matching positions

Senior Software Engineer – ML Model Compliance & Automation

We are seeking a highly skilled and motivated Senior Software Engineer to lead t...
Location: India, Jaipur
Salary: Not provided
Company: InfoObjects
Expiration Date: Until further notice
Requirements
  • Experience required: 3 to 7 years
  • GoLang (preferred)
  • Python (preferred)
  • Bash
  • MLOps Tools: KitOps, MLModelCI, MLflow, ONNX, TensorFlow, PyTorch, Docker
  • SBOM & Security: Syft, Grype, Trivy, CycloneDX, SPDX
  • CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
  • Infra: Kubernetes, Docker, Helm, Terraform
  • Cloud: AWS, GCP, Azure (EKS/GKE/ECS preferred)
  • Version Control: Git, GitOps
Job Responsibility
  • Model Packaging & Artifact Management: Design and implement workflows for packaging ML models using KitOps, ONNX, MLflow, or TensorFlow SavedModel
  • Manage model artifact versioning, registries, and reproducibility
  • Ensure artifact integrity, consistency, and traceability across CI/CD pipelines
  • Model Profiling & Optimization: Automate model profiling (latency, size, ops) using MLModelCI, TorchServe, or ONNX Runtime
  • Apply quantization, pruning, and format conversions (e.g., FP32→INT8) for optimization
  • Embed profiling and optimization checks into CI/CD pipelines to assess deployment readiness
  • Compliance & SBOM Generation: Develop pipelines to generate and validate SBOMs for ML models
  • Implement compliance checks for licensing, vulnerabilities, and security using CycloneDX, SPDX, Syft, or Trivy
  • Validate schema, dependencies, and runtime environments for production readiness
  • Cloud Integration & Deployment: Automate model registration, endpoint creation, and monitoring setup in AWS/GCP/Azure
Employment Type: Full-time

Engineering Manager - Machine Learning Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location: United States, San Francisco
Salary: 241,200 to 400,000 USD per year
Company: Plaid
Expiration Date: Until further notice
Requirements
  • 8–10 years of experience in ML infrastructure, including direct hands-on work as an engineer (IC or tech lead)
  • 2+ years of experience managing infrastructure or ML platform engineers
  • Proven experience delivering and operating ML or AI infrastructure at scale
  • Solid technical depth across ML/AI infrastructure domains (e.g., feature stores, pipelines, deployment, inference, observability)
  • Demonstrated ability to drive execution on complex technical projects with cross-team stakeholders
  • Strong communication and stakeholder management skills
Job Responsibility
  • Lead and support the ML Infra team, driving project execution and ensuring delivery on key commitments
  • Build and launch Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Define and drive adoption of an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines, deployment tooling, and inference systems
  • Partner with ML product teams to understand requirements and deliver solutions that accelerate model development and iteration
  • Recruit, mentor, and develop engineers, fostering a collaborative and high-performing team culture
What we offer
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
Employment Type: Full-time

Machine Learning Operations Engineer II

Kensho is S&P Global’s hub for AI innovation and transformation. With expertise ...
Location: United States, Cambridge; New York
Salary: 130,000 to 175,000 USD per year
Company: Kensho Technologies
Expiration Date: Until further notice
Requirements
  • 2+ years of experience in ML infra, ML Ops, ML Engineering, or a similar skill set
  • Experience managing distributed systems with Kubernetes
  • Cloud Platform (AWS) understanding
  • Python proficiency
  • Familiarity with distributed computing frameworks and workflow orchestration (e.g., Ray, Airflow)
  • Familiarity with software engineering best practices in an ML context
  • Some basic understanding of ML concepts, LLMs and agents
  • Ability to debug distributed systems across infrastructure, networking and application layers
  • Excellent communication skills to drive adoption of new tools and best practices across multiple teams
  • Someone who’s very curious, driven, low-ego and eager to learn across a range of engineering disciplines
Job Responsibility
  • Iterate on Kensho’s ML processes to develop tools, services, and frameworks that make every stage of the ML workflow robust, auditable, and usable
  • Work closely with ML engineers to understand their unique processes, identify pain points, and form effective solutions
  • Empower engineers with the stable tooling necessary to rapidly experiment and actualize their research into demonstrable prototypes and mature products
  • Provide resources and training for ML teams on best practices, enabling them to efficiently productionize their work to be leveraged by high-value products and services
  • Evaluate, select and champion open source and third-party solutions, driving their adoption across teams and integrating into Kensho’s existing platform ecosystem
  • Ship scalable, efficient, and automated processes for model fine-tuning and reinforcement learning and for the evaluation of LLMs/Agents
  • Improve LLM and Agentic observability to help monitor agentic applications in production, detecting performance, decay and drift issues
  • Stay at the frontier by actively tracking emerging tools and frameworks, promoting best practices, and strengthening the team's technical expertise with your unique skill set
What we offer
  • Medical, Dental, and Vision insurance
  • 100% company paid premiums
  • Unlimited Paid Time Off
  • 26 weeks of 100% paid Parental Leave (paternity and maternity)
  • 401(k) plan with 6% employer matching
  • Generous company matching on donations to non-profit charities
  • Up to $20,000 tuition assistance toward degree programs, plus up to $4,000/year for ongoing professional education such as industry conferences
  • Plentiful snacks, drinks, and regularly catered lunches
  • Dog-friendly office (CAM office)
  • Bike sharing program memberships
Employment Type: Full-time

Principal Engineer - Marketplace

Principal Engineer role in the Marketplace Engineering team to lead breakthrough...
Location: United States, San Francisco; Sunnyvale
Salary: 302,000 to 336,000 USD per year
Company: Uber
Expiration Date: Until further notice
Requirements
  • PhD in Computer Science, Machine Learning, Operations Research, or related quantitative field OR Master’s degree with 12+ years of industry experience
  • 10+ years of experience building and deploying ML models in large-scale production environments
  • Expert-level proficiency in modern ML frameworks (TensorFlow, PyTorch, JAX) and distributed computing platforms (Spark, Ray)
  • Deep expertise across multiple areas including: Deep Learning, Causal Inference, Reinforcement Learning, Multi-objective Optimization, Algorithmic Game Theory, and Large-scale Ads Ranking/Auction Systems
  • Proven track record of leading complex ML projects from research through production with significant measurable business impact
  • Strong programming skills in Python, Java, or Go with experience building production ML systems
  • Experience with feature engineering, model serving, and ML infrastructure at scale (handling millions of predictions per second)
  • Technical leadership experience including mentoring senior engineers and driving cross-team technical initiatives
  • Advanced Deep Learning and Neural Network architectures
  • Scalable ML architecture and distributed model training
Job Responsibility
  • Lead the design and implementation of advanced ML systems for dynamic pricing algorithms serving millions of drivers across 70+ countries around the world
  • Architect real-time ML infrastructure handling 1M+ pricing decisions per second with sub-50ms latency requirements
  • Drive breakthrough research in causal ML, reinforcement learning, algorithmic game theory, and multi-objective optimization for marketplace optimization with strategic agents
  • Own end-to-end ML model lifecycle from research through production deployment and continuous optimization
  • Develop and enforce best practices in system design, ensuring data integrity, security, and optimal performance
  • Serve as a representative for the Marketplace organization to the broader internal and external technical community
  • Contribute to the eng brand for Marketplace and serve as a talent magnet to help attract and retain talent for the team
  • Stay abreast of industry trends and emerging technologies in software engineering, focused particularly on ML/AI, to enhance our systems and processes continually
  • Build scalable ML architecture and feature management systems supporting Driver Pricing and broader Marketplace teams
  • Design experimentation frameworks enabling rapid testing of pricing algorithms using A/B, Switchback, Synthetic Control, and other experimental methodologies
What we offer
  • Eligible to participate in Uber's bonus program
  • May be offered an equity award & other types of comp
  • Eligible to participate in a 401(k) plan
  • Eligible for various benefits (details at provided link)
Employment Type: Full-time

Senior Machine Learning Engineer

We’re seeking a Senior Machine Learning Engineer (P50) to join our new GenAI Mod...
Location: Singapore
Salary: Not provided
Company: Atlassian
Expiration Date: Until further notice
Requirements
  • Extensive experience (generally 5+ years) in ML systems engineering, backend engineering, or infrastructure roles
  • Strong background in one or more of: LLMs, NLP, search/retrieval, embeddings, or applied ML
  • Hands-on experience with at least one GenAI area: RAG pipelines, fine-tuning, hybrid retrieval, or orchestration frameworks
  • Proficiency with modern ML frameworks (PyTorch, TensorFlow, Hugging Face, LangChain, LlamaIndex)
  • Familiarity with vector databases (Weaviate, Pinecone, FAISS, etc.) and large-scale serving infra
  • Strong coding skills (Python, backend engineering) and ability to move fast from idea to prototype
  • Comfort working in fast-paced, experimental environments with evolving direction
  • Bachelor’s or Master’s in Computer Science, Machine Learning, or related field—or equivalent experience
Job Responsibility
  • Build and apply advanced GenAI models
  • Develop and fine-tune LLMs and embeddings for Atlassian’s unique knowledge and enterprise data
  • Implement retrieval-augmented generation (RAG), hybrid retrieval, and knowledge-grounded modeling approaches
  • Work hands-on with modern frameworks, contributing directly to high-value prototypes and experiments
  • Prototype and experiment quickly
  • Build proof-of-concept systems for GenAI-powered assistants, agentic workflows, and innovative user experiences
  • Run experiments, collect feedback, and iterate fast to validate impact
  • Design and implement evaluation methods for quality, groundedness, and user value
  • Collaborate and contribute
  • Work closely with peers across ML, engineering, and product teams to bring new ideas to life
What we offer
  • Health and wellbeing resources
  • Paid volunteer days

Senior Machine Learning Engineering Manager, Gen AI

We're seeking a Senior Machine Learning Manager (M60) to lead a cross-functional...
Location: United States
Salary: 193,500 to 303,150 USD per year
Company: Atlassian
Expiration Date: Until further notice
Requirements
  • 8+ years in ML, search, or backend engineering roles, with 3+ years leading teams
  • Strong track record of shipping ML-powered or LLM-integrated user-facing products
  • Experience with RAG systems (vector search, hybrid retrieval, LLM orchestration)
  • Deep experience in either modeling (e.g., LLMs, search, NLP) or engineering (e.g., backend infra, full-stack), with the ability to lead end-to-end
  • Deep understanding of LLM ecosystems (OpenAI, Claude, Mistral, OSS), orchestration frameworks (LangChain, LlamaIndex), and vector databases (Weaviate, Pinecone, FAISS, etc.)
  • Strong product intuition and ability to translate complex tech into valuable user features
  • Familiarity with GenAI evaluation methods: hallucination detection, groundedness scoring, and human-in-the-loop feedback loops
  • Master’s or PhD in Computer Science, Machine Learning, or related field preferred—or equivalent practical experience
Job Responsibility
  • Lead the vision, design, and execution of LLM-powered AI products, leveraging advanced AI modeling (e.g., SLM post-training/fine-tuning), RAG architectures, and hybrid ranking systems
  • Define system architecture across retrievers, rankers, orchestration layers, prompt templates, and feedback mechanisms
  • Work closely with product and design teams to ensure delightful, fast, and grounded user experiences
  • Build and manage a cross-disciplinary team including ML engineers, backend/frontend engineers, and applied scientists
  • Foster a culture of E2E ownership — empowering the team to move from prototype to production quickly and iteratively
  • Mentor individuals to grow in both technical depth and product acumen
  • Shape the technical roadmap and long-term strategy for GenAI search across Atlassian’s product suite
  • Partner with platform and infra teams to scale inference, evaluate performance, and integrate usage signals for continuous improvement
  • Champion data quality, grounding, and responsible AI practices in all deployed features
What we offer
  • health and wellbeing resources
  • paid volunteer days
Employment Type: Full-time

Senior CVML Platform Engineer

We are seeking a Senior CVML Platform Engineer to help design, build, and evolve...
Location: United States
Salary: 160,000 to 287,000 USD per year
Company: Blue River Technology
Expiration Date: Until further notice
Requirements
  • 5+ years of professional engineering experience, with a focus on platform, infrastructure, or systems engineering
  • Strong technical judgment, balancing the evolution of legacy platforms with the design and delivery of new, greenfield components shared across multiple teams and workloads
  • Excellent Python skills, used in production systems, tooling, and platform components
  • Solid understanding of ML systems and the end-to-end model development lifecycle, from experimentation to deployment and iteration
  • Hands-on experience or strong familiarity with cloud platforms (AWS preferred) and orchestration or scheduling systems such as Kubernetes and Slurm
  • Ability to partner effectively with ML engineers, infra teams, and product stakeholders to translate requirements into platform capabilities
  • Ability to quickly ramp up on new domains, tools, and complex existing systems
Job Responsibility
  • Design, build, and evolve platform capabilities that support ML training, batch inference, and model deployment workflows at scale
  • Own and improve core platform components (e.g., compute orchestration, data pipelines, inference systems) used by multiple teams across Blue River and John Deere
  • Continuously enhance platform reliability, scalability, and performance, with a focus on real-world ML workloads
  • Enable ML engineers to move faster by building intuitive, well-documented platform tools and workflows across the model lifecycle (experimentation, deployment, and iteration)
  • Improve model inference performance and throughput while balancing trade-offs among cost, latency, and reliability
  • Support and scale distributed training and inference systems, including frameworks such as Ray and related tooling
  • Develop and optimize hybrid compute environments (cloud + on-prem/GPU infrastructure) to support large-scale ML workloads
  • Build and maintain infrastructure leveraging Kubernetes, Slurm, and cloud platforms (AWS preferred)
  • Identify and resolve bottlenecks in compute, storage, and data movement pipelines
  • Evaluate existing platform systems and make thoughtful decisions on when to extend, refactor, or rebuild components
What we offer
  • bonus and benefit programs
Employment Type: Full-time

Senior AI Data Engineer

We are looking for a Senior AI Data Engineer to join a high-impact AI product in...
Location: United States
Salary: Not provided
Company: Velvetech
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in Data Engineering / ML Engineering / AI Engineering
  • Strong programming skills in Python
  • Hands-on experience with PyTorch (training and deploying deep learning models)
  • Experience working with Vertex AI or similar ML platforms (GCP preferred)
  • Proven experience with vector databases (Milvus, Pinecone, or similar)
  • Strong knowledge of: Feature engineering techniques, Model evaluation and validation frameworks, Predictive inference systems
  • Experience with multiple database paradigms: Relational (PostgreSQL), Time-series (InfluxDB), Graph (Neo4j)
  • Solid understanding of embeddings and semantic/vector search systems
  • Experience implementing model lifecycle management, including: Drift detection, Monitoring, Governance
  • Strong understanding of scalable system design and performance optimization
Job Responsibility
  • Own and develop the biometric extraction model lifecycle (training, validation, deployment)
  • Design and maintain a vector memory layer using tools such as Milvus or Pinecone
  • Build and optimize predictive inference services for real-time and batch use cases
  • Develop and maintain data pipelines for PFM (Personal Financial Management) data preparation
  • Implement advanced feature engineering frameworks and model evaluation pipelines
  • Work with Vertex AI for model training, deployment, and orchestration
  • Manage and integrate heterogeneous data storage systems: InfluxDB (time-series data), PostgreSQL (relational data), Neo4j (graph data)
  • Develop vector embeddings pipelines and similarity search logic
  • Implement model governance processes: Drift detection and monitoring, Shadow-mode validation, Performance tracking and reporting
  • Design and apply optimization policies for inference latency, cost, and accuracy
What we offer
  • Flexible working conditions
  • Cooperative environment
  • Competitive salary
  • Many challenging and exciting projects with new opportunities and learning
  • Growth opportunities, skills and competencies improvement, and professional certification
  • In-company training (English, Software / DevOps / Project management / Design / Business)
Employment Type: Full-time