Machine Learning Infra Engineer

United States, San Francisco · Employment contract · 150000.00 - 300000.00 USD / Year · Job Posted May 17, 2026

Job Description

As an ML Infra Engineer, you’ll play a key role in building the inference and training frameworks that make it possible to deliver results at scale. You’ll collaborate closely with our ML and Platform teams to scale training across nodes, develop faster and more efficient serving, and create observability across the stack. This is a high-impact role where you’ll help define what high-performance ML training and inference look like at Reducto.

Job Responsibility

  • Build and maintain our training and inference stack, with an emphasis on fast iteration and flexibility for exploring new methods in training, and on high performance in inference
  • Develop benchmarks for both stacks to identify bottlenecks
  • Explore SOTA advances in training and inference and work to apply them
  • Design systems for scaling model training across multi-node, multi-GPU environments with strong reliability and observability
  • Scale distributed training and inference workloads across large GPU clusters while improving utilization, reliability, and cost efficiency
  • Build the tooling, abstractions, and observability that help ML engineers move faster from experiment to production

Requirements

  • Hold yourself to a high bar for quality and precision
  • Enjoy solving complex problems and building from first principles
  • Strong Python skills and a background in systems engineering
  • Comfortable with Kubernetes and distributed training frameworks
  • Love getting your hands dirty with real-world implementation challenges
  • Operate well in fast-changing, high-growth environments
  • Collaborate effectively across technical and non-technical teams
  • Take full ownership from strategy through execution

Nice to have

  • Experience at an early-stage or high-growth startup
  • Contributed meaningfully to open-source training/inference stacks
  • Excited to set up distributed inference across hundreds to thousands of GPUs
  • Care deeply about combining technical excellence with business impact

What we offer

  • Unlimited PTO
  • Lunch
  • Reimbursed Transportation
  • Insurance
  • Health and Wellness Budget
  • Parental Leave

Similar Jobs for Machine Learning Infra Engineer

8 matching positions

Principal Machine Learning Engineer

This is a high-leverage leadership role that spans architecture, execution, and ...
Location: India, Bengaluru
Salary: Not provided
Company: Ema (ema.co)
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s (or PhD) degree in Computer Science, Machine Learning, Statistics, or a related field
  • A strong track record (usually 10-12+ years) of applied experience with ML techniques, especially in large-scale settings
  • Experience building production ML systems that operate at scale (latency / throughput / cost constraints)
  • Experience in the knowledge retrieval and search space
  • Exposure to building agentic systems and frameworks
  • Proficiency in relevant programming languages (e.g. Python, C++, Java) and ML frameworks (TensorFlow, PyTorch, etc.)
  • Strong understanding of the full ML lifecycle: data pipelines, feature engineering, model training, serving, monitoring, maintenance
  • Experience designing systems for monitoring, diagnostics, logging, model versioning, etc.
  • Deep knowledge of computational trade-offs: distributed training, inference, optimizations (e.g. quantization, pruning, batching)
  • Excellent communication skills
Job Responsibility
  • Lead the technical direction of GenAI and agentic ML systems that power enterprise-grade AI agents — spanning reasoning, retrieval, tool use, and integrations across various SaaS products
  • Architect, design, and implement scalable production pipelines for model training, fine-tuning, retrieval (RAG), agent orchestration, and evaluation — ensuring robustness, latency efficiency, and continuous learning
  • Define and own the multi-year ML roadmap for GenAI infrastructure — including agent frameworks, RAG systems, world-class evaluation loops, and integration with MCP, browser, and vision pipelines
  • Identify and integrate cutting-edge ML methods / research (deep learning, large models, recommender systems, LLMs, etc.) into Ema’s products or infrastructure
  • Research, prototype, and integrate cutting-edge ML and LLM advancements (reasoning, memory architectures, multi-modal perception, long-context models, autonomous agents) into the platform
  • Optimize trade-offs between accuracy, latency, cost, interpretability, and real-world reliability across the agent lifecycle — from prompt design to orchestration and execution
  • Champion engineering excellence — drive observability, reproducibility, versioning, testing, and bias-aware development across ML and agentic systems
  • Mentor and elevate senior engineers and researchers, fostering a culture of scientific rigor, experimentation, and system-level thinking
  • Collaborate cross-functionally with product, infra, and research teams to align ML innovation with enterprise needs — enabling secure integrations, privacy-aware deployments, and scalable use cases
  • Influence data strategy — guide how retrieval indices, embeddings, structured/unstructured corpora, and feedback loops evolve to improve grounding, factuality, and reasoning depth
Employment type: Full-time

Machine Learning Engineer – HPC

At Meshy, we believe 3D creation should be boundless and accessible. Our mission...
Location: China, Shanghai
Salary: Not provided
Company: Meshy LLC (meshy.ai)
Expiration Date: Until further notice
Requirements
  • Hands-on experience with CUDA and GPU programming
  • Strong programming skills in C++ and Python
  • Solid understanding of parallel programming, performance tuning, and numerical computation
Job Responsibility
  • Design, implement, and optimize GPU computing kernels to accelerate model training and inference for next-generation 3D GenAI models
  • Develop and maintain domain-specific libraries and performance-critical components for 3D generation workloads
  • Work closely with researchers and infra engineers to identify bottlenecks, benchmark performance, and deliver high-efficiency, production-ready GPU modules
Employment type: Full-time

Senior Machine Learning Engineer

Start.io, a leading mobile marketing and audience platform, empowers the app eco...
Salary: Not provided
Company: Start.io (start.io)
Expiration Date: Until further notice
Requirements
  • B.Sc. or M.Sc. in Computer Science, Software Engineering, or a related technical discipline
  • 5+ years of experience building high-performance backend or ML inference systems
  • Deep expertise in Python and experience with low-latency APIs and real-time serving frameworks (e.g., FastAPI, Triton Inference Server, TorchServe, BentoML)
  • Experience with scalable service architecture, message queues (Kafka, Pub/Sub), and async processing
  • Strong understanding of model deployment practices, online/offline feature parity, and real-time monitoring
  • Experience in cloud environments (AWS, GCP, or OCI) and container orchestration (Kubernetes)
  • Experience working with in-memory and NoSQL databases (e.g. Aerospike, Redis, Bigtable) to support ultra-fast data access in production-grade ML services
  • Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry) and best practices for alerting and diagnostics
  • A strong sense of ownership and the ability to drive solutions end-to-end
  • Passion for performance, clean architecture, and impactful systems
Job Responsibility
  • Own and lead the design and development of low-latency Algo inference services handling billions of requests per day
  • Build and scale robust real-time decision-making engines, integrating ML models with business logic under strict SLAs
  • Collaborate closely with DS to deploy models seamlessly and reliably in production
  • Design systems for model versioning, shadowing, and A/B testing at runtime
  • Ensure high availability, scalability, and observability of production systems
  • Continuously optimize latency, throughput, and cost-efficiency using modern tooling and techniques
  • Work independently while interfacing with cross-functional stakeholders from Algo, Infra, Product, Engineering, BA & Business
What we offer
  • Lead the mission-critical inference engine that drives our core product
  • Join a high-caliber Algo group solving real-time, large-scale, high-stakes problems
  • Work on systems where every millisecond matters, and every decision drives real value
  • Enjoy a fast-paced, collaborative, and empowered culture with full ownership of your domain

Senior Machine Learning Engineer

We’re seeking a Senior Machine Learning Engineer (P50) to join our new GenAI Mod...
Location: Singapore
Salary: Not provided
Company: Atlassian (https://www.atlassian.com)
Expiration Date: Until further notice
Requirements
  • Extensive experience (generally 5+ years) in ML systems engineering, backend engineering, or infrastructure roles
  • Strong background in one or more of: LLMs, NLP, search/retrieval, embeddings, or applied ML
  • Hands-on experience with at least one GenAI area: RAG pipelines, fine-tuning, hybrid retrieval, or orchestration frameworks
  • Proficiency with modern ML frameworks (PyTorch, TensorFlow, Hugging Face, LangChain, LlamaIndex)
  • Familiarity with vector databases (Weaviate, Pinecone, FAISS, etc.) and large-scale serving infra
  • Strong coding skills (Python, backend engineering) and ability to move fast from idea to prototype
  • Comfort working in fast-paced, experimental environments with evolving direction
  • Bachelor’s or Master’s in Computer Science, Machine Learning, or related field—or equivalent experience
Job Responsibility
  • Build and apply advanced GenAI models
  • Develop and fine-tune LLMs and embeddings for Atlassian’s unique knowledge and enterprise data
  • Implement retrieval-augmented generation (RAG), hybrid retrieval, and knowledge-grounded modeling approaches
  • Work hands-on with modern frameworks, contributing directly to high-value prototypes and experiments
  • Prototype and experiment quickly
  • Build proof-of-concept systems for GenAI-powered assistants, agentic workflows, and innovative user experiences
  • Run experiments, collect feedback, and iterate fast to validate impact
  • Design and implement evaluation methods for quality, groundedness, and user value
  • Collaborate and contribute
  • Work closely with peers across ML, engineering, and product teams to bring new ideas to life
What we offer
  • Health and wellbeing resources
  • Paid volunteer days

Machine Learning Operations Engineer II

Kensho is S&P Global’s hub for AI innovation and transformation. With expertise ...
Location: United States, Cambridge; New York
Salary: 130000.00 - 175000.00 USD / Year
Company: Kensho Technologies (kensho.com)
Expiration Date: Until further notice
Requirements
  • 2+ years of experience in ML infra, ML Ops, ML Engineering or some similar skillset
  • Experience managing distributed systems with Kubernetes
  • Cloud Platform (AWS) understanding
  • Python proficiency
  • Familiarity with distributed computing frameworks and workflow orchestration (e.g., Ray, Airflow)
  • Familiarity with software engineering best practices in an ML context
  • Some basic understanding of ML concepts, LLMs and agents
  • Ability to debug distributed systems across infrastructure, networking and application layers
  • Excellent communication skills to drive adoption of new tools and best practices across multiple teams
  • Someone who’s very curious, driven, low-ego and eager to learn across a range of engineering disciplines
Job Responsibility
  • Iterate on Kensho’s ML processes to develop tools, services, and frameworks that make every stage of the ML workflow robust, auditable, and usable
  • Work closely with ML engineers to understand their unique processes, identify pain points, and form effective solutions
  • Empower engineers with the stable tooling necessary to rapidly experiment and actualize their research into demonstrable prototypes and mature products
  • Provide resources and training for ML teams on best practices, enabling them to efficiently productionize their work to be leveraged by high-value products and services
  • Evaluate, select and champion open source and third-party solutions, driving their adoption across teams and integrating into Kensho’s existing platform ecosystem
  • Ship scalable, efficient, and automated processes for model fine-tuning and reinforcement learning and for the evaluation of LLMs/Agents
  • Improve LLM and Agentic observability to help monitor agentic applications in production, detecting performance, decay and drift issues
  • Stay at the frontier by actively tracking emerging tools and frameworks, promote best practices and strengthen the technical expertise of the team with your unique skill set
What we offer
  • Medical, Dental, and Vision insurance
  • 100% company paid premiums
  • Unlimited Paid Time Off
  • 26 weeks of 100% paid Parental Leave (paternity and maternity)
  • 401(k) plan with 6% employer matching
  • Generous company matching on donations to non-profit charities
  • Up to $20,000 tuition assistance toward degree programs, plus up to $4,000/year for ongoing professional education such as industry conferences
  • Plentiful snacks, drinks, and regularly catered lunches
  • Dog-friendly office (CAM office)
  • Bike sharing program memberships
Employment type: Full-time

Machine Learning Data Engineer - Systems & Retrieval

As a Machine Learning Data Engineer - Systems & Retrieval, you will build and op...
Location: United States, Palo Alto
Salary: Not provided
Company: Zyphra (zyphra.com)
Expiration Date: Until further notice
Requirements
  • Strong software engineering background with fluency in Python
  • Experience designing, building, and maintaining data pipelines in production environments
  • Deep understanding of data structures, storage formats, and distributed data systems
  • Familiarity with indexing and retrieval techniques for large-scale document corpora
  • Understanding of database systems (SQL and NoSQL), their internals, and performance characteristics
  • Strong attention to security, access controls, and compliance best practices (e.g., GDPR, SOC2)
  • Excellent debugging, observability, and logging practices to support reliability at scale
  • Strong communication skills and experience collaborating across ML, infra, and product teams
Job Responsibility
  • Design and implementation of distributed data ingestion and transformation pipelines
  • Building retrieval and indexing systems that support RAG and other LLM-based methods
  • Mining and organizing large unstructured datasets, both in research and production environments
  • Collaborating with ML engineers, systems engineers, and DevOps to scale pipelines and observability
  • Ensuring compliance and access control in data handling, with security and auditability in mind
What we offer
  • Comprehensive medical, dental, vision, and FSA plans
  • Competitive compensation and 401(k)
  • Relocation and immigration support on a case-by-case basis
  • On-site meals prepared by a dedicated culinary team
  • Thursday Happy Hours
Employment type: Full-time

Engineering Manager - Machine Learning Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location: United States, New York
Salary: 216000.00 - 367200.00 USD / Year
Company: Plaid (plaid.com)
Expiration Date: Until further notice
Requirements
  • 8–10 years of experience in ML infrastructure, including direct hands-on expertise as an engineer, IC/TL
  • 2+ years of experience managing infrastructure or ML platform engineers
  • Proven experience delivering and operating ML or AI infrastructure at scale
  • Solid technical depth across ML/AI infrastructure domains (e.g., feature stores, pipelines, deployment, inference, observability)
  • Demonstrated ability to drive execution on complex technical projects with cross-team stakeholders
  • Strong communication and stakeholder management skills
Job Responsibility
  • Lead and support the ML Infra team, driving project execution and ensuring delivery on key commitments
  • Build and launch Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Define and drive adoption of an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines, deployment tooling, and inference systems
  • Partner with ML product teams to understand requirements and deliver solutions that accelerate model development and iteration
  • Recruit, mentor, and develop engineers, fostering a collaborative and high-performing team culture
Employment type: Full-time

ML Infra Engineer (Data Systems)

As an ML Infra Engineer (Data Systems), you’ll build and operate the data infras...
Location: United States, San Francisco
Salary: Not provided
Company: Physical Intelligence (physicalintelligence.company)
Expiration Date: Until further notice
Requirements
  • Strong software engineering fundamentals
  • Experience building distributed systems or large-scale data pipelines
  • Comfort reasoning about performance, memory, I/O, and storage efficiency
  • Familiarity with batch and/or streaming processing systems
  • Experience with object storage systems and data format tradeoffs
  • Ownership mindset: design, build, operate, and iterate on systems end-to-end
  • Enjoy working closely with researchers and unblocking fast-moving projects
Job Responsibility
  • Data Ingestion & Processing: Design and build high-throughput pipelines that validate, transform, and featurize raw multimodal data
  • Batch & Streaming Systems: Operate large-scale batch and streaming workflows over massive datasets
  • Storage Systems: Design object storage layouts, metadata systems, and efficient access patterns; choose file formats with performance and scalability in mind
  • Data Lifecycle Management: Build systems for backfills, dataset rebuilds, garbage collection, and large-scale transformations
  • Training-Time Performance: Optimize dataloaders, sharding, prefetching, caching, and throughput to reduce time from data arrival → model training
  • Metadata & Indexing: Build scalable metadata stores for datasets, annotations, and training artifacts
  • Data Movement: Move hundreds of terabytes to petabytes efficiently across clusters and environments
  • Operational Correctness: Implement observability, validation, and guardrails to prevent silent data regressions
  • Cross-Functional Collaboration: Work closely with cross-functional teams of researchers, engineers and roboticists to translate evolving data needs into robust systems
Employment type: Full-time