Machine Learning Infra Engineer

Reducto

Location:
United States, San Francisco

Contract Type:
Employment contract

Salary:

150000.00 - 300000.00 USD / Year

Job Description:

As an ML Infra Engineer, you’ll play a key role in building the inference and training frameworks that make it possible to deliver results at scale. You’ll collaborate closely with our ML and Platform teams to scale training across nodes, develop faster and more efficient serving, and create observability across the stack. This is a high-impact role where you’ll help define what high-performance ML training and inference look like at Reducto.

Job Responsibility:

  • Build and maintain our training and inference stacks, emphasizing fast iteration and flexibility to explore new methods in training, and high performance in inference
  • Develop benchmarks for both stacks to identify bottlenecks
  • Explore SOTA advances in training and inference and work to apply them
  • Design systems for scaling model training across multi-node, multi-GPU environments with strong reliability and observability
  • Scale distributed training and inference workloads across large GPU clusters while improving utilization, reliability, and cost efficiency
  • Build the tooling, abstractions, and observability that help ML engineers move faster from experiment to production

Requirements:

  • Hold yourself to a high bar for quality and precision
  • Enjoy solving complex problems and building from first principles
  • Strong Python skills and a background in systems engineering
  • Comfortable with Kubernetes and distributed training frameworks
  • Love getting your hands dirty with real-world implementation challenges
  • Operate well in fast-changing, high-growth environments
  • Collaborate effectively across technical and non-technical teams
  • Take full ownership from strategy through execution

Nice to have:

  • Experience at an early-stage or high-growth startup
  • Meaningful contributions to open-source training/inference stacks
  • Excited to set up distributed inference across 100s-1000s of GPUs
  • Care deeply about combining technical excellence with business impact

What we offer:

  • Unlimited PTO
  • Lunch
  • Reimbursed Transportation
  • Insurance
  • Health and Wellness Budget
  • Parental Leave

Additional Information:

Job Posted:
May 17, 2026

Employment Type:
Full-time

Work Type:
On-site work

Similar Jobs for Machine Learning Infra Engineer

Senior Machine Learning System Engineer

As a Senior ML System Engineer on the AI & ML Platform team, you will play a piv...
Location
United States, Seattle; San Francisco; New York; Austin
Salary:
165500.00 - 265800.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements
  • Experience in building machine learning systems or ML infra / MLOps platform
  • Fluency in at least one modern object-oriented programming language (preferably Java/Kotlin and Python)
  • Experience with RESTful microservices
  • Experience using cloud tools such as Amazon Web Services (S3, Kinesis, CloudFormation, EKS, AWS Security and Networking)
  • Experience with Continuous Delivery and Continuous Integration
Job Responsibility
  • Collaborate with your teammates to solve complex problems, from technical design to launch
  • Deliver cutting-edge solutions that are used by other Atlassian teams and products to build AI features that reach millions of customers
  • Deliver code reviews, documentation & bug fixes within a strong engineering culture
  • Partner across engineering teams to take on company-wide initiatives spanning multiple projects
  • Mentor junior members of the team
What we offer
  • health and wellbeing resources
  • paid volunteer days
  • Fulltime

Engineering Manager - Machine Learning Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location
United States, San Francisco
Salary:
241200.00 - 400000.00 USD / Year
Plaid
Expiration Date
Until further notice
Requirements
  • 8–10 years of experience in ML infrastructure, including direct hands-on expertise as an engineer, IC/TL
  • 2+ years of experience managing infrastructure or ML platform engineers
  • Proven experience delivering and operating ML or AI infrastructure at scale
  • Solid technical depth across ML/AI infrastructure domains (e.g., feature stores, pipelines, deployment, inference, observability)
  • Demonstrated ability to drive execution on complex technical projects with cross-team stakeholders
  • Strong communication and stakeholder management skills
Job Responsibility
  • Lead and support the ML Infra team, driving project execution and ensuring delivery on key commitments
  • Build and launch Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Define and drive adoption of an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines, deployment tooling, and inference systems
  • Partner with ML product teams to understand requirements and deliver solutions that accelerate model development and iteration
  • Recruit, mentor, and develop engineers, fostering a collaborative and high-performing team culture
What we offer
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
  • Fulltime

Senior Machine Learning Engineer

Start.io, a leading mobile marketing and audience platform, empowers the app eco...
Location
Salary:
Not provided
Start.io
Expiration Date
Until further notice
Requirements
  • B.Sc. or M.Sc. in Computer Science, Software Engineering, or a related technical discipline
  • 5+ years of experience building high-performance backend or ML inference systems
  • Deep expertise in Python and experience with low-latency APIs and real-time serving frameworks (e.g., FastAPI, Triton Inference Server, TorchServe, BentoML)
  • Experience with scalable service architecture, message queues (Kafka, Pub/Sub), and async processing
  • Strong understanding of model deployment practices, online/offline feature parity, and real-time monitoring
  • Experience in cloud environments (AWS, GCP, or OCI) and container orchestration (Kubernetes)
  • Experience working with in-memory and NoSQL databases (e.g. Aerospike, Redis, Bigtable) to support ultra-fast data access in production-grade ML services
  • Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry) and best practices for alerting and diagnostics
  • A strong sense of ownership and the ability to drive solutions end-to-end
  • Passion for performance, clean architecture, and impactful systems
Job Responsibility
  • Own and lead the design and development of low-latency Algo inference services handling billions of requests per day
  • Build and scale robust real-time decision-making engines, integrating ML models with business logic under strict SLAs
  • Collaborate closely with DS to deploy models seamlessly and reliably in production
  • Design systems for model versioning, shadowing, and A/B testing at runtime
  • Ensure high availability, scalability, and observability of production systems
  • Continuously optimize latency, throughput, and cost-efficiency using modern tooling and techniques
  • Work independently while interfacing with cross-functional stakeholders from Algo, Infra, Product, Engineering, BA & Business
What we offer
  • Lead the mission-critical inference engine that drives our core product
  • Join a high-caliber Algo group solving real-time, large-scale, high-stakes problems
  • Work on systems where every millisecond matters, and every decision drives real value
  • Enjoy a fast-paced, collaborative, and empowered culture with full ownership of your domain

Senior Machine Learning Engineer

We’re seeking a Senior Machine Learning Engineer (P50) to join our new GenAI Mod...
Location
Singapore
Salary:
Not provided
Atlassian
Expiration Date
Until further notice
Requirements
  • Extensive experience (generally 5+ years) in ML systems engineering, backend engineering, or infrastructure roles
  • Strong background in one or more of: LLMs, NLP, search/retrieval, embeddings, or applied ML
  • Hands-on experience with at least one GenAI area: RAG pipelines, fine-tuning, hybrid retrieval, or orchestration frameworks
  • Proficiency with modern ML frameworks (PyTorch, TensorFlow, Hugging Face, LangChain, LlamaIndex)
  • Familiarity with vector databases (Weaviate, Pinecone, FAISS, etc.) and large-scale serving infra
  • Strong coding skills (Python, backend engineering) and ability to move fast from idea to prototype
  • Comfort working in fast-paced, experimental environments with evolving direction
  • Bachelor’s or Master’s in Computer Science, Machine Learning, or related field—or equivalent experience
Job Responsibility
  • Build and apply advanced GenAI models
  • Develop and fine-tune LLMs and embeddings for Atlassian’s unique knowledge and enterprise data
  • Implement retrieval-augmented generation (RAG), hybrid retrieval, and knowledge-grounded modeling approaches
  • Work hands-on with modern frameworks, contributing directly to high-value prototypes and experiments
  • Prototype and experiment quickly
  • Build proof-of-concept systems for GenAI-powered assistants, agentic workflows, and innovative user experiences
  • Run experiments, collect feedback, and iterate fast to validate impact
  • Design and implement evaluation methods for quality, groundedness, and user value
  • Collaborate and contribute
  • Work closely with peers across ML, engineering, and product teams to bring new ideas to life
What we offer
  • Health and wellbeing resources
  • Paid volunteer days

Senior Machine Learning Engineering Manager, Gen AI

We're seeking a Senior Machine Learning Manager (M60) to lead a cross-functional...
Location
United States
Salary:
193500.00 - 303150.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements
  • 8+ years in ML, search, or backend engineering roles, with 3+ years leading teams
  • Strong track record of shipping ML-powered or LLM-integrated user-facing products
  • Experience with RAG systems (vector search, hybrid retrieval, LLM orchestration)
  • Deep experience in either modeling (e.g., LLMs, search, NLP) or engineering (e.g., backend infra, full-stack), with the ability to lead end-to-end
  • Deep understanding of LLM ecosystems (OpenAI, Claude, Mistral, OSS), orchestration frameworks (LangChain, LlamaIndex), and vector databases (Weaviate, Pinecone, FAISS, etc.)
  • Strong product intuition and ability to translate complex tech into valuable user features
  • Familiarity with GenAI evaluation methods: hallucination detection, groundedness scoring, and human-in-the-loop feedback loops
  • Master’s or PhD in Computer Science, Machine Learning, or related field preferred—or equivalent practical experience
Job Responsibility
  • Lead the vision, design, and execution of LLM-powered AI products, leveraging advanced AI modeling (e.g., SLM post-training/fine-tuning), RAG architectures, and hybrid ranking systems
  • Define system architecture across retrievers, rankers, orchestration layers, prompt templates, and feedback mechanisms
  • Work closely with product and design teams to ensure delightful, fast, and grounded user experiences
  • Build and manage a cross-disciplinary team including ML engineers, backend/frontend engineers, and applied scientists
  • Foster a culture of E2E ownership — empowering the team to move from prototype to production quickly and iteratively
  • Mentor individuals to grow in both technical depth and product acumen
  • Shape the technical roadmap and long-term strategy for GenAI search across Atlassian’s product suite
  • Partner with platform and infra teams to scale inference, evaluate performance, and integrate usage signals for continuous improvement
  • Champion data quality, grounding, and responsible AI practices in all deployed features
What we offer
  • health and wellbeing resources
  • paid volunteer days
  • Fulltime

Senior Software Engineer – ML Model Compliance & Automation

We are seeking a highly skilled and motivated Senior Software Engineer to lead t...
Location
India, Jaipur
Salary:
Not provided
InfoObjects
Expiration Date
Until further notice
Requirements
  • Experience Required: 3 - 7 yrs
  • GoLang (preferred)
  • Python (preferred)
  • Bash
  • MLOps Tools: KitOps, MLModelCI, MLflow, ONNX, TensorFlow, PyTorch, Docker
  • SBOM & Security: Syft, Grype, Trivy, CycloneDX, SPDX
  • CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
  • Infra: Kubernetes, Docker, Helm, Terraform
  • Cloud: AWS, GCP, Azure (EKS/GKE/ECS preferred)
  • Version Control: Git, GitOps
Job Responsibility
  • Model Packaging & Artifact Management: Design and implement workflows for packaging ML models using KitOps, ONNX, MLflow, or TensorFlow SavedModel
  • Manage model artifact versioning, registries, and reproducibility
  • Ensure artifact integrity, consistency, and traceability across CI/CD pipelines
  • Model Profiling & Optimization: Automate model profiling (latency, size, ops) using MLModelCI, TorchServe, or ONNX Runtime
  • Apply quantization, pruning, and format conversions (e.g., FP32→INT8) for optimization
  • Embed profiling and optimization checks into CI/CD pipelines to assess deployment readiness
  • Compliance & SBOM Generation: Develop pipelines to generate and validate SBOMs for ML models
  • Implement compliance checks for licensing, vulnerabilities, and security using CycloneDX, SPDX, Syft, or Trivy
  • Validate schema, dependencies, and runtime environments for production readiness
  • Cloud Integration & Deployment: Automate model registration, endpoint creation, and monitoring setup in AWS/GCP/Azure
  • Fulltime

Research Engineering Manager, Post-Training

Meta is seeking a Research Engineering Manager to lead the Post-Training team wi...
Location
United States, Menlo Park
Salary:
219000.00 - 301000.00 USD / Year
Meta
Expiration Date
Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Machine Learning, or a related technical field
  • 4+ years of experience in machine learning engineering, machine learning research, or a related technical role
  • 3+ years of experience managing or leading technical teams, including hiring, mentoring, and performance management
  • Proficiency in Python and experience with ML frameworks such as PyTorch
  • Proven track record of leading medium to large-scale technical projects (specifically data pipelines or ML infrastructure) from conception to deployment
  • Software engineering practices including version control, testing, code review, and system design
  • Demonstrated ability to balance hands-on technical work with people management and strategic planning
  • Great communication skills with the ability to influence cross-functional stakeholders
Job Responsibility
  • Build, mentor, and grow a team of research engineers focused on full-stack post-training data infrastructure
  • Conduct performance reviews, career development conversations, and provide technical mentorship to team members
  • Foster a culture of engineering excellence, data rigor, and rapid iteration within the team
  • Partner with recruiting to hire world-class research engineering talent
  • Oversee the development and scaling of data collection pipelines for high-value domains (STEM, GDP-valuable tasks, finance, legal, health) and complex agentic workflows (deep research, computer use, shopping agents)
  • Establish and manage partnerships with external data vendors to source and securely prepare expert-level post-training datasets
  • Influence the technical roadmap for data infrastructure in collaboration with the MSL Infra team
  • Translate the strategic vision of research scientists into actionable engineering plans for synthetic data generation, SFT, and RLHF pipelines
  • Partner with research scientists, product teams, and model training teams to align data collection priorities with organizational capability goals
  • Build robust, reusable data pipelines that can rapidly deliver high-quality datasets to multiple model lines
What we offer
  • bonus
  • equity
  • benefits
  • Fulltime

Research Engineering Manager, Evaluations, Meta Superintelligence Labs

Meta is seeking a Research Engineering Manager to lead the Evaluations team with...
Location
United States, Menlo Park
Salary:
219000.00 - 301000.00 USD / Year
Meta
Expiration Date
Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Machine Learning, or a related technical field
  • 4+ years of experience in machine learning engineering, machine learning research, or a related technical role
  • 3+ years of experience managing or leading technical teams, including hiring, mentoring, and performance management
  • Proficiency in Python and experience with ML frameworks such as PyTorch
  • Proven track record of leading medium to large-scale technical projects from conception to deployment
  • Demonstrated experience balancing hands-on technical work with people management and strategic planning
  • Clear communication and experience influencing cross-functional stakeholders
Job Responsibility
  • Build, mentor, and grow a team of research engineers and scientists focused on evaluation infrastructure and benchmarking
  • Conduct performance reviews, career development conversations, and provide technical mentorship to team members
  • Foster a culture of engineering excellence, research rigor, and rapid iteration within the team
  • Partner with recruiting to hire world-class research engineering talent
  • Curate and integrate publicly available and internal benchmarks to direct the capabilities of frontier model development
  • Oversee the development and implementation of evaluation environments, including environments for novel model capabilities and modalities
  • Establish partnerships with external data vendors to source and prepare high-quality evaluation datasets
  • Influence the technical roadmap for evaluation infrastructure in collaboration with the MSL Infra team
  • Translate the technical vision of research scientists into actionable engineering plans and execution strategies
  • Partner with research scientists, product teams, and other engineering teams to align evaluation priorities with organizational goals
What we offer
  • bonus
  • equity
  • benefits
  • Fulltime