
Research Scientist / Engineer – Training Infrastructure

United States, Palo Alto · 187500.00 - 395000.00 USD / Year · Job Posted January 13, 2026

Job Description

Luma’s mission is to build multimodal AI to expand human imagination and capabilities. We believe that multimodality is critical for intelligence. To go beyond language models and build more aware, capable and useful systems, the next step-function change will come from vision. So, we are working on training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.

We are looking for engineers with significant experience solving hard problems in PyTorch, CUDA and distributed systems. You will work alongside the rest of the research team to build and train cutting-edge foundation models, built to scale from the ground up, on thousands of GPUs.

The Training Infrastructure team at Luma is responsible for building and maintaining the distributed systems that enable training of our large-scale multimodal models across thousands of GPUs. This team ensures our researchers can focus on innovation while having access to reliable, efficient, and scalable training infrastructure that pushes the boundaries of what's possible in AI model development.

Job Responsibility

  • Design, implement, and optimize efficient distributed systems for training models across thousands of GPUs
  • Research and implement advanced parallelization techniques (FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel)
  • Build monitoring, visualization, and debugging tools for large-scale training runs
  • Optimize training stability, convergence, and resource utilization across massive clusters

Requirements

  • Extensive experience with distributed PyTorch training and parallelism techniques in foundation model training
  • Deep understanding of GPU clusters, networking, and storage systems
  • Familiarity with communication libraries (NCCL, MPI) and distributed system optimization

Nice to have

  • Strong Linux systems administration and scripting capabilities
  • Experience managing training runs across >100 GPUs
  • Experience with containerization, orchestration, and cloud infrastructure


Similar Jobs for Research Scientist / Engineer – Training Infrastructure

8 matching positions

Research Engineer / Research Scientist - Foundations Retrieval Lead

The Foundations Research team works on high-risk, high-reward ideas that could s...
Location: United States, San Francisco
Salary: 445000.00 - 555000.00 USD / Year
Company: OpenAI
Expiration Date: Until further notice
Requirements
  • Proven experience leading high-performance teams of researchers or engineers in ML infrastructure or foundational research
  • Deep technical expertise in representation learning, embedding models, or vector retrieval systems
  • Familiarity with transformer-based LLMs and how embedding spaces can interact with language model objectives
  • Research experience in areas such as contrastive learning, supervised or unsupervised embedding learning, or metric learning
  • A track record of building or scaling large machine learning systems, particularly embedding pipelines in production or research contexts
  • A first-principles mindset for challenging assumptions about how retrieval and memory should work for large models
Job Responsibility
  • Lead research into embedding models and retrieval systems optimized for grounding, relevance, and adaptive reasoning
  • Manage a team of researchers and engineers building end-to-end infrastructure for training, evaluating, and integrating embeddings into frontier models
  • Drive innovation in dense, sparse, and hybrid representation techniques, metric learning, and learning-to-retrieve systems
  • Collaborate closely with Pretraining, Inference, and other Research teams to integrate retrieval throughout the model lifecycle
  • Contribute to OpenAI’s long-term vision of AI systems with memory and knowledge access capabilities rooted in learned representations
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment type: Full-time

Research Scientist / Engineer – Foundation Model: Core Research

This is a rare and foundational opportunity to define the future of multimodal A...
Location: United States, Palo Alto
Salary: 250000.00 - 450000.00 USD / Year
Company: Luma AI
Expiration Date: Until further notice
Requirements
  • A Bachelor's, Master's, or PhD degree in Computer Science, Machine Learning, Physics, or Mathematics is essential
  • A 'first-principles' intuition for scaling
  • Fluent in the language of frontier AI
  • Proven ability to design and rigorously analyze experiments and to articulate complex technical concepts effectively
  • Practical experience with distributed or high-performance computing environments, particularly managing and optimizing training runs on large-scale GPU clusters
Job Responsibility
  • Unified Modeling & Efficiency: Drive the core research that powers all of Luma's products, co-designing multimodal representations, advancing core algorithms for long-context training, and establishing rigorous scaling laws to predict performance across compute budgets
  • Alignment & Evaluation: Close the gap between training loss and user experience. Develop proxy tasks and automated metrics that serve as the compass for research decisions, ensuring our models optimize for what actually matters to users, not just benchmarks
  • Research Infrastructure: Build the engine for high-velocity research. Maintain production-research parity, ensure reproducibility, and design systems for rapid experimentation, so that novel ideas go from hypothesis to validated result as fast as possible
Employment type: Full-time

Senior Machine Learning Engineer (Research Scientist) - Data Foundation & AI

We build simple yet innovative consumer products and developer APIs that shape h...
Location: United States, New York
Salary: 228960.00 - 315360.00 USD / Year
Company: Plaid
Expiration Date: Until further notice
Requirements
  • Strong applied ML research skills with production delivery experience
  • Depth in Transformers/LLMs, representation learning, or large-scale model training
  • Demonstrated ability to ship models to production (not just prototype)
  • Distributed training experience and strong Python + software engineering fundamentals
  • Fintech / financial data domain experience is a plus
  • External publications or open-source contributions are a plus
Job Responsibility
  • Building a foundation model on one of the world’s richest financial datasets that no one else has
  • Doing research that ships: moving from experimentation and prototypes to production systems serving real customers
  • Working across the full ML stack, from pretraining objectives and architectures to serving infrastructure and monitoring
  • Collaborating with a high-caliber team and seeing your work amplify the capabilities of multiple product teams
  • Helping hundreds of millions of consumers achieve greater financial freedom through data-driven products
Employment type: Full-time

Junior Research Infrastructure Engineer

We are seeking a Product-Minded Junior Research Infrastructure Engineer to join ...
Location: United States, Sunnyvale
Salary: Not provided
Company: Meshy LLC
Expiration Date: Until further notice
Requirements
  • 2+ years of experience in software engineering, backend development, or distributed systems
  • Strong programming skills in Python (Scala/Java/C++ a plus)
  • Familiarity with distributed frameworks (Spark, Dask, Ray) and cloud platforms (AWS/GCP/Azure)
  • Experience with workflow orchestration tools (Temporal, Celery, or Airflow)
  • Proficiency with Infrastructure as Code (Terraform) and CI/CD tools (GitHub Actions)
  • Experience building web applications or internal tools using React or Next.js
  • A 'product-first' mindset: an interest in how users interact with infrastructure and a desire to build clean, functional interfaces
Job Responsibility
  • Participate in the design and implementation of distributed task orchestration systems using Temporal or Celery
  • Architect pipelines across cloud object storage (S3, GCS), data lakes, and metadata catalogs
  • Implement partitioning, sharding, and caching strategies to ensure data processing pipelines are resilient, highly available, and consistent
  • Design, implement, and maintain distributed ingestion pipelines for structured and unstructured data (images, 3D/2D assets, binaries)
  • Build scalable ETL/ELT workflows to transform, validate, and enrich datasets for AI/ML model training and analytics
  • Support preprocessing of unstructured assets (e.g., images, 3D/2D models, video) for training pipelines, including format conversion, normalization, augmentation, and metadata extraction
  • Implement validation and quality checks to ensure datasets meet ML training requirements
  • Collaborate with ML researchers to quickly adapt pipelines to evolving pretraining and evaluation needs
  • Use infrastructure-as-code (Terraform, Kubernetes, etc.) to manage scalable and reproducible environments
  • Manage data assets using Databricks Asset Bundles (DABs) and build rigorous CI/CD pipelines (GitHub Actions)
What we offer
  • Competitive salary, equity, and benefits package
  • Opportunity to work with a talented and passionate team at the forefront of AI and 3D technology
  • Flexible work environment, with options for remote and on-site work
  • Opportunities for fast professional growth and development
  • An inclusive culture that values creativity, innovation, and collaboration
  • Unlimited, flexible time off
  • Stock options available for core team members
  • 401(k) plan for employees
  • Comprehensive health, dental, and vision insurance
  • The latest and best office equipment
Employment type: Full-time

Principal/Senior Applied Scientist Security Models Training Team - Next-Gen frontier research

The Security Models Training team is expanding to drive the development of a new...
Location: Israel, Tel Aviv, Herzliya
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • M.Sc. / Ph.D. in Computer Science, Information Systems, Electrical or Computer Engineering or Data Science (Ph.D. strongly preferred)
  • Candidates with M.Sc. / Ph.D. in related fields with proven industry experience or a strong publication record in the areas of LLM, Information Retrieval, Machine Learning, Natural Language Processing, Time Series Forecasting and Deep Learning are considered as well
  • Proven hands-on experience of at least 5 years (including post-grad work) in building and deploying Machine Learning products
  • Key areas of expertise include Natural Language Processing and Large Language Models, along with an understanding of concepts such as Privacy and Responsible AI
  • Candidates are expected to demonstrate a strong history of successfully translating applied research into production-ready solutions, along with a proven track record of delivering projects within large-scale production environments
  • Proven expertise in the LLM and/or time-series forecasting domain, demonstrating comprehensive knowledge of relevant concepts in the domain
  • Ideal applicants should be proficient in areas such as LLM pre- and post-training, including CPT, SFT and RL, LLM benchmarking, agentic flows, and model alignment
  • Hands-on experience in building neural model architectures at the 100M+ scale and the proficiency to adapt them at all abstraction levels down to the individual block (e.g., changing the inner workings of an attention block, introducing new blocks, or changing the routings)
  • Demonstrated proficiency in problem-solving and data analysis, with substantial expertise in evaluating the performance of large language models (LLMs) and/or time-series forecasting models, developing benchmarks tailored to practical scenarios
Job Responsibility
  • Technical Leadership & Ownership – Set technical direction for major security domain initiatives; lead security model programs spanning pre‑training, task tuning, reinforcement learning, and evaluation; translate cutting‑edge research into production‑ready capabilities
  • Advanced Model Design – Build and customize deep learning model architectures (e.g., modifying transformer blocks, attention/memory modules, etc.) at the SLM/LLM scale; make principled architectural tradeoffs to improve reliability, robustness, and security‑specific behavior
  • Advanced Model Training – Apply deep expertise in pre-training, post-training, and reinforcement learning (RL) for both language and other modalities, including time-series
  • Design & Evaluate Datasets – Build high-quality datasets and benchmarks; define objective evaluation frameworks and quality gates; run ablation studies to measure impact and optimize data and training effectiveness to support confident product decisions
  • Develop Data Infrastructure – Create and maintain scalable pipelines for ingestion, preprocessing, filtering, and annotation of large, complex datasets, with attention to privacy, governance, and long‑term reuse across security scenarios
Employment type: Full-time

Research Scientist, Machine Learning

Meta is seeking a Research Scientist to join its AI research organization, focus...
Location: France, Paris
Salary: Not provided
Company: Meta
Expiration Date: Until further notice
Requirements
  • Currently has, or is in the process of obtaining, a PhD in Computer Science, Statistics, Mathematics, Electrical Engineering, or a related quantitative field
  • 2+ years of experience designing, training, and evaluating machine learning models using deep learning frameworks such as PyTorch or JAX
  • Experience with Reinforcement Learning and/or Large Language Models (LLMs), including areas such as RLHF, policy optimization, or LLM fine-tuning
  • Experience implementing and iterating on ML experiments end-to-end, from data pipeline construction through model evaluation and analysis
  • Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment
Job Responsibility
  • Design and execute machine learning research experiments, including model architecture design, training recipe development, and rigorous evaluation across benchmarks and production metrics
  • Develop and optimize ML models and algorithms targeting core research problems such as representation learning, generative modeling, or large-scale optimization
  • Implement clean, reusable, and well-tested research code using modern ML frameworks such as PyTorch, contributing to shared research infrastructure
  • Analyze experimental results to form data-driven conclusions, identify failure modes, and iterate on hypotheses in collaboration with research and engineering partners
  • Contribute to the preparation and submission of research findings to peer-reviewed venues such as NeurIPS, ICML, ICLR, or CVPR
  • Partner with cross-functional teams including product engineering, data science, and infrastructure to translate research advances into production-ready solutions
  • Participate in code review processes to maintain research code quality, catching subtle issues including those arising from AI-assisted code generation
  • Instrument experiments with appropriate logging, monitoring, and evaluation pipelines to ensure reproducibility and reliability of research outcomes
  • Contribute to the research community through open-source releases, technical documentation, and knowledge sharing within the team
Employment type: Full-time

AI Research Scientist – Generative AI for Materials Discovery

Location: United States, Redmond
Salary: 154000.00 - 217000.00 USD / Year
Company: Meta
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Ph.D. degree in Machine Learning, Computational Chemistry, Materials Science, Chemical Engineering, Physics, or a closely related technical field
  • 3+ years of research experience in generative modeling applied to molecular systems, crystal structures, or materials science (academic or industry)
  • Familiarity with large-scale molecular and crystal databases and data processing pipelines for chemical data
  • Demonstrated expertise in deep generative models — including diffusion models, flow matching / continuous normalizing flows, variational autoencoders, or autoregressive models — with applications to 3D molecular or crystal structure generation
  • Programming proficiency in Python with hands-on experience in PyTorch or JAX; proficiency in building, training, and evaluating large-scale deep learning models
  • Track record of first-author publications in top-tier ML or computational chemistry venues (e.g., NeurIPS, ICML, ICLR, JACS, Nature Computational Science, Digital Discovery)
  • Solid understanding of crystallography fundamentals and molecular representations (molecular graphs, SMILES, 3D conformers)
Job Responsibility
  • Develop, train, and deploy generative models (diffusion models, flow matching, variational autoencoders, transformer-based architectures) for molecular and crystal structure generation, property-conditioned design, and crystal structure prediction (CSP)
  • Design and implement reinforcement learning and alignment strategies (e.g., physics-informed reward signals from machine-learned interatomic potentials) to steer generative models toward physically stable and synthesizable candidates
  • Build foundational models and scalable pretraining pipelines that unify generative and predictive learning across molecules and crystalline materials, handling both discrete atom types and continuous 3D geometries
  • Collaborate closely with computational chemists to integrate first-principles calculations (DFT, force fields), molecular dynamics simulations, and domain-specific constraints into generative workflows
  • Partner with AI agent scientists to embed generative molecular design capabilities into LLM-based multi-agent systems, enabling closed-loop autonomous experiment planning, candidate generation, and decision making
  • Curate, preprocess, and manage large-scale molecular and crystal structure datasets for model training and benchmarking
  • Establish rigorous evaluation frameworks — measuring validity, novelty, uniqueness, stability, and synthesizability of generated structures — and benchmark against state-of-the-art methods
  • Contribute to the architecture and roadmap of the autonomous materials-discovery platform, ensuring generative design modules interface seamlessly with robotic workcells, characterization instruments, and data infrastructure
What we offer
  • bonus + equity + benefits
Employment type: Full-time

AI Research Scientist, Computer Vision

Meta is seeking Research Scientists to join Meta Superintelligence Labs (MSL) or...
Location: United States, Menlo Park
Salary: 154000.00 - 217000.00 USD / Year
Company: Meta
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Currently has or is in the process of obtaining a PhD in the field of Computer Vision, Language, Machine Learning, a related field, or equivalent practical experience. Degree must be completed prior to joining Meta
  • Research experience in deep learning, computer vision, robotics, or AI infrastructure
  • Experience with Python and PyTorch
  • Experience developing software and executing complex experiments
  • Experience communicating research for public audiences of peers
  • Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment
Job Responsibility
  • Build models that can estimate underlying 3D properties from 2D observations
  • Conduct research and work with models in 3D reconstruction and generation
  • Define, build and benchmark new capabilities needed for the next generation of AI
  • Train and optimize state-of-the-art machine learning and neural network methodologies
  • Work with and create large datasets, and contribute to data annotation engines
What we offer
  • bonus
  • equity
  • benefits
Employment type: Full-time