Research Scientist / Engineer – Training Infrastructure Job at Luma AI (Palo Alto)

Research Engineering Manager, Post-Training

Meta is seeking a Research Engineering Manager to lead the Post-Training team wi...

Location

United States , Menlo Park

Salary:

219000.00 - 301000.00 USD / Year

Research Engineer / Research Scientist - Foundations Retrieval Lead

The Foundations Research team works on high-risk, high-reward ideas that could s...

Location

United States , San Francisco

Salary:

445000.00 - 555000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Proven experience leading high-performance teams of researchers or engineers in ML infrastructure or foundational research
Deep technical expertise in representation learning, embedding models, or vector retrieval systems
Familiarity with transformer-based LLMs and how embedding spaces can interact with language model objectives
Research experience in areas such as contrastive learning, supervised or unsupervised embedding learning, or metric learning
A track record of building or scaling large machine learning systems, particularly embedding pipelines in production or research contexts
A first-principles mindset for challenging assumptions about how retrieval and memory should work for large models

Job Responsibility

Lead research into embedding models and retrieval systems optimized for grounding, relevance, and adaptive reasoning
Manage a team of researchers and engineers building end-to-end infrastructure for training, evaluating, and integrating embeddings into frontier models
Drive innovation in dense, sparse, and hybrid representation techniques, metric learning, and learning-to-retrieve systems
Collaborate closely with Pretraining, Inference, and other Research teams to integrate retrieval throughout the model lifecycle
Contribute to OpenAI’s long-term vision of AI systems with memory and knowledge access capabilities rooted in learned representations

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Fulltime

Member of technical staff - Research - Agent

About H: H exists to push the boundaries of superintelligence with agentic AI. B...

Location

France; United Kingdom , Paris; London

Salary:

Not provided

H Company

Expiration Date

Until further notice

Requirements

Senior Experience: Previous demonstrable role(s) as a Staff, Principal, or Senior Engineer (or equivalent Research Scientist) in a Frontier AI Lab with a proven track record of leading complex, end-to-end AI/ML projects from conception to production
Education / Publication: Preferably PhD (or equivalent research experience) in Machine Learning, Computer Science, or a related field, preferably with a strong publication record (e.g., NeurIPS, ICML, ICLR) in Computer Science
Core Expertise: Deep theoretical and practical expertise in Agentic AI and proven experience building, scaling, and shipping solutions involving foundation models (LLMs/VLMs)
Soft Skills: Collaborative: Enjoys collaboration and thrives in a teamwork-oriented, fast-paced research environment
High-Impact Communicator: Possesses impactful communication skills, with the ability to bridge the gap between research and engineering and articulate complex ideas clearly
Mission-Driven: Genuinely eager to explore and solve the new engineering and research challenges at the frontier of agentic AI

Job Responsibility

Research & Leadership: Design and develop new agents, proposing new research directions, e.g., combining state-of-the-art RL with foundation models (LLMs/VLMs)
Algorithm & Systems Design: Design, implement, and scale complex, high-performance systems for training large-scale agents. This includes both the foundational infrastructure and the novel algorithms, reward models, and sophisticated training environments
Research-to-Production: Collaborate closely with researchers and engineers to implement, test, and productionize new agent logics, learning algorithms, and system architectures
Evaluation & Reliability: Create, manage, and scale massive benchmarks and evaluation systems to rigorously track agent capabilities. You will own system reliability, scalability, and observability for our entire research infrastructure
Mentorship & Standards: Mentor and guide other engineers and researchers on the team, fostering technical excellence. You will establish and enforce engineering standards, tooling, and best practices for both code and research design
Innovation: Conduct thorough code and design reviews, champion technical innovation, and proactively address technical debt to accelerate the R&D lifecycle

What we offer

Join the exciting journey of shaping the future of AI, and be part of the early days of one of the hottest AI startups
Collaborate with a fun, dynamic, and multicultural team, working alongside world-class AI talent in a highly collaborative environment
Enjoy a competitive salary
Unlock opportunities for professional growth, continuous learning, and career development

Fulltime

Member of Technical Staff, Integration/RL Team (Research Engineer)

The integration team is responsible for developing and scaling machine learning ...

Location

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

Extremely strong software engineering skills
Value test-driven development methods, clean code, and strive to reduce technical debts at all levels
Proficiency in Python and related ML frameworks such as JAX, Pytorch and/or XLA/MLIR
Experience using and debugging large-scale distributed training strategies (memory/speed profiling)
[Bonus] Experience with distributed training infrastructures (Kubernetes) and associated frameworks (Ray)
[Bonus] Hands-on experience with the post-training phase of model training, with a strong emphasis on scalability and performance
[Bonus] Experience in ML, LLM and RL academic research

Job Responsibility

Design and write high-performing and scalable software for training models
Develop new tools to support and accelerate research and LLM training
Coordinate with other engineering teams (Infrastructure, Efficiency, Serving) and the scientific teams (Agent, Multimodal, Multilingual, etc.) to create a strong and integrated post-training ecosystem
Craft and implement techniques to improve performance and speed up our training cycles, both on SFT, offline preference, and the RL regime
Research, implement, and experiment with ideas on our cluster and data infrastructure
Collaborate, Collaborate, and Collaborate with other scientists, engineers, and teams!

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Fulltime

Senior Systems Engineer HPC

Location

India , Gurgaon

Salary:

Not provided

Rackspace

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related field (equivalent experience may substitute for degree)
Minimum of 10 years of systems experience, including at least 5 years working specifically with HPC
Strong knowledge of Linux operating systems (e.g., Rocky Linux, Ubuntu) with a fundamental understanding of Linux internals, system administration, and performance tuning
Experience building and managing RPM and DEB packages
Experience with cluster management tools such as Bright Cluster Manager, OpenHPC stack, or Warewulf
Proficiency with job schedulers and resource managers such as Slurm and LSF
Strong understanding of Linux networking (e.g., TCP/IP, DNS, routing) and HPC interconnects (e.g., InfiniBand, Ethernet) including performance tuning
Knowledge of parallel file systems such as Lustre, Ceph, or GPFS
Working knowledge of Linux authentication and directory services such as LDAP and Active Directory
Strong experience with DevOps and configuration management tools, including Ansible, Terraform, Jenkins, and Git

Job Responsibility

System Administration & Maintenance: Install, configure, and maintain HPC clusters (hardware, software, operating systems), perform regular updates/patching, manage user accounts and permissions, and troubleshoot/resolve hardware or software issues
Performance & Optimization: Monitor and analyse system and application performance, identify bottlenecks, implement tuning solutions, and profile workloads to improve efficiency
Cluster & Resource Management: Manage and optimize job scheduling, resource allocation, and cluster operations using tools such as Slurm, LSF, Bright Cluster Manager / Base Command Manager, OpenHPC, and Warewulf
Networking & Interconnects: Configure, manage, and tune Linux networking (TCP/IP, DNS, routing) and high-speed HPC interconnects (InfiniBand, Ethernet) to ensure low-latency, high-bandwidth communication
Storage & Data Management: Implement and maintain large-scale storage and parallel file systems (Lustre, Ceph, GPFS), ensure data integrity, manage backups, and support disaster recovery
Security & Authentication: Implement security controls, ensure compliance with policies, and manage authentication and directory services such as LDAP and Active Directory
DevOps & Automation: Use configuration management and DevOps practices (Ansible, Terraform, Jenkins, Git) to automate deployments, application packaging (RPM/DEB), and system configurations
User Support & Collaboration: Provide technical support, documentation, and training to researchers
collaborate with scientists, HPC architects, and engineers to align infrastructure with research needs
Planning & Innovation: Contribute to the design and planning of HPC infrastructure upgrades, evaluate and recommend hardware/software solutions, and explore cloud-based HPC solutions where applicable

Fulltime

Research Engineer

We’re looking for Research Engineers who enjoy getting their hands dirty: buildi...

Location

Canada , Toronto

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

Have a strong engineering background in machine learning, NLP, or related areas (through a Master’s degree, industry experience, or equivalent hands-on work)
Enjoy writing clean, reliable code and building systems that others can use and extend
Are comfortable experimenting, running ablations, analyzing results, and iterating quickly
Have experience with deep learning frameworks and model optimization techniques (PyTorch, distributed training, RLHF, finetuning, evaluation frameworks)
Like collaborating closely with researchers and translating ideas into practical implementations
Are excited to grow your research instincts while staying grounded in engineering excellence

Job Responsibility

Building experiments
Debugging models
Scaling training pipelines
Turning research ideas into working systems
Work closely with scientists and other engineers to implement new methods, run large-scale experiments, and help shape the infrastructure that supports our research programs

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Fulltime

Technical Program Manager, Research

We’re looking for a Technical Program Manager to partner closely with researcher...

Location

United States , Palo Alto

Salary:

Not provided

Luma AI

Expiration Date

Until further notice

Requirements

5+ years of experience in Technical Program Management, Engineering Program Management, or similar role
Strong technical background with the ability to engage deeply with: Machine learning concepts (especially deep learning), Large-scale training and experimentation workflows, Distributed systems or ML infrastructure
Experience working directly with researchers or research-adjacent teams
Proven ability to manage ambiguous, fast-evolving technical programs
Excellent communication skills — able to align highly technical stakeholders

Job Responsibility

Partner with research scientists, ML engineers, and infrastructure teams to plan and deliver programs for generative video model development
Translate research goals into clear technical milestones, timelines, and dependencies
Drive execution across the full lifecycle: experimentation → training → evaluation → scaling → deployment
Coordinate cross-functional efforts spanning: Model training and evaluation, Data pipelines and curation, Compute planning (GPU/TPU usage, scheduling, cost awareness), Inference optimization and deployment
Create lightweight but effective program artifacts (roadmaps, risk registers, decision logs)
Identify risks early (technical, resourcing, compute, data) and proactively drive mitigations
Improve operational rigor without slowing down research velocity
Act as a connective tissue between research, product, and platform teams
Help define and evolve best practices for running large-scale AI research programs

What we offer

Competitive compensation, meaningful equity, and strong benefits

Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States , Palo Alto

Salary:

90000.00 - 300000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python
strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation
a 401K savings plan vested from day one that offers a 6% match
performance and recognition-based incentives
and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Supports flexibility- We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Research Scientist / Engineer – Training Infrastructure

Luma AI

Location:
United States , Palo Alto

Category:
IT - Software Development

Contract Type:
Not provided

Salary:

Job Description:

Job Responsibility:

Requirements:

Nice to have:

Additional Information:

Job Posted:
January 13, 2026

Looking for more opportunities? Search for other job offers that match your skills and interests.

Similar Jobs for Research Scientist / Engineer – Training Infrastructure

Research Engineering Manager, Post-Training

Research Engineer / Research Scientist - Foundations Retrieval Lead

Member of technical staff - Research - Agent