
Research Scientist / Engineer – Training Infrastructure

Luma AI

Location:
United States, Palo Alto


Contract Type:
Not provided

Salary:

187500.00 - 395000.00 USD / Year

Job Description:

Luma’s mission is to build multimodal AI that expands human imagination and capabilities. We believe multimodality is critical for intelligence. To go beyond language models and build more aware, capable, and useful systems, the next step-function change will come from vision. So we are training and scaling up multimodal foundation models for systems that can see and understand, show and explain, and eventually interact with our world to effect change.

We are looking for engineers with significant experience solving hard problems in PyTorch, CUDA, and distributed systems. You will work alongside the rest of the research team to build and train cutting-edge foundation models, designed to scale from the ground up, on thousands of GPUs.

The Training Infrastructure team at Luma is responsible for building and maintaining the distributed systems that enable training of our large-scale multimodal models across thousands of GPUs. The team ensures our researchers can focus on innovation while having access to reliable, efficient, and scalable training infrastructure that pushes the boundaries of what's possible in AI model development.

Job Responsibility:

  • Design, implement, and optimize efficient distributed training systems for models trained across thousands of GPUs
  • Research and implement advanced parallelization techniques (FSDP, Tensor Parallel, Pipeline Parallel, Expert Parallel)
  • Build monitoring, visualization, and debugging tools for large-scale training runs
  • Optimize training stability, convergence, and resource utilization across massive clusters

Requirements:

  • Extensive experience with distributed PyTorch training and parallelisms in foundation model training
  • Deep understanding of GPU clusters, networking, and storage systems
  • Familiarity with communication libraries (NCCL, MPI) and distributed system optimization

Nice to have:

  • Strong Linux systems administration and scripting capabilities
  • Experience managing training runs across >100 GPUs
  • Experience with containerization, orchestration, and cloud infrastructure

Additional Information:

Job Posted:
January 13, 2026

Employment Type:
Full-time
Work Type:
Remote work

Similar Jobs for Research Scientist / Engineer – Training Infrastructure

Machine Learning Engineer II - Training

As a Machine Learning Engineer II on our Training team, you will develop algorit...
Location:
United States, Boston
Salary:
125000.00 - 170000.00 USD / Year
Whoop
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Mathematics, Statistics, Computer Science, or a related field
  • 2+ years of ML engineering, applied research, or a similar role
  • 2+ years of experience applying advanced mathematical and statistical techniques
  • Experience deploying and maintaining production ML systems on cloud platforms (e.g., Kubernetes, AWS, GCP)
  • Familiarity with MLOps best practices and the ability to collaborate effectively with infrastructure teams on Docker, CI/CD workflows, model versioning, and observability tools
  • Proficiency in scientific Python and SQL
  • Excellent verbal and written communication skills
Job Responsibility:
  • Design, train, and optimize machine learning algorithms for movement, exercise and training applications across diverse backend platforms
  • Collaborate closely with data scientists, ML Ops and software engineering teams to ensure reliable deployment, observability, and robust integration with the WHOOP ecosystem
  • Contribute to technical roadmap development and architectural decision-making for projects that you are involved in
  • Work closely with a team of data scientists in developing algorithms that power member-facing features
  • Work with Data Engineers to improve data pipelining, tooling for machine learning, and systems for quality and validation
  • Periodically serve as the on-call data scientist to respond in real time to incidents affecting production services
What we offer:
  • equity
  • benefits
  • Full-time position

Senior Systems Engineer HPC

Location:
India, Gurgaon
Salary:
Not provided
Rackspace
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related field (equivalent experience may substitute for degree)
  • Minimum of 10 years of systems experience, including at least 5 years working specifically with HPC
  • Strong knowledge of Linux operating systems (e.g., Rocky Linux, Ubuntu) with a fundamental understanding of Linux internals, system administration, and performance tuning
  • Experience building and managing RPM and DEB packages
  • Experience with cluster management tools such as Bright Cluster Manager, OpenHPC stack, or Warewulf
  • Proficiency with job schedulers and resource managers such as Slurm and LSF
  • Strong understanding of Linux networking (e.g., TCP/IP, DNS, routing) and HPC interconnects (e.g., InfiniBand, Ethernet) including performance tuning
  • Knowledge of parallel file systems such as Lustre, Ceph, or GPFS
  • Working knowledge of Linux authentication and directory services such as LDAP and Active Directory
  • Strong experience with DevOps and configuration management tools, including Ansible, Terraform, Jenkins, and Git
Job Responsibility:
  • System Administration & Maintenance: Install, configure, and maintain HPC clusters (hardware, software, operating systems), perform regular updates/patching, manage user accounts and permissions, and troubleshoot/resolve hardware or software issues
  • Performance & Optimization: Monitor and analyse system and application performance, identify bottlenecks, implement tuning solutions, and profile workloads to improve efficiency
  • Cluster & Resource Management: Manage and optimize job scheduling, resource allocation, and cluster operations using tools such as Slurm, LSF, Bright Cluster Manager / Base Command Manager, OpenHPC, and Warewulf
  • Networking & Interconnects: Configure, manage, and tune Linux networking (TCP/IP, DNS, routing) and high-speed HPC interconnects (InfiniBand, Ethernet) to ensure low-latency, high-bandwidth communication
  • Storage & Data Management: Implement and maintain large-scale storage and parallel file systems (Lustre, Ceph, GPFS), ensure data integrity, manage backups, and support disaster recovery
  • Security & Authentication: Implement security controls, ensure compliance with policies, and manage authentication and directory services such as LDAP and Active Directory
  • DevOps & Automation: Use configuration management and DevOps practices (Ansible, Terraform, Jenkins, Git) to automate deployments, application packaging (RPM/DEB), and system configurations
  • User Support & Collaboration: Provide technical support, documentation, and training to researchers; collaborate with scientists, HPC architects, and engineers to align infrastructure with research needs
  • Planning & Innovation: Contribute to the design and planning of HPC infrastructure upgrades, evaluate and recommend hardware/software solutions, and explore cloud-based HPC solutions where applicable
  • Full-time position

Technical Program Manager, Research

We’re looking for a Technical Program Manager to partner closely with researcher...
Location:
United States, Palo Alto
Salary:
Not provided
Luma AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in Technical Program Management, Engineering Program Management, or similar role
  • Strong technical background with the ability to engage deeply with machine learning concepts (especially deep learning), large-scale training and experimentation workflows, and distributed systems or ML infrastructure
  • Experience working directly with researchers or research-adjacent teams
  • Proven ability to manage ambiguous, fast-evolving technical programs
  • Excellent communication skills — able to align highly technical stakeholders
Job Responsibility:
  • Partner with research scientists, ML engineers, and infrastructure teams to plan and deliver programs for generative video model development
  • Translate research goals into clear technical milestones, timelines, and dependencies
  • Drive execution across the full lifecycle: experimentation → training → evaluation → scaling → deployment
  • Coordinate cross-functional efforts spanning model training and evaluation, data pipelines and curation, compute planning (GPU/TPU usage, scheduling, cost awareness), and inference optimization and deployment
  • Create lightweight but effective program artifacts (roadmaps, risk registers, decision logs)
  • Identify risks early (technical, resourcing, compute, data) and proactively drive mitigations
  • Improve operational rigor without slowing down research velocity
  • Act as a connective tissue between research, product, and platform teams
  • Help define and evolve best practices for running large-scale AI research programs
What we offer:
  • Competitive compensation, meaningful equity, and strong benefits
  • Full-time position

Machine Learning Platform / Backend Engineer

We are seeking a Machine Learning Platform/Backend Engineer to design, build, an...
Location:
Serbia, Belgrade; Romania, Timișoara
Salary:
Not provided
Everseen
Expiration Date:
Until further notice
Requirements:
  • 4-5+ years of work experience in ML infrastructure, MLOps, or platform engineering
  • Bachelor's degree or equivalent, preferably in computer science or a related field
  • Excellent communication and collaboration skills
  • Expert knowledge of Python
  • Experience with CI/CD tools (e.g., GitLab, Jenkins)
  • Hands-on experience with Kubernetes, Docker, and cloud services
  • Understanding of ML training pipelines, data lifecycle, and model serving concepts
  • Familiarity with workflow orchestration tools (e.g., Airflow, Kubeflow, Ray, Vertex AI, Azure ML)
  • A demonstrated understanding of the ML lifecycle, model versioning, and monitoring
  • Experience with ML frameworks (e.g., TensorFlow, PyTorch)
Job Responsibility:
  • Design, build, and maintain scalable infrastructure that empowers data scientists and machine learning engineers
  • Own the design and implementation of the internal ML platform, enabling end-to-end workflow orchestration, resource management, and automation using cloud-native technologies (GCP/Azure)
  • Design and manage Kubernetes-based infrastructure for multi-tenant GPU and CPU workloads with strong isolation, quota control, and monitoring
  • Integrate and extend orchestration tools (Airflow, Kubeflow, Ray, Vertex AI, Azure ML or custom schedulers) to automate data processing, training, and deployment pipelines
  • Develop shared services for model behavior/performance tracking, data/datasets versioning, and artifact management (MLflow, DVC, or custom registries)
  • Build out documentation in relation to architecture, policies and operations runbooks
  • Share skills, knowledge, and expertise with members of the data engineering team
  • Foster a culture of collaboration and continuous learning by organizing training sessions, workshops, and knowledge-sharing sessions
  • Collaborate and drive progress with cross-functional teams to design and develop new features and functionalities
  • Ensure that the developed solutions meet project objectives and enhance user experience
  • Full-time position

Product Marketing Manager for AI Cloud Providers and Foundation Model Builders

Product Marketing Manager for AI Cloud Providers and Foundation Model Builders –...
Location:
United States
Salary:
Not provided
VAST Data
Expiration Date:
Until further notice
Requirements:
  • 8+ years of proven experience in product or solution marketing within cloud services, AI/ML platforms, large-scale data infrastructure, data storage, operating systems, or related fields, with a strong understanding of 'as-a-Service' GTM models and/or the needs of large-scale AI research and development
  • Deep understanding of infrastructure for cloud service providers and/or large-scale AI training/inference clusters, including multi-tenant architectures, service orchestration, distributed computing, virtualization, containers, and unified file/object storage solutions at petabyte to exabyte scale
  • Familiarity with modern AI/ML data pipelines, the lifecycle of Foundation Models (data ingestion, pre-processing, training, fine-tuning, inference), and analytics workloads, and how they are delivered as cloud services or built in dedicated environments
  • Strong expertise in aligning complex technical solutions like the VAST AI OS with business-driven objectives for both AI Cloud Providers (e.g., service differentiation, new revenue streams, TCO reduction) and Foundation Model Builders (e.g., faster time-to-model, research breakthroughs, efficient scaling, optimized resource utilization)
  • Exceptional communication skills, with proven ability to articulate technical concepts clearly to both technical and business audiences
  • Demonstrated success developing and executing impactful marketing strategies and campaigns
  • Highly collaborative with a proactive approach to managing cross-functional projects
  • Willingness to travel for customer engagements, industry events, and internal meetings
Job Responsibility:
  • Develop and execute strategic go-to-market plans tailored to AI Cloud Providers and Foundation Model Builders for the VAST AI OS
  • Craft compelling, differentiated messaging and positioning for the VAST AI OS that resonates with stakeholders at AI Cloud Providers (product management, service architects, business development) and at organizations building Foundation Models (AI researchers, MLOps engineers, data scientists, infrastructure leads)
  • Conduct market analysis, identifying trends, threats, and opportunities in the AI cloud services, large-scale AI model development, and underlying data infrastructure landscape, relevant to the VAST AI OS
  • Translate complex technical features of the VAST AI OS into clear benefits for AI Cloud Providers (service differentiation, revenue opportunities, TCO) and for Foundation Model Builders (accelerated training, reduced data management overhead, faster iteration cycles, scalable deployment)
  • Serve as an expert resource on architectures for AI cloud services and large-scale model development, including multi-tenancy, service orchestration, distributed training, high-performance data pipelines, and how the VAST AI OS underpins these for AI-as-a-Service and Foundation Model lifecycles, emphasizing unified file and object storage, data protection, compliance, and security
  • Collaborate closely with Product Management and Engineering to influence the VAST AI OS roadmap and direction based on the unique requirements of AI Cloud Providers, Foundation Model Builders, and market insights for AI services and model development
  • Create high-impact sales tools, presentations, reference architectures, product demonstrations, webinars, and training materials for the VAST AI OS that effectively communicate technical and business advantages to and through AI Cloud Providers, and directly to organizations building Foundation Models
  • Support partner development, sales teams, and direct engagement efforts with strategic responses to AI Cloud Provider opportunities, Foundation Model initiatives leveraging the VAST AI OS, and joint RFI/RFPs
  • Engage regularly with AI Cloud Providers and key players in the Foundation Model ecosystem to capture insights, validate VAST AI OS positioning, and foster advocacy and joint marketing opportunities
  • Produce influential content for the VAST AI OS including whitepapers, case studies, solution briefs, blogs, and FAQs tailored to AI Cloud Provider and Foundation Model Builder audiences and their respective customers or users
  • Full-time position

ML Platform Engineer

We are seeking a Machine Learning Engineer to help build and scale our machine l...
Location:
United States
Salary:
Not provided
Duetto
Expiration Date:
Until further notice
Requirements:
  • 3+ years of experience in ML engineering or a similar role building and deploying machine learning models in production
  • Strong experience with AWS ML services (SageMaker, Lambda, EMR, ECR) for training, serving, and orchestrating model workflows
  • Hands-on experience with Kubernetes (e.g., EKS) for container orchestration and job execution at scale
  • Strong proficiency in Python, with exposure to ML/DL libraries such as TensorFlow, PyTorch, scikit-learn
  • Experience working with feature stores, data pipelines, and model versioning tools (e.g., SageMaker Feature Store, Feast, MLflow)
  • Familiarity with infrastructure-as-code and deployment tools such as Terraform, GitHub Actions, or similar CI/CD systems
  • Experience with logging and monitoring stacks such as Prometheus, Grafana, CloudWatch, or similar
  • Experience working in cross-functional teams with data scientists and DevOps engineers to bring models from research to production
  • Strong communication skills and ability to operate effectively in a fast-paced, ambiguous environment with shifting priorities
Job Responsibility:
  • Develop, maintain, and scale machine learning pipelines for training, validation, and batch or real-time inference across thousands of hotel-specific models
  • Build reusable components to support model training, evaluation, deployment, and monitoring within a Kubernetes- and AWS-based environment
  • Partner with data scientists to translate notebooks and prototypes into production-grade, versioned training workflows
  • Implement and maintain feature engineering workflows, integrating with custom feature pipelines and supporting services
  • Collaborate with platform and DevOps teams to manage infrastructure-as-code (Terraform), automate deployment (CI/CD), and ensure reliability and security
  • Integrate model monitoring for performance metrics, drift detection, and alerting (using tools like Prometheus, CloudWatch, or Grafana)
  • Improve retraining, rollback, and model versioning strategies across different deployment contexts
  • Support experimentation infrastructure and A/B testing integrations for ML-based products

Machine Learning Engineering Team Lead

Lead a high-performing team focused on building large-scale distributed training...
Location:
Germany, Berlin
Salary:
Not provided
Aignostics
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science, Engineering, Mathematics, or a related field
  • 6+ years of software engineering or ML engineering experience, with at least 2 years in a technical leadership or team lead role
  • Proven track record of building and leading high-performing engineering teams
  • Experience guiding projects across the whole Software Development Life Cycle
  • Deep understanding of fundamental Machine Learning concepts and principles, familiarity with advanced model optimization techniques
  • Significant experience with large-scale distributed training systems and frameworks (especially PyTorch and NCCL)
  • Familiarity with GPUs, distributed systems, parallel computing and scaling laws
  • Advanced programming skills in Python, experience in performance-critical languages (C/C++ or CUDA) being a plus
  • Familiarity with MLOps/DevOps best practices including CI/CD, Docker, Kubernetes, and observability, cloud platforms (GCP, AWS or Azure) and infrastructure-as-code
  • Experience with Linux, version control, and container technologies
Job Responsibility:
  • Build and scale a high-performing team capable of tackling complex distributed ML challenges
  • Own the full employee lifecycle: recruiting, onboarding, performance management, career development, and retention
  • Empower your team members and help them grow in autonomy and technical expertise
  • Mentor engineers at all levels, fostering a culture of continuous learning and psychological safety
  • Create an inclusive environment where diverse perspectives drive innovation
  • Define and execute technical roadmaps aligned with company objectives and product needs
  • Lead resource allocation and capacity planning to balance team workload and business priorities
  • Own FinOps responsibilities: optimize cloud costs, track spending, and ensure efficient resource utilization
  • Ensure operational readiness through monitoring, incident response protocols, and system reliability practices
  • Establish and track KPIs for team performance, system efficiency and health
What we offer:
  • Learning & Development yearly budget of 1,000€ (plus 2 L&D days)
  • Language classes, and internal development programs
  • Access to leadership development programs and executive coaching
  • Flexible working hours and teleworking policy
  • 30 paid vacation days per year
  • Family- and pet-friendly workplace, with support for flexible parental leave options
  • Subsidized membership of your choice among public transport, sports, and well-being
  • Social gatherings, lunches, and off-site events for a fun and inclusive work environment
  • Optional company pension scheme

Senior Machine Learning Engineer, Personalization and Recommendations

As a Senior Machine Learning Engineer on the Personalization & Recommendations t...
Location:
United States, San Francisco
Salary:
183360.00 - 248000.00 USD / Year
EdTech Jobs
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in applied machine learning or ML-heavy software engineering, with a strong focus on personalization, ranking, or recommendation systems
  • Demonstrated impact improving key metrics such as CTR, retention, or engagement through recommender or search systems in production
  • Strong hands-on skills in Python and PyTorch, with expertise in data and feature engineering, distributed training and inference on GPUs, and familiarity with modern MLOps practices — including model registries, feature stores, monitoring, and drift detection
  • Deep understanding of retrieval and ranking architectures, such as Two-Tower models, deep cross networks, Transformers, or MMoE, and the ability to apply them to real-world problems
  • Experience with large-scale embedding models and vector search, including FAISS, ScaNN, or similar systems
  • Proficiency in experiment design and evaluation, connecting offline metrics (AUC, NDCG, calibration) with online A/B test outcomes to drive product decisions
  • Clear, effective communication, collaborating well with product managers, data scientists, engineers, and cross-functional partners
  • A growth and mentorship mindset, helping elevate team quality in modeling, experimentation, and reliability
  • Commitment to responsible and inclusive personalization, ensuring our systems respect learner privacy, fairness, and diverse goals
Job Responsibility:
  • Design and implement personalization models across candidate retrieval, ranking, and post-ranking layers, leveraging user embeddings, contextual signals and content features
  • Develop scalable retrieval and serving systems using architectures such as Two-Tower models, deep ranking networks, and ANN-based vector search for real-time personalization
  • Build and maintain model training, evaluation, and deployment pipelines, ensuring reliability, training–serving consistency, observability, and robust monitoring
  • Partner with Product and Data Science to translate learner objectives (engagement, retention, mastery) into measurable modeling goals and experiment designs
  • Advance evaluation methodologies, contributing to offline metric design (e.g., NDCG, CTR, calibration) and supporting rigorous A/B testing to measure learner and business impact
  • Collaborate with platform and infrastructure teams to optimize distributed training, inference latency, and serving cost in production environments
  • Stay informed on industry and research trends, evaluating opportunities to meaningfully apply them within Quizlet’s ecosystem
  • Mentor junior and mid-level engineers, supporting technical growth, experimentation rigor, and responsible ML practices
  • Champion collaboration, inclusion, curiosity, and data-driven problem solving, contributing to a healthy and productive team culture
What we offer:
  • 20 vacation days
  • Competitive health, dental, and vision insurance (100% employee and 75% dependent PPO, Dental, VSP Choice)
  • Employer-sponsored 401k plan with company match
  • Access to LinkedIn Learning and other resources to support professional growth
  • Paid Family Leave, FSA, HSA, Commuter benefits, and Wellness benefits
  • 40 hours of annual paid time off to participate in volunteer programs of choice
  • Full-time position