
Research Engineer - Distributed Training

Prime Intellect

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

Not provided

Job Description:

Building Open Superintelligence Infrastructure. Prime Intellect is building the open superintelligence stack - from frontier agentic models to the infrastructure that enables anyone to create, train, and deploy them. We aggregate and orchestrate global compute into a single control plane and pair it with the full RL post-training stack: environments, secure sandboxes, verifiable evals, and our async RL trainer. We enable researchers, startups, and enterprises to run end-to-end reinforcement learning at frontier scale, adapting models to real tools, workflows, and deployment contexts. As a Research Engineer working on Distributed Training, you'll play a crucial role in shaping our technological direction, focusing on our decentralized AI training stack. If you love scaling things and maximizing training efficiency, this role is for you.

Job Responsibility:

  • Lead and participate in novel research to build a massive-scale, highly reliable, and secure decentralized training orchestration solution
  • Optimize the performance, cost, and resource utilization of AI workloads by leveraging the most recent advances in compute and memory optimization techniques
  • Contribute to the development of our open-source libraries and frameworks for distributed model training
  • Publish research in top-tier AI conferences such as ICML and NeurIPS
  • Distill highly technical project outcomes into approachable technical blog posts for our customers and developers
  • Stay up-to-date with the latest advancements in AI/ML infrastructure, tools, and decentralized training research, and proactively identify opportunities to enhance our platform's capabilities and user experience

Requirements:

  • Strong background in AI/ML engineering, with extensive experience in designing and implementing end-to-end pipelines for training and deploying large-scale AI models
  • Deep expertise in distributed training techniques, frameworks (e.g., PyTorch Distributed, DeepSpeed, MosaicML’s LLM Foundry), and tools (e.g. Ray) for optimizing the performance and scalability of AI workloads
  • Experience in large-scale model training, including distributed training techniques such as data, tensor, and pipeline parallelism
  • Solid understanding of MLOps best practices, including model versioning, experiment tracking, and continuous integration/deployment (CI/CD) pipelines
  • Passion for advancing the state-of-the-art in decentralized AI model training and democratizing access to AI capabilities for researchers, developers, and businesses worldwide
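For illustration, the data-parallelism technique named in the requirements above can be sketched with PyTorch's DistributedDataParallel. This is a minimal, hypothetical toy model with a single-process gloo process group, not Prime Intellect's actual stack; real jobs launch one process per GPU via torchrun and use the nccl backend.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group for illustration only; a real run sets
# MASTER_ADDR/MASTER_PORT per cluster and uses one rank per GPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 4)  # toy model standing in for a large network
ddp_model = DDP(model)          # wraps the model; gradients are all-reduced after backward

x = torch.randn(8, 16)          # each rank would see its own shard of the global batch
loss = ddp_model(x).sum()
loss.backward()                 # gradient synchronization happens here (a no-op with one rank)

dist.destroy_process_group()
```

Tensor and pipeline parallelism instead split the model itself across devices; DDP shown here replicates the model and splits the data.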

What we offer:
  • Competitive compensation, including equity incentives, aligning your success with the growth and impact of Prime Intellect
  • Flexible work arrangements, with the option to work remotely or in-person at our offices in San Francisco
  • Visa sponsorship and relocation assistance for international candidates
  • Quarterly team off-sites, hackathons, conferences and learning opportunities
  • Opportunity to work with a talented, hard-working and mission-driven team, united by a shared passion for leveraging technology to accelerate science and AI

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Remote work

Similar Jobs for Research Engineer - Distributed Training

AI Research Engineer, Scaling

As a Research Engineer focused on Scaling, you will design and build robust infr...
Location
United States, Palo Alto
Salary:
180000.00 - 300000.00 USD / Year
1X Technologies
Expiration Date
Until further notice
Requirements
  • Strong programming experience in Python and/or C++
  • Deep intuitive understanding of training and inference speed bottlenecks and scaling laws
  • A mindset aligned with extremely high scaling: belief that scale is foundational to enabling humanoid robotics
  • Degree in Computer Science or a related field
  • Experience with distributed training frameworks (e.g., TorchTitan, DeepSpeed, FSDP/ZeRO), multi-node debugging, and experiment management
  • Proven skills in optimizing inference performance using graph compilers, batching/scheduling, and serving systems like TensorRT or equivalents
  • Familiarity with quantization strategies (PTQ, QAT, INT8/FP8) and tools such as TensorRT and bitsandbytes
  • Experience developing or tuning CUDA or Triton kernels with understanding of hardware-level optimization (vectorization, tensor cores, memory hierarchies)
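As an aside, the PTQ strategy named in the requirements above can be sketched with PyTorch's dynamic quantization. This is a hypothetical toy model for illustration, not 1X's actual policy network:

```python
import torch
import torch.nn as nn

# Toy float32 model standing in for an inference network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic post-training quantization: Linear weights are converted to
# int8 offline; activations are quantized on the fly at inference time,
# so no calibration data is needed.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
out = qmodel(x)  # int8 matmuls internally, float32 output
```

QAT, by contrast, simulates quantization during training so the model learns to tolerate the reduced precision.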
Job Responsibility
  • Own and lead scaling of distributed training and inference systems
  • Ensure compute resources are optimized to make data the primary constraint
  • Enable massive training runs (1000+ GPUs) using robot data, with robust fault tolerance, experiment tracking, and distributed operations
  • Optimize inference throughput for datacenter use cases such as world models and diffusion engines
  • Reduce latency and enhance performance for on-device robot policies using techniques such as quantization, scheduling, and distillation
What we offer
  • Equity
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays

Senior Research Engineer

We are seeking a highly skilled Senior Research Engineer to collaborate closely ...
Location
United States
Salary:
210000.00 - 309000.00 USD / Year
Assembly
Expiration Date
Until further notice
Requirements
  • Strong expertise in the Python ecosystem and major ML frameworks (PyTorch, JAX)
  • Experience with lower-level programming (C++ or Rust preferred)
  • Deep understanding of GPU acceleration (CUDA, profiling, kernel-level optimization)
  • TPU experience is a strong plus
  • Proven ability to accelerate deep learning workloads using compiler frameworks, graph optimizations, and parallelization strategies
  • Solid understanding of the deep learning lifecycle: model design, large-scale training, data processing pipelines, and inference deployment
  • Strong debugging, profiling, and optimization skills in large-scale distributed environments
  • Excellent communication and collaboration skills, with the ability to clearly prioritize and articulate impact-driven technical solutions
Job Responsibility
  • Investigate and mitigate performance bottlenecks in large-scale distributed training and inference systems
  • Develop and implement both low-level (operator/kernel) and high-level (system/architecture) optimization strategies
  • Translate research models and prototypes into highly optimized, production-ready inference systems
  • Explore and integrate inference compilers such as TensorRT, ONNX Runtime, AWS Neuron and Inferentia, or similar technologies
  • Design, test, and deploy scalable solutions for parallel and distributed workloads on heterogeneous hardware
  • Facilitate knowledge transfer and bidirectional support between Research and Engineering teams, ensuring alignment of priorities and solutions
What we offer
  • Competitive equity grants
  • 100% employer-paid benefits
  • Flexibility of being fully remote

Research Engineer, Scaling

As a Research Engineer, Scaling, you will design and build infrastructure to sup...
Location
United States, Palo Alto
Salary:
180000.00 - 300000.00 USD / Year
1X Technologies
Expiration Date
Until further notice
Requirements
  • Strong programming experience in Python and/or C++
  • Deep intuitive understanding of what affects training or inference speed: from bottlenecks to scaling laws
  • A mindset aligned with extremely high scaling: belief that scale is foundational to enabling humanoid robotics
  • Degree in Computer Science or a related field
  • Hands‑on experience with distributed training frameworks (e.g., TorchTitan, DeepSpeed, FSDP/ZeRO), multi‑node debugging, experiment management
  • Proven skills optimizing inference performance: graph compilers, batching/scheduling, serving systems (e.g., using TensorRT or equivalents)
  • Familiarity with quantization strategies (PTQ, QAT, INT8/FP8) and tools like TensorRT and bitsandbytes
  • Experience writing or tuning CUDA or Triton kernels, with an understanding of hardware features like vectorization, tensor cores, and memory hierarchies
Job Responsibility
  • Own and lead scaling of both distributed training and inference systems
  • Ensure compute resources are sufficient so that data, not hardware, is the limiter
  • Enable massive training at scale (1000+ GPUs) on robot data, handling fault tolerance, experiment tracking, distributed operations, and large datasets
  • Optimize inference throughput in datacenter contexts (e.g., for world models and diffusion engines)
  • Reduce latency and optimize performance for on‑device robot policies through techniques like quantization, scheduling, distillation, etc.
What we offer
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays

Research Engineer AI

The role involves conducting high-quality research in AI and HPC, shaping future...
Location
United Kingdom, Bristol
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • A good working knowledge of AI/ML frameworks (at least TensorFlow and PyTorch), as well as data preparation, handling, and lineage control, and model deployment, particularly in distributed environments
  • At least a B.Sc. equivalent in a Science, Technology, Engineering or Mathematical discipline
  • Development experience in compiled languages such as C, C++ or Fortran and experience with interpreted environments such as Python
  • Parallel programming experience, with relevant programming models such as OpenMP, MPI, CUDA, OpenACC, HIP, PGAS languages is highly desirable
Job Responsibility
  • Perform world-class research while also shaping products of the future
  • Enable high performance AI software stacks on supercomputers
  • Provide new environments/abstractions to support application developers to build, deploy, and run AI applications taking advantage of leading-edge hardware at scale
  • Manage modern data-intensive AI training and inference workloads
  • Port and optimize workloads of key research centers like the AI safety institute
  • Support onboarding and scaling of domain-specific applications
  • Foster collaboration with the UK and European research community
What we offer
  • Health & Wellbeing benefits that support physical, financial and emotional wellbeing
  • Career development programs catered to achieving career goals
  • Unconditional inclusion in the workplace
  • Flexibility to manage work and personal needs

Senior Principal Machine Learning Engineer - LLM Post-Training and Optimization

Atlassian is seeking a highly skilled and experienced Senior Principal Machine L...
Location
United States, Mountain View
Salary:
243100.00 - 407200.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements
  • Ph.D. or Master’s degree in Computer Science, Machine Learning, Artificial Intelligence, or a related field
  • 8+ years of experience in machine learning, with a focus on large-scale model development and optimization
  • Deep expertise in LLM and transformer architectures (e.g., GPT, BERT, T5)
  • Strong proficiency in Python and ML frameworks such as PyTorch, JAX, or TensorFlow
  • Experience with distributed training techniques and large-scale data processing pipelines
  • Proven track record of deploying machine learning models in production environments
  • Familiarity with model optimization techniques, including quantization, pruning, and knowledge distillation
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
  • Excellent communication skills and ability to translate technical concepts for diverse audiences
Job Responsibility
  • Lead the fine-tuning and post-training optimization of large language models (LLMs) for diverse applications
  • Develop and implement techniques for model compression, quantization, pruning, and knowledge distillation to optimize performance and reduce computational costs
  • Conduct research on advanced techniques in transfer learning, reinforcement learning, and prompt engineering for LLMs
  • Design and execute rigorous benchmarking and evaluation frameworks to assess model performance across multiple dimensions
  • Collaborate with infrastructure teams to optimize LLM deployment pipelines, ensuring scalability and efficiency in production environments
  • Stay at the forefront of advancements in LLM technologies, sharing insights, driving innovation within the team, and leading agile development
  • Mentor other team members, facilitate workshops within and across teams, and foster a culture of technical excellence and continuous learning
What we offer
  • Health coverage
  • Paid volunteer days
  • Wellness resources

Member of Technical Staff, AI Training Infrastructure

As a Training Infrastructure Engineer, you'll design, build, and optimize the in...
Location
United States, San Mateo
Salary:
175000.00 - 220000.00 USD / Year
Fireworks AI
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience
  • 3+ years of experience with distributed systems and ML infrastructure
  • Experience with PyTorch
  • Proficiency in cloud platforms (AWS, GCP, Azure)
  • Experience with containerization, orchestration (Kubernetes, Docker)
  • Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)
Job Responsibility
  • Design and implement scalable infrastructure for large-scale model training workloads
  • Develop and maintain distributed training pipelines for LLMs and multimodal models
  • Optimize training performance across multiple GPUs, nodes, and data centers
  • Implement monitoring, logging, and debugging tools for training operations
  • Architect and maintain data storage solutions for large-scale training datasets
  • Automate infrastructure provisioning, scaling, and orchestration for model training
  • Collaborate with researchers to implement and optimize training methodologies
  • Analyze and improve efficiency, scalability, and cost-effectiveness of training systems
  • Troubleshoot complex performance issues in distributed training environments
What we offer
  • Meaningful equity in a fast-growing startup
  • Comprehensive benefits package

Vice President of Product and Engineering

Our client is building the next generation network security platform and is look...
Location
United States
Salary:
Not provided
80Twenty
Expiration Date
Until further notice
Requirements
  • Proven Leadership: 10+ years in product and/or engineering leadership roles, with experience guiding both disciplines in high-growth environments
  • Product Expertise: Deep understanding of customer discovery, product-market fit, and translating vision into detailed product requirements
  • Technical Depth: Strong background in cybersecurity, networking, distributed systems, cryptography, high-speed packet processing, or protocol design
  • Startup DNA: Comfortable working in an early-stage environment—rolling up your sleeves, making fast decisions, and scaling teams from zero to hundreds
  • Builder’s Mindset: Passion for elegant design, user experience, and code quality
  • Communicator & Partner: Ability to convey complex concepts to customers, investors, and cross-functional stakeholders
Job Responsibility
  • Strategic Technical Leadership: Define and own the product operations strategy and execution across all components
  • Participate in product discovery, market research, and customer engagement to ensure the roadmap aligns with market needs and company objectives
  • Translate customer and market insights into clear product requirements, technical specifications, and user stories
  • Balance near-term deliverables with long-term strategic investments
  • Execution & Delivery: Oversee architecture and technical design to ensure secure, scalable, and simple solutions
  • Drive the full engineering lifecycle from prototype to GA release, ensuring world-class quality and performance
  • Establish DevSecOps practices and continuous delivery pipelines
  • Ensure all core technology meets regulatory, security, and performance requirements
  • Team Building & Culture: Recruit, mentor, and inspire exceptional product managers and engineers
  • Build a culture of collaboration between product and engineering teams that reflects their DNA of simplicity, security, and speed
What we offer
  • Equity & Ownership: Early equity stake and a leadership seat shaping both product and engineering strategy for a company built for rapid scale and long-term impact
  • Ground Floor Opportunity: Shape a platform aiming to redefine Internet trust and disrupt a $60B+ network security market
  • Cutting-Edge Technology: Work on cryptographic identity, trust-based routing, and AI-powered security at Internet scale

Machine Learning Systems Engineer

We’re looking for a Machine Learning Systems Engineer to strengthen the performa...
Location
United States, Bala Cynwyd (Philadelphia Area), Pennsylvania
Salary:
Not provided
Susquehanna International Group
Expiration Date
Until further notice
Requirements
  • Experience with large-scale ML training pipelines and distributed training frameworks
  • Strong software engineering skills in Python
  • Passion for diving deep into systems implementations and understanding fundamentals to improve their performance and maintainability
  • Experience improving resource efficiency across distributed computing environments by leveraging profiling, benchmarking, and implementing system-level optimizations
Job Responsibility
  • Collaborate with researchers to enable them to develop systems-efficient models and architectures
  • Apply the latest techniques to achieve high hardware efficiency in our internal training runs
  • Create tooling to help researchers distribute their training jobs more effectively
  • Profile and optimize our training runs