Distributed Training Engineer

OpenAI

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

293000.00 - 490000.00 USD / Year

Job Description:

As a Distributed Systems/ML engineer, you will work on improving the training throughput of our internal training framework and enabling researchers to experiment with new ideas. This requires good engineering (for example, designing, implementing, and optimizing state-of-the-art AI models), writing bug-free machine learning code (surprisingly difficult!), and acquiring deep knowledge of the performance of supercomputers. We’re looking for people who love optimizing performance and understanding distributed systems, and who cannot stand having bugs in their code.

Job Responsibility:

  • Collaborate with researchers to enable them to develop systems-efficient video models and architectures
  • Apply the latest techniques to our internal training framework to achieve impressive hardware efficiency for our training runs
  • Profile and optimize our training framework

Requirements:

  • Experience working with multi-modal ML pipelines
  • Strong software engineering skills and proficiency in Python
  • Experience understanding and optimizing training kernels
  • Passionate about understanding stable training dynamics

Nice to have:

  • Love diving deep into systems implementations and understanding their fundamentals in order to improve their performance and maintainability

What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Relocation support for eligible employees
  • Additional taxable fringe benefits, such as charitable donation matching and wellness stipends, may also be provided
  • Equity

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Distributed Training Engineer

Release Train Engineer

Reinventing Geospatial (RGi) is a leading expert in geospatial solutions for Def...
Location:
United States, Chantilly; St. Louis; Gaithersburg; Denver
Salary:
Not provided
Reinventing Geospatial
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s in Computer Science, Software Engineering, or related field with 12–15 years of relevant experience, or Master’s with 10–13 years
  • SAFe RTE, SPC, or equivalent Agile certification
  • 10+ years in Agile program delivery, including 5+ years as a Release Train Engineer or equivalent
  • Proven experience facilitating Agile/SAFe ceremonies across time zones and distributed teams
  • Deep knowledge of SAFe, Lean-Agile, and DevSecOps practices
  • Proficient with Jira, Confluence, and related Agile tools
  • Strong leadership, facilitation, and communication skills across technical and business teams
  • Experience integrating security and DevOps pipelines in high-assurance environments
  • Skilled in risk identification and mitigation to ensure program success
  • Active Top Secret clearance with an ability to obtain SCI access and willingness to obtain CI Polygraph
Job Responsibility:
  • Serve as a servant leader and coach for the Agile Release Train (ART), ensuring alignment with SAFe and Lean-Agile principles
  • Facilitate key ceremonies including PI Planning, ART Syncs, and Inspect & Adapt workshops across distributed teams
  • Lead people management activities — performance reviews, professional development, mentorship, and career growth — while maintaining engagement and morale
  • Partner with Product Owners, System Architects, and Scrum Masters to refine features, prioritize work, remove impediments, and optimize delivery flow
  • Track and report ART metrics and progress to Leidos leadership and NGA stakeholders
  • Champion continuous improvement by identifying process gaps, implementing Lean-Agile practices, and coaching teams on SAFe principles
  • Manage program risks, resolve conflicts, and foster a culture of trust, transparency, and collaboration
  • Mentor Scrum Masters and Agile team members on facilitation, risk management, and DevSecOps practices
  • Collaborate with NGA stakeholders to align delivery goals with mission priorities and compliance requirements
What we offer:
  • 100% paid employee healthcare & dental insurance
  • Paid parental leave
  • 401k with matching
  • Escalating vacation time
  • Referral bonuses
  • Tuition reimbursement
  • Professional development training
  • Free beverages and snacks
  • Weekly catered lunches and breakfast on Fridays
Employment Type:
Full-time

Senior Principal Machine Learning Engineer - LLM Post-Training and Optimization

Atlassian is seeking a highly skilled and experienced Senior Principal Machine L...
Location:
United States, Mountain View
Salary:
243100.00 - 407200.00 USD / Year
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Ph.D. or Master’s degree in Computer Science, Machine Learning, Artificial Intelligence, or a related field
  • 8+ years of experience in machine learning, with a focus on large-scale model development and optimization
  • Deep expertise in LLM and transformer architectures (e.g., GPT, BERT, T5)
  • Strong proficiency in Python and ML frameworks such as PyTorch, JAX, or TensorFlow
  • Experience with distributed training techniques and large-scale data processing pipelines
  • Proven track record of deploying machine learning models in production environments
  • Familiarity with model optimization techniques, including quantization, pruning, and knowledge distillation
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
  • Excellent communication skills and ability to translate technical concepts for diverse audiences
Job Responsibility:
  • Lead the fine-tuning and post-training optimization of large language models (LLMs) for diverse applications
  • Develop and implement techniques for model compression, quantization, pruning, and knowledge distillation to optimize performance and reduce computational costs
  • Conduct research on advanced techniques in transfer learning, reinforcement learning, and prompt engineering for LLMs
  • Design and execute rigorous benchmarking and evaluation frameworks to assess model performance across multiple dimensions
  • Collaborate with infrastructure teams to optimize LLM deployment pipelines, ensuring scalability and efficiency in production environments
  • Stay at the forefront of advancements in LLM technologies, sharing insights, driving innovation within the team, and leading agile development
  • Mentor other team members, facilitate workshops within and across teams, and foster a culture of technical excellence and continuous learning
What we offer:
  • health coverage
  • paid volunteer days
  • wellness resources
Employment Type:
Full-time

Member of Technical Staff, AI Training Infrastructure

As a Training Infrastructure Engineer, you'll design, build, and optimize the in...
Location:
United States, San Mateo
Salary:
175000.00 - 220000.00 USD / Year
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience
  • 3+ years of experience with distributed systems and ML infrastructure
  • Experience with PyTorch
  • Proficiency in cloud platforms (AWS, GCP, Azure)
  • Experience with containerization and orchestration (Docker, Kubernetes)
  • Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)
Job Responsibility:
  • Design and implement scalable infrastructure for large-scale model training workloads
  • Develop and maintain distributed training pipelines for LLMs and multimodal models
  • Optimize training performance across multiple GPUs, nodes, and data centers
  • Implement monitoring, logging, and debugging tools for training operations
  • Architect and maintain data storage solutions for large-scale training datasets
  • Automate infrastructure provisioning, scaling, and orchestration for model training
  • Collaborate with researchers to implement and optimize training methodologies
  • Analyze and improve efficiency, scalability, and cost-effectiveness of training systems
  • Troubleshoot complex performance issues in distributed training environments
What we offer:
  • meaningful equity in a fast-growing startup
  • comprehensive benefits package
Employment Type:
Full-time

Vice President of Product and Engineering

Our client is building the next generation network security platform and is look...
Location:
United States
Salary:
Not provided
80Twenty
Expiration Date:
Until further notice
Requirements:
  • Proven Leadership: 10+ years in product and/or engineering leadership roles, with experience guiding both disciplines in high-growth environments
  • Product Expertise: Deep understanding of customer discovery, product-market fit, and translating vision into detailed product requirements
  • Technical Depth: Strong background in cybersecurity, networking, distributed systems, cryptography, high-speed packet processing, or protocol design
  • Startup DNA: Comfortable working in an early-stage environment—rolling up your sleeves, making fast decisions, and scaling teams from zero to hundreds
  • Builder’s Mindset: Passion for elegant design, user experience, and code quality
  • Communicator & Partner: Ability to convey complex concepts to customers, investors, and cross-functional stakeholders
Job Responsibility:
  • Strategic Technical Leadership: Define and own the product operations strategy and execution across all components
  • Participate in product discovery, market research, and customer engagement to ensure the roadmap aligns with market needs and company objectives
  • Translate customer and market insights into clear product requirements, technical specifications, and user stories
  • Balance near-term deliverables with long-term strategic investments
  • Execution & Delivery: Oversee architecture and technical design to ensure secure, scalable, and simple solutions
  • Drive the full engineering lifecycle from prototype to GA release, ensuring world-class quality and performance
  • Establish DevSecOps practices and continuous delivery pipelines
  • Ensure all core technology meets regulatory, security, and performance requirements
  • Team Building & Culture: Recruit, mentor, and inspire exceptional product managers and engineers
  • Build a culture of collaboration between product and engineering teams that reflects their DNA of simplicity, security, and speed
What we offer:
  • Equity & Ownership: Early equity stake and a leadership seat shaping both product and engineering strategy for a company built for rapid scale and long-term impact
  • Ground Floor Opportunity: Shape a platform aiming to redefine Internet trust and disrupt a $60B+ network security market
  • Cutting-Edge Technology: Work on cryptographic identity, trust-based routing, and AI-powered security at Internet scale
Employment Type:
Full-time

Senior Authorised Engineer-33kV

Great opportunity to join one of the biggest independent network owners in the U...
Location:
United Kingdom
Salary:
60000.00 - 65000.00 GBP / Year
Hedera Hiring Ltd
Expiration Date:
Until further notice
Requirements:
  • HNC, Degree, or equivalent electrical qualifications
  • Thorough knowledge of electrical distribution networks and safety protocols
  • Experience working with IDNO and DNO networks
  • Senior Authorised Engineer status up to 33kV (experience up to 132kV is a plus)
  • Ability to manage operational activities and lead technical projects
  • DBS clearance is required due to site and customer access
Job Responsibility:
  • Operate as a Senior Authorised Engineer up to 33kV, ensuring safety and compliance with distribution network standards
  • Manage and coordinate operational projects related to network maintenance and upgrades
  • Provide technical support and advice to colleagues and customers on electrical network issues
  • Act as a training and development mentor for team members, supporting their skills growth
  • Assist with the rollout of network monitoring systems to improve network visibility and fault detection
  • Maintain a safe, efficient, and customer-focused work environment
What we offer:
  • Car Allowance: £5200
Employment Type:
Full-time

AI Research Engineer, Scaling

As a Research Engineer focused on Scaling, you will design and build robust infr...
Location:
United States, Palo Alto
Salary:
180000.00 - 300000.00 USD / Year
1X Technologies
Expiration Date:
Until further notice
Requirements:
  • Strong programming experience in Python and/or C++
  • Deep intuitive understanding of training and inference speed bottlenecks and scaling laws
  • A mindset aligned with extremely high scaling: belief that scale is foundational to enabling humanoid robotics
  • Degree in Computer Science or a related field
  • Experience with distributed training frameworks (e.g., TorchTitan, DeepSpeed, FSDP/ZeRO), multi-node debugging, and experiment management
  • Proven skills in optimizing inference performance using graph compilers, batching/scheduling, and serving systems like TensorRT or equivalents
  • Familiarity with quantization strategies (PTQ, QAT, INT8/FP8) and tools such as TensorRT and bitsandbytes
  • Experience developing or tuning CUDA or Triton kernels with understanding of hardware-level optimization (vectorization, tensor cores, memory hierarchies)
Job Responsibility:
  • Own and lead scaling of distributed training and inference systems
  • Ensure compute resources are optimized to make data the primary constraint
  • Enable massive training runs (1000+ GPUs) using robot data, with robust fault tolerance, experiment tracking, and distributed operations
  • Optimize inference throughput for datacenter use cases such as world models and diffusion engines
  • Reduce latency and enhance performance for on-device robot policies using techniques such as quantization, scheduling, and distillation
What we offer:
  • Equity
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
Employment Type:
Full-time

Member of Technical Staff - Distributed Training Engineer

Our Training Infrastructure team is building the distributed systems that power ...
Location:
United States, San Francisco; Boston
Salary:
Not provided
Liquid AI
Expiration Date:
Until further notice
Requirements:
  • Hands-on experience building distributed training infrastructure (PyTorch Distributed DDP/FSDP, DeepSpeed ZeRO, Megatron-LM TP/PP)
  • Experience diagnosing performance bottlenecks and failure modes (profiling, NCCL/collectives issues, hangs, OOMs, stragglers)
  • Understanding of hardware accelerators and networking topologies
  • Experience optimizing data pipelines for ML workloads
Job Responsibility:
  • Design and build core systems that make large training runs fast and reliable
  • Build scalable distributed training infrastructure for GPU clusters
  • Implement and tune parallelism/sharding strategies for evolving architectures
  • Optimize distributed efficiency (topology-aware collectives, comm/compute overlap, straggler mitigation)
  • Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
  • Develop checkpointing mechanisms balancing memory constraints with recovery needs
  • Create monitoring, profiling, and debugging tools for training stability and performance
What we offer:
  • Competitive base salary with equity in a unicorn-stage company
  • We pay 100% of medical, dental, and vision premiums for employees and dependents
  • 401(k) matching up to 4% of base pay
  • Unlimited PTO plus company-wide Refill Days throughout the year
Employment Type:
Full-time

Principal Site Reliability Engineer

Location:
United States, Ft. Meade
Salary:
Not provided
CipherLogix
Expiration Date:
Until further notice
Requirements:
  • Fourteen (14) years experience in software development/engineering, including requirements analysis, software development, installation, integration, evaluation, enhancement, maintenance, testing, and problem diagnosis/resolution
  • Ten (10) years experience in system engineering/architecture
  • Ten (10) years experience working with products that support highly distributed, massively parallel computation needs such as HBase, Hadoop, Cloudbase/Accumulo, Bigtable, Cassandra, Scality, etc.
  • At least ten (10) years experience writing software scripts using scripting languages such as Perl, Python, or Ruby for software automation
  • At least four (4) years experience managing and monitoring large cloud systems (>200 nodes). Cloud Systems Administrator or Developer certification
  • Experience in performing and providing technical direction for the development, engineering, interfacing, integration, and testing of complete hardware/software systems to include monitoring technical health of a system, improving organizational processes, implementation of postmortem (failure) analysis and incident management
  • Ten (10) years experience in the cleared environment
  • Ten (10) years demonstrated experience developing software for one of the following: Windows, UNIX, or Linux OS
  • Knowledge and experience with developing distributed storage routing and querying algorithms
  • Experience in developing documentation required to support a program’s technical issues and training situations
Employment Type:
Full-time