Member of Technical Staff - Distributed Training Engineer

Liquid AI

Location:
United States, San Francisco


Contract Type:
Not provided

Salary:
Not provided

Job Description:

Our Training Infrastructure team is building the distributed systems that power our next-generation Liquid Foundation Models. As we scale, we need to design, implement, and optimize the infrastructure that enables large-scale training. This is a high-ownership training systems role focused on runtime/performance/reliability (not a general platform/SRE role). You’ll work on a small team with fast feedback loops, building critical systems from the ground up rather than inheriting mature infrastructure.

Job Responsibility:

  • Design and build core systems that make large training runs fast and reliable
  • Build scalable distributed training infrastructure for GPU clusters
  • Implement and tune parallelism/sharding strategies for evolving architectures
  • Optimize distributed efficiency (topology-aware collectives, comm/compute overlap, straggler mitigation)
  • Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
  • Develop checkpointing mechanisms balancing memory constraints with recovery needs
  • Create monitoring, profiling, and debugging tools for training stability and performance
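
Several of the bullets above (straggler mitigation, monitoring and debugging tools) reduce to watching per-rank step timings. As a toy, framework-free illustration of the idea (the function name and threshold are our own assumptions, not Liquid AI's code):

```python
from statistics import median

def find_stragglers(step_times, tolerance=1.5):
    """Flag ranks whose step time exceeds `tolerance` x the median.

    step_times: dict mapping rank id -> last step duration in seconds.
    Returns a sorted list of straggler rank ids.
    """
    if not step_times:
        return []
    med = median(step_times.values())
    return sorted(r for r, t in step_times.items() if t > tolerance * med)

# Rank 2 runs ~2x slower than its peers, so it is flagged.
times = {0: 1.02, 1: 0.98, 2: 2.10, 3: 1.05}
print(find_stragglers(times))  # [2]
```

In a real cluster the timings would come from per-rank profiling hooks rather than a hand-built dict, and a flagged rank would trigger deeper diagnosis (thermal throttling, bad NIC, noisy neighbor) rather than an automatic eviction.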

Requirements:

  • Hands-on experience building distributed training infrastructure (PyTorch Distributed DDP/FSDP, DeepSpeed ZeRO, Megatron-LM TP/PP)
  • Experience diagnosing performance bottlenecks and failure modes (profiling, NCCL/collectives issues, hangs, OOMs, stragglers)
  • Understanding of hardware accelerators and networking topologies
  • Experience optimizing data pipelines for ML workloads
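
FSDP and ZeRO, both named above, flatten parameters into one buffer and split it evenly across ranks, padding so every rank holds an equal shard. A minimal sketch of that arithmetic (pure Python, a simplification of what the frameworks do internally):

```python
def shard_layout(num_params: int, world_size: int):
    """Return (padded_total, shard_size, padding) for an even split.

    FSDP/ZeRO-style sharding pads the flattened parameter buffer to a
    multiple of world_size so each rank gets one contiguous, equally
    sized shard (which keeps the all-gather/reduce-scatter uniform).
    """
    shard_size = -(-num_params // world_size)  # ceiling division
    padded_total = shard_size * world_size
    return padded_total, shard_size, padded_total - num_params

# 10 parameters over 4 ranks: each rank holds 3, with 2 padding elements.
print(shard_layout(10, 4))  # (12, 3, 2)
```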

Nice to have:

  • MoE (Mixture of Experts) training experience
  • Large-scale distributed training (100+ GPUs)
  • Open-source contributions to training infrastructure projects

What we offer:
  • Competitive base salary with equity in a unicorn-stage company
  • We pay 100% of medical, dental, and vision premiums for employees and dependents
  • 401(k) matching up to 4% of base pay
  • Unlimited PTO plus company-wide Refill Days throughout the year

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time

Work Type:
Hybrid

Similar Jobs for Member of Technical Staff - Distributed Training Engineer

Member of Technical Staff, AI Training Infrastructure

As a Training Infrastructure Engineer, you'll design, build, and optimize the in...
Location:
United States, San Mateo

Salary:
175000.00 - 220000.00 USD / Year

Fireworks AI

Expiration Date:
Until further notice

Requirements:
  • Bachelor's degree in Computer Science, Computer Engineering, or related field, or equivalent practical experience
  • 3+ years of experience with distributed systems and ML infrastructure
  • Experience with PyTorch
  • Proficiency in cloud platforms (AWS, GCP, Azure)
  • Experience with containerization, orchestration (Kubernetes, Docker)
  • Knowledge of distributed training techniques (data parallelism, model parallelism, FSDP)
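
Data parallelism, named in the last bullet, boils down to averaging gradients across replicas after each backward pass (in PyTorch, an NCCL all-reduce). A framework-free sketch of just the averaging step, using our own toy function name:

```python
def allreduce_mean(per_rank_grads):
    """Simulate the gradient all-reduce of data-parallel training.

    per_rank_grads: list of gradient vectors, one per rank.
    Every rank ends up with the element-wise mean, so all model
    replicas apply an identical update and stay in sync.
    """
    world_size = len(per_rank_grads)
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    mean = [s / world_size for s in summed]
    return [list(mean) for _ in range(world_size)]

grads = [[1.0, 2.0], [3.0, 4.0]]  # two ranks, two-element gradients
print(allreduce_mean(grads))  # [[2.0, 3.0], [2.0, 3.0]]
```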
Job Responsibility:
  • Design and implement scalable infrastructure for large-scale model training workloads
  • Develop and maintain distributed training pipelines for LLMs and multimodal models
  • Optimize training performance across multiple GPUs, nodes, and data centers
  • Implement monitoring, logging, and debugging tools for training operations
  • Architect and maintain data storage solutions for large-scale training datasets
  • Automate infrastructure provisioning, scaling, and orchestration for model training
  • Collaborate with researchers to implement and optimize training methodologies
  • Analyze and improve efficiency, scalability, and cost-effectiveness of training systems
  • Troubleshoot complex performance issues in distributed training environments
What we offer:
  • Meaningful equity in a fast-growing startup
  • Comprehensive benefits package

Employment Type: Full-time

Member of Technical Staff, Cloud Infrastructure

As a Software Engineer on our Cloud Infrastructure team, you'll be at the forefr...
Location:
United States: New York, NY; San Mateo, CA; Redwood City, CA

Salary:
175000.00 - 220000.00 USD / Year

Fireworks AI

Expiration Date:
Until further notice

Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 5+ years of experience designing and building backend infrastructure in cloud environments (e.g., AWS, GCP, Azure)
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, TensorFlow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Strong software development skills in languages like Python, or C++
  • Deep understanding of distributed systems fundamentals: scheduling, orchestration, storage, networking, and compute optimization
Job Responsibility:
  • Architect and build scalable, resilient, and high-performance backend infrastructure to support distributed training, inference, and data processing pipelines
  • Lead technical design discussions, mentor other engineers, and establish best practices for building and operating large-scale ML infrastructure
  • Design and implement core backend services (e.g., job schedulers, resource managers, autoscalers, model serving layers) with a focus on efficiency and low latency
  • Drive infrastructure optimization initiatives, including compute cost reduction, storage lifecycle management, and network performance tuning
  • Collaborate cross-functionally with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions
  • Continuously evaluate and integrate cloud-native and open-source technologies (e.g., Kubernetes, Ray, Kubeflow, MLFlow) to enhance our platform’s capabilities and reliability
  • Own end-to-end systems from design to deployment and observability, with a strong emphasis on reliability, fault tolerance, and operational excellence
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package

Employment Type: Full-time

Member of Technical Staff, Performance Optimization

We're looking for a Software Engineer focused on Performance Optimization to hel...
Location:
United States, San Mateo

Salary:
175000.00 - 220000.00 USD / Year

Fireworks AI

Expiration Date:
Until further notice

Requirements:
  • Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent practical experience
  • 5+ years of experience working on performance optimization or high-performance computing systems
  • Proficiency in CUDA or ROCm and experience with GPU profiling tools (e.g., Nsight, nvprof, CUPTI)
  • Familiarity with PyTorch and performance-critical model execution
  • Experience with distributed system debugging and optimization in multi-GPU environments
  • Deep understanding of GPU architecture, parallel programming models, and compute kernels
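
Reasoning about GPU architecture and kernels, as the last bullet asks, often starts from a roofline check: compare a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's compute-to-bandwidth ratio. A back-of-envelope sketch (the hardware numbers are illustrative assumptions, not any specific GPU's spec sheet):

```python
def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul with fp16 operands.

    FLOPs: 2*m*n*k (multiply + add). Bytes: read A and B, write C,
    assuming each matrix touches memory exactly once (ignores caching).
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def is_compute_bound(intensity, peak_tflops, bandwidth_tbs):
    """Compare intensity against the machine balance point (FLOPs/byte)."""
    return intensity > peak_tflops / bandwidth_tbs

# A large square matmul has high intensity and is compute-bound on a
# hypothetical accelerator with 300 TFLOP/s and 2 TB/s of bandwidth.
ai = matmul_intensity(4096, 4096, 4096)
print(round(ai, 1), is_compute_bound(ai, 300, 2))  # 1365.3 True
```

Kernels that land below the balance point (e.g. elementwise ops, small-batch attention) are memory-bound, which is why fusion and quantization, both mentioned in the responsibilities, pay off.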
Job Responsibility:
  • Optimize system and GPU performance for high-throughput AI workloads across training and inference
  • Analyze and improve latency, throughput, memory usage, and compute efficiency
  • Profile system performance to detect and resolve GPU- and kernel-level bottlenecks
  • Implement low-level optimizations using CUDA, Triton, and other performance tooling
  • Drive improvements in execution speed and resource utilization for large-scale model workloads (LLMs, VLMs, and video models)
  • Collaborate with ML researchers to co-design and tune model architectures for hardware efficiency
  • Improve support for mixed precision, quantization, and model graph optimization
  • Build and maintain performance benchmarking and monitoring infrastructure
  • Scale inference and training systems across multi-GPU, multi-node environments
  • Evaluate and integrate optimizations for emerging hardware accelerators and specialized runtimes
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package

Employment Type: Full-time

Senior Staff Machine Learning Engineer

Help design our AI platform and develop our next generation of machine learning ...
Location:
United States, San Francisco

Salary:
216500.00 - 324500.00 USD / Year

GoFundMe

Expiration Date:
Until further notice

Requirements:
  • 9+ years of hands-on experience in machine learning engineering, AI development, software engineering, or related fields
  • Experience emphasizing secure, large-scale, distributed system design, AI/ML pipeline development, and implementation
  • Extensive experience designing, developing, and operating scalable backend systems
  • Experience applying software engineering best practices such as domain-driven design, event-driven architectures, and microservices
  • Deep expertise in agentic workflows, AI evaluation solutions, prompt management, and secure AI development and testing practices
  • Strong knowledge of relational and document-based databases, data storage paradigms, and efficient RESTful API design
  • Experience establishing robust CI/CD pipelines, automated testing (unit and integration), and deployment practices
  • Strong leadership skills, including effective planning and management of complex projects, mentoring of team members, and fostering a collaborative, high-performing engineering culture
  • Excellent communicator, able to articulate complex technical concepts clearly to both technical and non-technical stakeholders
  • Bachelor's degree in Computer Science, Software Engineering, or a related technical field (preferred)
Job Responsibility:
  • Design and implement AI platforms to enable scalable and secure access to LLMs from multiple model providers for diverse use cases
  • Design and implement agentic workflows, agentic tool ecosystems, and LLM prompt management solutions
  • Design, build, and optimize scalable model training, fine tuning, and inference pipelines, ensuring robust integration with production systems
  • Influence technical strategy and approach to developing embedding stores, vector databases, and other reusable assets
  • Lead initiatives to streamline ML and AI workflows, improve operational efficiency, and establish standardized procedures to achieve consistent, high-quality results across our AI systems
  • Design and develop backend services and RESTful APIs using Python and FastAPI, integrating seamlessly with ML pipelines and services
  • Take operational responsibility for team-owned services, including performance monitoring, optimization, troubleshooting, and participation in an on-call rotation
  • Collaborate with both technical and non-technical colleagues, including data and applied scientists, software engineers, product managers, and business stakeholders, to deliver reliable and scalable ML-driven products
  • Coach and mentor fellow ML engineers, promoting a culture of collaboration, continuous improvement, and engineering excellence within the team
  • Employ a diverse set of tools and platforms including Python, AWS, Databricks, Docker, Kubernetes, FastAPI, Terraform, Snowflake, Coralogix, and GitHub to build, deploy, and maintain scalable, highly available machine learning infrastructure
What we offer:
  • Competitive pay
  • Comprehensive healthcare benefits
  • Financial assistance for things like hybrid work and family planning
  • Generous parental leave
  • Flexible time-off policies
  • Mental health and wellness resources
  • Learning, development, and recognition programs

Employment Type: Full-time

Staff Backend Engineer

OpenSea is the gateway to web3’s next chapter—where NFTs, fungible tokens, and e...
Location:
United States

Salary:
190000.00 - 345000.00 USD / Year

OpenSea

Expiration Date:
Until further notice

Requirements:
  • At least 6 years of experience as a software engineer
  • Strong fluency in event-driven system design patterns, leveraging technologies such as Kafka, Flink, and Debezium, as well as distributed systems more generally
  • Strong fluency in JVM languages like Kotlin or Java
  • Strong fluency and opinionated in database choice and schema design for highly performant and scalable applications
  • Intrinsic interest in leveraging new tools (e.g. AI) to increase work efficiency and quality
  • Passion for (blockchain) technology, NFTs, and the potential of digital ownership is a huge plus
Job Responsibility:
  • Form, communicate, and execute on a technical vision for the next generation of OpenSea’s architecture that leverages highly scalable and event-driven patterns
  • Write reliable, low latency marketplace infrastructure software that will eventually process billions of dollars a day worth of transaction volume, including highly performant blockchain indexing systems, order management systems at scale, and REST & Websocket API endpoints
  • Raise the bar for internal understanding of best practices in building highly performant & event-driven systems with “live” characteristics
  • Mentor and train other team members and act as an internal thought leader and agitator for architectural rigor and cleanliness
What we offer:
  • Health Benefits: We cover 100% Dental/Vision/Medical for employees and 90% for dependents
  • Flexible Time Off Policy
  • Parental Leave: 16 Weeks of Paid Parental Bonding & up to 8 additional weeks for the birthing parent
  • Mental Health: We offer access to Spring Health, covering 8 therapy & 8 coaching sessions per year
  • 11 Company Holidays
  • Fidelity 401K Plan
  • Internet/Mobile Reimbursement Plan
  • Reimbursement or Monthly Snack Delivery
  • Company & Team retreats to get together for fun and collaboration
  • Team Member Co-Working and Gathering Expense

Employment Type: Full-time

Staff Machine Learning Engineer

We are seeking a Staff Machine Learning Engineer to join our Foundation AI team....
Location:
United States, Boston

Salary:
170000.00 - 230000.00 USD / Year

Whoop

Expiration Date:
Until further notice

Requirements:
  • Advanced degree (Master’s or Ph.D.) in Computer Science, Machine Learning, Electrical Engineering, or a related field, or equivalent professional experience
  • 7+ years of experience in applied ML, AI research, or large-scale modeling, with a track record of delivering production systems
  • Expertise in modern deep learning (e.g., transformers, state space models) and multimodal model training
  • Proficiency in Python and deep learning frameworks (e.g., PyTorch, TensorFlow)
  • Experience building and scaling large datasets and training large models in distributed compute environments
  • Strong applied experience with representation learning, self-supervised methods, and fine-tuning for downstream applications
  • Familiarity with MLOps best practices including model versioning, evaluation, CI/CD for ML, and cloud-based compute
  • Excellent communication skills and ability to collaborate cross-functionally with engineers, researchers, and product teams
  • Passion for WHOOP’s mission to improve human performance and extend healthspan through science and technology
Job Responsibility:
  • Design, train, and optimize large-scale multimodal foundation models that integrate wearable sensor data, text, biomarkers, and behavioral data
  • Conduct applied research in self-supervised learning, representation learning, and downstream task fine tuning to advance WHOOP’s core model capabilities
  • Develop scalable, distributed training pipelines for large models on high-performance compute environments
  • Collaborate with MLOps, data engineering, and software engineering teams to operationalize models for production deployment, ensuring robustness, reproducibility, and observability
  • Partner with product and research teams to translate foundation model capabilities into downstream features that deliver meaningful member value
  • Contribute to the technical roadmap and architectural direction for foundation model development at WHOOP
  • Serve as a technical mentor for other data scientists, sharing best practices in deep learning, large-scale training, and multimodal data integration
  • Ensure models adhere to WHOOP’s standards for ethical, transparent, and privacy-preserving AI
What we offer:
  • Competitive base salaries
  • Meaningful equity
  • Benefits
  • Generous equity package

Employment Type: Full-time

Member of Technical Staff, Training Infra Engineer

Contribute in and provide strong support for model training pipelines, ship stat...
Location:
Not provided

Salary:
Not provided

Cohere

Expiration Date:
Until further notice

Requirements:
  • Extremely strong software engineering skills
  • Proficiency in Python and related ML frameworks such as JAX, PyTorch, and XLA/MLIR
  • Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
  • Experience using large-scale distributed training strategies
  • Hands-on experience training large models at scale, including contributions to the tooling and/or setup of the training infrastructure
Job Responsibility:
  • Design and write high-performance, scalable software for training
  • Improve our training setup from an infrastructure and codebase performance standpoint
  • Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure
  • Research, implement, and experiment with ideas on our supercompute and data infrastructure
  • Learn from and work with the best researchers in the field
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)

Employment Type: Full-time

Member of Technical Staff, Integration/RL Team (Research Engineer)

The integration team is responsible for developing and scaling machine learning ...
Location:
Not provided

Salary:
Not provided

Cohere

Expiration Date:
Until further notice

Requirements:
  • Extremely strong software engineering skills
  • Value test-driven development, clean code, and reducing technical debt at all levels
  • Proficiency in Python and related ML frameworks such as JAX, PyTorch, and/or XLA/MLIR
  • Experience using and debugging large-scale distributed training strategies (memory/speed profiling)
  • [Bonus] Experience with distributed training infrastructures (Kubernetes) and associated frameworks (Ray)
  • [Bonus] Hands-on experience with the post-training phase of model training, with a strong emphasis on scalability and performance
  • [Bonus] Experience in ML, LLM and RL academic research
Job Responsibility:
  • Design and write high-performing and scalable software for training models
  • Develop new tools to support and accelerate research and LLM training
  • Coordinate with other engineering teams (Infrastructure, Efficiency, Serving) and the scientific teams (Agent, Multimodal, Multilingual, etc.) to create a strong and integrated post-training ecosystem
  • Craft and implement techniques to improve performance and speed up our training cycles across SFT, offline preference optimization, and the RL regime
  • Research, implement, and experiment with ideas on our cluster and data infrastructure
  • Collaborate, Collaborate, and Collaborate with other scientists, engineers, and teams!
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)

Employment Type: Full-time