The Core ML (Turbo) team sits at the intersection of efficient inference and post-training / RL systems, building and operating the systems behind Together’s API. This is a research engineering role with direct production impact: you will translate new RL algorithms, scheduling methods, and inference optimizations into production-grade systems that power the API, and success means shipping measurable improvements in latency, throughput, cost, and model quality at scale.
Job Responsibilities:
Advance inference efficiency end-to-end:
- Design and prototype algorithms, architectures, and scheduling strategies for low-latency, high-throughput inference
- Implement and maintain changes in high-performance inference engines
- Profile and optimize performance across GPU, networking, and memory layers (see the profiling sketch after this list)

Unify inference with RL / post-training:
- Design and operate RL and post-training pipelines
- Make RL and post-training workloads more efficient with inference-aware training loops
- Co-design algorithms and infrastructure
- Run ablations and scale-up experiments to understand trade-offs

Own critical systems at production scale:
- Profile, debug, and optimize inference and post-training services under real production workloads
- Drive roadmap items that require real engine modifications
- Establish metrics, benchmarks, and experimentation frameworks

Provide technical leadership (Staff level):
- Set technical direction for cross-team efforts
- Mentor other engineers and researchers
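
To give a concrete flavor of the profiling work above, here is a minimal sketch using PyTorch's torch.profiler to attribute time and memory during a forward pass. The `model` and `batch` names are hypothetical placeholders, not part of Together's stack, and real engine profiling would also cover networking and allocator behavior.

```python
# Minimal profiling sketch: break a forward pass down by CPU/GPU time
# and memory. `model` and `batch` are hypothetical placeholders.
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def profile_forward(model, batch):
    model.eval()
    with torch.no_grad():
        with profile(
            activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            profile_memory=True,  # also track allocator activity
        ) as prof:
            with record_function("inference_forward"):
                model(batch)
    # Rank ops by total GPU time to find the kernels worth optimizing first.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```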
Requirements:
- 3+ years of experience working on ML systems, large-scale model training, inference, or adjacent areas (or equivalent experience via research / open source)
- Advanced degree in Computer Science, EE, or a related field, or equivalent practical experience
- Strong expertise in at least one of the following:
  - Large-scale inference systems (e.g., SGLang, vLLM, FasterTransformer, TensorRT, custom engines, or similar), GPU performance, and distributed serving
  - RL / post-training for LLMs or large models (e.g., GRPO, RLHF/RLAIF, DPO-like methods, reward modeling; see the GRPO sketch after this list)
  - Model architecture design for Transformers or other large neural nets
  - Distributed systems / high-performance computing for ML
- Strong coding ability in Python
- Experience profiling and optimizing performance across GPU, networking, and memory layers
- Track record of impactful work in ML systems, RL, or large-scale model training (papers, open-source projects, or production systems)
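
For illustration, the GRPO method named above replaces a learned value baseline with a group-relative advantage: each prompt gets a group of sampled responses, and each reward is normalized against its own group. Below is a minimal sketch of that one step, with toy rewards standing in for a real reward model or verifier.

```python
# Minimal sketch of GRPO's group-relative advantage. rewards[i, j] is the
# reward of the j-th sampled response to prompt i; the toy values below
# stand in for a real reward model or verifier.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, group_size] -> advantages of the same shape."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)  # no value network needed

rewards = torch.tensor([[0.0, 1.0, 1.0, 0.0],   # prompt 1: two correct responses
                        [1.0, 0.0, 0.0, 0.0]])  # prompt 2: one correct response
print(group_relative_advantages(rewards))
```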
Nice to have:
- Bias toward implementation and shipping: comfortable working from algorithms to engines, able to take a new sampling method, scheduler, or RL update and turn it into a production-grade implementation (see the sampling sketch after this list)
- Solid research foundation in your area(s) of depth: can read new RL / post-training papers, understand their implications for the stack, and design minimal, correct changes
- Operate well as a full-stack problem solver: enjoy collaborating with infra, research, and product teams
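
As an example of taking a sampling method toward an engine, here is a minimal, unbatched sketch of nucleus (top-p) sampling. The `logits` argument is a hypothetical last-step tensor from a model; a production version inside an inference engine would be batched and fused.

```python
# Minimal nucleus (top-p) sampling sketch. `logits` is a hypothetical
# [vocab_size] tensor of next-token logits from a model's last step.
import torch

def sample_top_p(logits: torch.Tensor, top_p: float = 0.9,
                 temperature: float = 1.0) -> int:
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens whose preceding cumulative mass already exceeds top_p;
    # this always keeps at least the single most likely token.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()  # renormalize survivors
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice].item())

next_token = sample_top_p(torch.randn(32000))  # toy vocabulary of 32k tokens
```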