Senior ML Systems Engineer, Frameworks & Tooling Job at Cohere

Job Description

We’re looking for a senior engineer to help build, maintain and evolve the training framework that powers our frontier-scale language models. This role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You will design and maintain the core components that enable fast, reliable, and scalable model training — and build the tooling that connects research ideas to thousands of GPUs. If you enjoy working across the full stack of ML systems, this role gives you the opportunity and autonomy to have massive impact.

Job Responsibilities

Build and own the training framework responsible for large-scale LLM training
Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing)
Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100)
Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics
Collaborate closely with infra teams to ensure our cluster, container environments, and hardware configurations support high-performance training
Investigate and resolve performance bottlenecks across the ML systems stack
Build robust systems that ensure reproducible, debuggable, large-scale runs

Requirements

Strong engineering experience in large-scale distributed training or HPC systems
Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops
Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar)
Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
Experience working with containerized environments (Docker, Singularity/Apptainer)
A track record of building tools that increase developer velocity for ML teams
Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability
Strong collaboration skills — you’ll work closely with infra, research, and deployment teams

Nice to have

Experience with training LLMs or other large transformer architectures
Contributions to ML frameworks (PyTorch, JAX, DeepSpeed, Megatron, xFormers, etc.)
Familiarity with evaluation and serving frameworks (vLLM, TensorRT-LLM, custom KV caches)
Experience with data pipeline optimization, sharded datasets, or caching strategies
Background in performance engineering, profiling, or low-level systems
Bonus: papers published at top-tier venues (such as NeurIPS, ICML, ICLR, AISTATS, MLSys, JMLR, AAAI, Nature, COLING, ACL, EMNLP)

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)
