Member of Technical Staff, Training Performance Engineer Job at Cohere

Member of Technical Staff, Training Infra Engineer

Contribute to and provide strong support for model training pipelines, ship stat...

Location

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

Extremely strong software engineering skills
Proficiency in Python and related ML frameworks such as JAX, PyTorch, and XLA/MLIR
Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
Experience using large-scale distributed training strategies
Hands-on experience training large models at scale and contributing to the tooling and/or setup of the training infrastructure

Job Responsibility

Design and write high-performance, scalable software for training
Improve our training setup from an infrastructure and codebase performance standpoint
Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure
Research, implement, and experiment with ideas on our supercompute and data infrastructure
Learn from and work with the best researchers in the field

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Fulltime

Member of Technical Staff - GPU Performance Engineer

Our models and workflows require performance work that generic frameworks don’t ...

Location

United States, San Francisco; Boston

Salary:

Not provided

Liquid AI

Expiration Date

Until further notice

Requirements

Authored custom CUDA kernels (not only calling cuDNN/cuBLAS)
Strong understanding of GPU architecture and performance: memory hierarchy, warps, shared memory/register pressure, bandwidth vs compute limits
Proficiency with low-level profiling (Nsight Systems/Compute) and performance methodology
Strong C/C++ skills

Job Responsibility

Write high-performance GPU kernels for our novel model architectures
Integrate kernels into PyTorch pipelines (custom ops, extensions, dispatch, benchmarking)
Profile and optimize training and inference workflows to eliminate bottlenecks
Build correctness tests and numerics checks
Build/maintain performance benchmarks and guardrails to prevent regressions
Collaborate closely with researchers to turn promising ideas into shipped speedups

What we offer

Competitive base salary with equity in a unicorn-stage company
We pay 100% of medical, dental, and vision premiums for employees and dependents
401(k) matching up to 4% of base pay
Unlimited PTO plus company-wide Refill Days throughout the year

Fulltime

Member of Technical Staff - Distributed Training Engineer

Our Training Infrastructure team is building the distributed systems that power ...

Location

United States, San Francisco; Boston

Salary:

Not provided

Liquid AI

Expiration Date

Until further notice

Requirements

Hands-on experience building distributed training infrastructure (PyTorch Distributed DDP/FSDP, DeepSpeed ZeRO, Megatron-LM TP/PP)
Experience diagnosing performance bottlenecks and failure modes (profiling, NCCL/collectives issues, hangs, OOMs, stragglers)
Understanding of hardware accelerators and networking topologies
Experience optimizing data pipelines for ML workloads

Job Responsibility

Design and build core systems that make large training runs fast and reliable
Build scalable distributed training infrastructure for GPU clusters
Implement and tune parallelism/sharding strategies for evolving architectures
Optimize distributed efficiency (topology-aware collectives, comm/compute overlap, straggler mitigation)
Build data loading systems that eliminate I/O bottlenecks for multimodal datasets
Develop checkpointing mechanisms balancing memory constraints with recovery needs
Create monitoring, profiling, and debugging tools for training stability and performance

What we offer

Competitive base salary with equity in a unicorn-stage company
We pay 100% of medical, dental, and vision premiums for employees and dependents
401(k) matching up to 4% of base pay
Unlimited PTO plus company-wide Refill Days throughout the year

Fulltime

Member of Technical Staff, High Performance Computing Engineer

Microsoft AI is looking for experienced Member of Technical Staff, High Performa...

Location

United Kingdom, London

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor’s degree in computer science, or related technical field AND 4+ years technical engineering experience with deploying or operating on-premise or cloud high-performance clusters
4+ years of experience working with high-scale training clusters (e.g., NVIDIA InfiniBand clusters, SLURM, Kubernetes, Ray)
4+ years of experience building scalable services on top of public cloud infrastructure like Azure, AWS, or GCP
OR equivalent experience

Job Responsibility

Design, operate, and maintain large-scale HPC environments
Own the deployment, configuration, and day-to-day operation of HPC schedulers (e.g., SLURM, Kubernetes)
Serve as a technical owner for at least one core HPC domain (GPU compute, high-performance storage, networking, or similar)
Develop and maintain automation and tooling using Bash and/or Python
Partner closely with researchers and engineers to support their workloads, troubleshoot cluster usage issues, and triage failed or underperforming jobs
Drive work forward independently by navigating ambiguity and technical roadblocks
Enjoy working in a fast-paced, design-driven product development environment
Embody our Culture and Values

Fulltime

Member of Technical Staff - Post Training - MAI Superintelligence Team

At Microsoft AI, we are on a mission to develop the most cutting-edge algorithms...

Location

United States, Mountain View

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science, Machine Learning, Mathematics, or related technical discipline AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR equivalent experience
Have experience with reward modeling, RL, or other post-training techniques

Job Responsibility

Develop data collection, evaluation, and post-training methods for models
Design hypotheses and experiment plans for rapidly iterating on model performance

Fulltime

Member of Technical Staff - ML Engineer / Scientist (JP Localization)

At Liquid, we’re not just building AI models—we’re redefining the architecture o...

Location

Japan, Tokyo

Salary:

Not provided

Liquid AI

Expiration Date

Until further notice

Requirements

Deep understanding of the Japanese model evaluation landscape and familiarity with Japanese pre-training data sources
Experience using modeling and inference tools such as Hugging Face inference, vLLM, and cloud APIs

Job Responsibility

Identify, collect, and curate diverse high-quality Japanese text, audio, and multimodal datasets
Design methods to synthetically generate or augment Japanese training data when needed
Ensure datasets meet enterprise-grade quality, coverage, and compliance requirements
Train and fine-tune language and vision models to achieve state-of-the-art performance for Japanese enterprise use cases
Adapt existing LFMs for Japanese language, culture, and enterprise-specific workflows
Implement evaluation frameworks to benchmark model quality on Japanese datasets
Design evaluation datasets and metrics for Japanese enterprise applications
Conduct thorough error analysis and iteratively improve model performance
Ensure robustness, fairness, and reliability in Japanese-language outputs

What we offer

Hands-on experience with state-of-the-art technology at a leading AI company
The opportunity to directly shape foundation model performance in one of the world’s most complex and nuanced languages
A collaborative, fast-paced environment where your work drives the next generation of LFMs

Fulltime

Member of Technical Staff, Pre-Training Infrastructure

Microsoft AI is looking for a Member of Technical Staff, Pre-Training Infrastruc...

Location

United States, Mountain View

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
OR equivalent experience
Experience in distributed computing and large-scale systems
Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch
Proven ability to profile, benchmark, and optimize performance-critical systems
Experience in leading technical projects and supporting architectural decisions with data
Experience building infrastructure for large-scale machine learning or generative AI workloads
Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelisms
Track record of contributing to high-performance computing or large-scale AI infrastructure projects

Job Responsibility

Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters
Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems
Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies
Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, AMD, and beyond)
Gather data and insights to develop the pretraining compute roadmap
Care deeply about conversational AI and its deployment
Actively contribute to the development of AI models powering our innovative products
Find solutions to overcome roadblocks and deliver your work to users quickly and iteratively
Enjoy working in a fast-paced, design-driven product development cycle
Embody our Culture and Values

Fulltime

Member of Technical Staff, AI Platform Engineer

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...

Location

United States, Mountain View

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science, or related technical discipline AND 4+ years technical engineering experience with coding in languages including, but not limited to TypeScript, Python, C, C++, C#, Java
OR equivalent experience
Bachelor’s degree in computer science, or related technical discipline AND 6+ years technical engineering experience building web services with coding in languages including, but not limited to: Python, Golang, Java/Scala, Rust
6+ years' experience in building and releasing production software at the platform level
Deep experience with all of the following languages: Golang, Java/Scala, TypeScript (React/Next.js)
Experience in model pretraining, post-training, evaluation, and inference
Experience using machine learning frameworks, including experience using, deploying, and scaling large language models, either personally or professionally
Ability to clearly communicate complex technical concepts to both technical and non-technical stakeholders
Demonstrated interpersonal skills and ability to work closely with cross-functional teams, including product managers, designers, and other engineers
Experience going from zero-to-one as well as working with developed systems

Job Responsibility

Design, develop, and maintain platform-level software solutions
Collaborate with cross-functional teams to integrate AI capabilities into various products
Ensure the reliability, scalability, and performance of platform components
Stay updated with the latest advancements in AI and engineering
Work alongside the technical staff and AI researchers to improve model development flows
Embody our Culture and Values

Fulltime
