The internal infrastructure team is responsible for building world-class infrastructure and tools used to train, evaluate, and serve Cohere's foundational models. By joining our team, you will work in close collaboration with AI researchers to support their cutting-edge AI workloads, with a strong focus on stability, scalability, and observability. You will be responsible for building and operating Kubernetes GPU superclusters across multiple clouds. Your work will directly accelerate the development of the industry-leading AI models that power Cohere's platform, North.
Job Responsibilities:
Build and operate Kubernetes compute superclusters across multiple clouds
Partner with cloud providers to optimize infrastructure costs, performance, and reliability for AI workloads
Work closely with research teams to understand their infrastructure needs and identify ways to improve stability, performance, and efficiency of novel model training techniques
Design and build resilient, scalable systems for training AI models, focusing on intuitive interfaces that empower researchers to troubleshoot and resolve problems on their own
Encourage software best practices across our company and participate in team processes such as knowledge sharing, reviews, and on-call
Requirements:
Have deep experience running Kubernetes clusters at scale and/or scaling and troubleshooting Cloud Native infrastructure, including Infrastructure as Code
Have strong programming skills in Go or Python
Prefer contributing to open-source solutions over building from the ground up
Are self-directed and adaptable, and excel at identifying and solving key problems
Draw motivation from building systems that help others be more productive
See mentorship, knowledge transfer, and review as essential prerequisites for a healthy team
Have excellent communication skills and thrive in fast-paced environments
Nice to have:
You've previously worked with ML training infrastructure and GPU workloads, and you're familiar with RDMA networking
You have expertise in supporting and troubleshooting low-level Linux systems
You have experience collaborating with research teams or machine learning engineers
What we offer:
An open and inclusive culture and work environment
The chance to work closely with a team at the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible work, with offices in Toronto, New York, San Francisco, London, and Paris, as well as a co-working stipend