Cerebras Systems builds the world's largest AI chip, 56 times larger than the largest GPU. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and lets machine learning users run large-scale ML applications without the hassle of managing hundreds of GPUs or TPUs.

We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. In this role, you will work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power. You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives.
Job Responsibilities:
Manage and operate multiple advanced AI compute infrastructure clusters
Monitor cluster health, proactively identifying and resolving potential issues
Maximize compute capacity through optimization and efficient resource allocation
Deploy, configure, and debug container-based services using Docker
Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed
Handle engineering escalations and collaborate with other teams to resolve complex technical challenges
Contribute to the development and improvement of our monitoring and support processes
Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies
Requirements:
6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing
Strong proficiency in Python scripting for automation and system administration
Deep understanding of Linux-based compute systems and command-line tools
Extensive knowledge of Docker containers and cluster orchestration and scheduling platforms such as Kubernetes and Slurm
Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner
Experience with monitoring and alerting systems
Proven track record of owning challenges and driving them to completion
Excellent communication and collaboration skills
Ability to work effectively in a fast-paced environment
Willingness to participate in a 24/7 on-call rotation
Nice to have:
Experience operating large-scale GPU clusters
Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP
Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure)
Familiarity with machine learning frameworks and tools
Experience with cross-functional team projects
What we offer:
Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open-source your cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
Thrive in a simple, non-corporate work culture that respects individual beliefs