Cerebras Systems builds the world's largest AI chip, 56 times larger than the largest GPU. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and lets machine learning users run large-scale ML applications without the hassle of managing hundreds of GPUs or TPUs.

We are seeking a highly skilled and experienced AI Infrastructure Operations Engineer to manage and operate our cutting-edge machine learning compute clusters. In this role, you will work with the world's largest computer chip, the Wafer-Scale Engine (WSE), and the systems that harness its unparalleled power. You will play a critical role in ensuring the health, performance, and availability of our infrastructure, maximizing compute capacity, and supporting our growing AI initiatives.
Job Responsibilities:
Manage and operate multiple advanced AI compute infrastructure clusters
Monitor cluster health, proactively identifying and resolving potential issues
Maximize compute capacity through optimization and efficient resource allocation
Deploy, configure, and debug container-based services using Docker
Provide 24/7 monitoring and support, leveraging automated tools and performing hands-on troubleshooting as needed
Handle engineering escalations and collaborate with other teams to resolve complex technical challenges
Contribute to the development and improvement of our monitoring and support processes
Stay up-to-date with the latest advancements in AI compute infrastructure and related technologies
Requirements:
6-8 years of relevant experience in managing and operating complex compute infrastructure, preferably in the context of machine learning or high-performance computing
Strong proficiency in Python scripting for automation and system administration
Deep understanding of Linux-based compute systems and command-line tools
Extensive knowledge of Docker containers and cluster orchestration and scheduling platforms such as Kubernetes and Slurm
Proven ability to troubleshoot and resolve complex technical issues in a timely and efficient manner
Experience with monitoring and alerting systems
Proven track record of owning challenges and driving them to completion
Excellent communication and collaboration skills
Ability to work effectively in a fast-paced environment
Willingness to participate in a 24/7 on-call rotation
Nice to have:
Experience operating large-scale GPU clusters
Knowledge of networking technologies such as Ethernet, RoCE, and TCP/IP
Knowledge of cloud computing platforms (e.g., AWS, GCP, Azure)
Familiarity with machine learning frameworks and tools
Experience with cross-functional team projects
What we offer:
Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open-source your cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
Thrive in a simple, non-corporate work culture that respects individual beliefs