We’re looking for a hands-on Reliability Tech Lead (IC) to own the mission of making Cerebras Inference the most reliable AI service in the world. You will drive reliability strategy and execution across our inference stack, from client SDKs and public-cloud multi-region deployments to wafer-scale systems in specialized data centers. In this role, you will define SLOs and incident-response frameworks, design and implement reliability mechanisms at scale, and partner with hundreds of engineers to ensure our service meets world-class reliability standards.
Job Responsibilities:
Define and drive reliability strategy: establish SLOs and ensure alignment across engineering
Design and implement reliability mechanisms: build and evolve systems for fault detection, graceful degradation, failover, throttling, and recovery across multiple regions and data centers
Lead large-scale incident management: own postmortems, root-cause analysis, and prevention loops for reliability-related incidents
Architect for reliability and observability: influence system design for redundancy, durability, and debuggability
Develop reliability tooling: create internal tools and frameworks for chaos testing, load simulation, and distributed fault injection
Collaborate broadly: work across software, infrastructure, and hardware teams to ensure reliability is embedded into every layer of our inference service
Monitor and communicate reliability metrics: build dashboards and alerts that measure service health and provide actionable insights
Mentor and influence: guide engineers and set best practices for designing, testing, and operating reliable large-scale systems
Requirements:
Bachelor's or master's degree in computer science or a related field
7+ years of experience in backend, infrastructure, or reliability engineering for large-scale distributed systems
Strong programming skills in at least one popular backend programming language such as Python, C++, Go, or Rust
Deep, hard-earned experience with reliability principles: SLO/SLI/SLA design, incident response, and postmortem culture
Excellent communication and cross-functional leadership skills
Nice to have:
Prior experience building large-scale AI infrastructure systems
What we offer:
Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open-source your cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
Enjoy our simple, non-corporate work culture that respects individual beliefs