Staff Site Reliability Engineer Job at Replit

Job Description

Join our Site Reliability Engineering (SRE) team and help ensure the reliability, scalability, and performance of Replit's infrastructure that serves millions of developers worldwide. As a Staff Site Reliability Engineer, you will bridge the gap between development and operations, implementing automation and establishing best practices that enable our platform to scale efficiently while maintaining high availability.

Job Responsibilities

Architect and Implement Observability
Define and Drive Reliability Standards
Lead Incident Management and Response
Drive Automation and Infrastructure as Code
Optimize Performance on Kubernetes
Debug and Harden Distributed Systems
Provide Staff-Level Guidance
Educate and Mentor
Build and Integrate

Requirements

8-10 years of experience in Site Reliability Engineering or similar roles (e.g., DevOps, Systems Engineering, Infrastructure Engineering)
Strong programming skills in languages like Python or Go
Deep understanding of distributed systems
Deep experience with container orchestration platforms, specifically Kubernetes, and cloud-native technologies
Proven track record of designing, implementing, and maintaining sophisticated monitoring and observability solutions
Strong incident management skills with extensive experience leading incident response for complex systems
Experience with infrastructure as code (e.g., Terraform, Pulumi) and configuration management tools
Excellent written and verbal communication skills
Strong interpersonal skills, with experience working with and mentoring engineers
A willingness to dive into understanding, debugging, and improving any layer of the stack
Passionate about making software creation accessible and empowering the next generation of builders

Nice to have

Deep experience with Google Cloud Platform (GCP) services and tools
Expert-level knowledge of modern observability platforms (e.g., Prometheus, Grafana, Datadog, OpenTelemetry)
Experience designing and building reliable systems capable of handling high throughput and low latency
Significant experience with Go and Terraform
Familiarity with working in rapid-growth, startup environments
Experience writing company-facing blog posts and training materials

What we offer

Competitive Salary & Equity
401(k) Program with a 4% match
Health, Dental, Vision and Life Insurance
Short Term and Long Term Disability
Paid Parental, Medical, and Caregiver Leave
Commuter Benefits
Monthly Wellness Stipend
Autonomous Work Environment
In-Office Set-Up Reimbursement
Flexible Time Off (FTO) + Holidays
Quarterly Team Gatherings
In-Office Amenities
