Member of Technical Staff - Site Reliability Engineer, Endor Labs

Endor Labs

Location:
United States, Palo Alto

Category:
IT - Software Development

Contract Type:
Employment contract

Salary:

170,000 - 220,000 USD / Year

Job Description:

As a Site Reliability Engineer at Endor Labs, you’ll play a pivotal role in shaping the reliability, performance, and scalability of our systems. You’ll partner with engineering teams across the company to define and implement best practices that improve operational excellence, reduce incidents, and foster a culture of accountability and continuous improvement.

Job Responsibilities:

Lead the definition and rollout of SRE practices across engineering, including SLAs, SLOs, and error budgets
Design and build monitoring, alerting, and observability frameworks that empower teams to own the reliability of their services
Establish incident response protocols and lead post-incident reviews to drive learning and remediation
Collaborate with product and platform teams to improve system architecture with reliability and performance in mind
Advocate for automation of deployments, scaling, and failover procedures across services
Create tooling and dashboards that give teams visibility into system health, latency, and error rates
Foster a blameless culture and partner closely with engineering leadership to drive a proactive approach to reliability
Champion operational readiness for new services before they go to production
Mentor engineers and help scale reliability thinking across the organization

Requirements:

8+ years of software engineering or infrastructure experience, with 3+ years in an SRE or DevOps capacity
Strong experience designing and scaling production systems in cloud-native environments
Proficiency with observability tooling such as Prometheus, Grafana, Datadog, and OpenTelemetry
Experience setting and managing SLAs/SLOs and driving improvements in reliability metrics
Proficiency in programming or scripting languages such as Go and Python
Experience with container orchestration (Kubernetes, Helm) and infrastructure-as-code (Terraform, Pulumi, etc.)
Familiarity with CI/CD pipelines and deployment strategies
Exceptional communication skills and a collaborative mindset, with the ability to influence and educate across teams
A mindset of ownership, humility, and learning

Additional Information:

Job Posted:
May 20, 2025

Employment Type:

Full-time

Work Type:

On-site work
