Member of Technical Staff - Site Reliability Engineer

Endor Labs (https://www.endorlabs.com)

Location:
United States, Palo Alto

Category:
IT - Software Development

Contract Type:
Employment contract

Salary:

170,000 - 220,000 USD / Year

Job Description:

As a Site Reliability Engineer at Endor Labs, you’ll play a pivotal role in shaping the reliability, performance, and scalability of our systems. You’ll partner with engineering teams across the company to define and implement best practices that improve operational excellence, reduce incidents, and foster a culture of accountability and continuous improvement.

Job Responsibilities:

  • Lead the definition and rollout of SRE practices across engineering, including SLAs, SLOs, and error budgets
  • Design and build monitoring, alerting, and observability frameworks that empower teams to own the reliability of their services
  • Establish incident response protocols and lead post-incident reviews to drive learning and remediation
  • Collaborate with product and platform teams to improve system architecture with reliability and performance in mind
  • Advocate for automation of deployments, scaling, and failover procedures across services
  • Create tooling and dashboards that give teams visibility into system health, latency, and error rates
  • Foster a blameless culture and partner closely with engineering leadership to drive a proactive approach to reliability
  • Champion operational readiness for new services before they go to production
  • Mentor engineers and help scale reliability thinking across the organization

Requirements:

  • 8+ years of software engineering or infrastructure experience, with 3+ years in an SRE or DevOps capacity
  • Strong experience designing and scaling production systems in cloud-native environments
  • Proficiency with observability tooling such as Prometheus, Grafana, Datadog, OpenTelemetry, etc.
  • Experience setting and managing SLAs/SLOs and driving improvements in reliability metrics
  • Proficiency in programming/scripting languages such as Go or Python
  • Experience with container orchestration (Kubernetes, Helm) and infrastructure-as-code (Terraform, Pulumi, etc.)
  • Familiarity with CI/CD pipelines and deployment strategies
  • Exceptional communication skills and a collaborative mindset, with the ability to influence and educate across teams
  • A mindset of ownership, humility, and learning

Additional Information:

Job Posted:
May 20, 2025

Employment Type:
Full-time
Work Type:
On-site work