As a Site Reliability Engineer at Endor Labs, you’ll play a pivotal role in shaping the reliability, performance, and scalability of our systems. You’ll partner with engineering teams across the company to define and implement best practices that improve operational excellence, reduce incidents, and foster a culture of accountability and continuous improvement.
Job Responsibilities:
Lead the definition and rollout of SRE practices across engineering, including SLAs, SLOs, and error budgets
Design and build monitoring, alerting, and observability frameworks that empower teams to own the reliability of their services
Establish incident response protocols and lead post-incident reviews to drive learning and remediation
Collaborate with product and platform teams to improve system architecture with reliability and performance in mind
Advocate for automation of deployments, scaling, and failover procedures across services
Create tooling and dashboards that give teams visibility into system health, latency, and error rates
Foster a blameless culture and partner closely with engineering leadership to drive a proactive approach to reliability
Champion operational readiness for new services before they go to production
Mentor engineers and help scale reliability thinking across the organization
Requirements:
8+ years of software engineering or infrastructure experience, with 3+ years in an SRE or DevOps capacity
Strong experience designing and scaling production systems in cloud-native environments
Proficiency with observability tooling such as Prometheus, Grafana, Datadog, and OpenTelemetry
Experience setting and managing SLAs/SLOs and driving improvements in reliability metrics
Proficiency in programming/scripting languages such as Go and Python
Experience with container orchestration (Kubernetes, Helm) and infrastructure-as-code (Terraform, Pulumi, etc.)
Familiarity with CI/CD pipelines and deployment strategies
Exceptional communication skills and a collaborative mindset, with the ability to influence and educate across teams