The Site Reliability Engineer role requires 5–8 years of experience in ensuring system reliability and performance. Candidates should have expertise in observability tools like Splunk and Prometheus, along with cloud platforms such as AWS and Azure. Responsibilities include incident management, automation, and collaboration with engineering teams. Strong scripting skills in Python are essential.
Job Responsibilities:
Implement and maintain observability across metrics, logs, traces, and events
Build and optimize monitoring dashboards and service health indicators using Splunk or similar tools
Configure, fine-tune, and maintain proactive alerts with a high signal-to-noise ratio
Lead incident response, conduct root cause analysis (RCA), and drive long-term corrective measures
Define, measure, and enhance SLIs, SLOs, reliability KPIs, and error budgets
Improve system performance, scalability, and availability across environments
Automate monitoring, alerting, and operational workflows to reduce manual toil
Standardize and maintain telemetry instrumentation across services
Own and optimize logging pipelines, including ingestion, parsing, indexing, and retention
Collaborate with engineering teams to integrate reliability best practices into application development
Participate in on-call rotations and ensure timely incident resolution
Partner with cloud/platform teams to enhance deployment readiness and operational stability
Requirements:
5–8 years of experience in SRE, DevOps, or system reliability roles
Strong hands-on experience with Splunk (queries, dashboards, alerts, ingestion)