This list contains only the countries for which job offers have been published in the selected language (e.g., in the French version, only job offers written in French are displayed, and in the English version, only those in English).
We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, build, and operate highly scalable and reliable cloud-based systems. The ideal candidate will have a strong background in DevOps, distributed systems, and cloud infrastructure, with a focus on automation, observability, and system reliability. This role involves working in a fast-paced environment to ensure system uptime, performance, and operational excellence.
Job Responsibility:
Design, implement, and manage highly available, distributed systems
Maintain and optimize cloud infrastructure (AWS/GCP)
Develop automation scripts using Python, Go, Java, or Bash
Manage containerized environments using Docker and Kubernetes
Define and monitor SLIs, SLOs, and error budgets
Implement monitoring, logging, and alerting solutions
Lead incident management, root cause analysis (RCA), and postmortems
Ensure system security and compliance within operational workflows
Improve system reliability through performance tuning and optimization
Collaborate with engineering teams to enhance deployment and release processes
Create and maintain runbooks, dashboards, and operational documentation
Requirements:
8+ years of experience in SRE, DevOps, or Systems Engineering
Strong expertise in Linux/Unix systems and system internals
Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
Experience designing and operating distributed systems
Hands-on experience with cloud platforms (AWS or GCP)
Experience with Docker and Kubernetes
Strong understanding of monitoring, alerting, and logging concepts
Experience managing SLIs, SLOs, and error budgets
Experience with incident management and RCA processes
Nice to have:
Experience with observability tools (Prometheus, Grafana, Datadog, Splunk, Application Insights)
Experience supporting 24x7 production environments and on-call rotations
Knowledge of chaos engineering and resiliency testing
Experience with canary deployments, feature flags, and progressive delivery