We are looking for a Lead Site Reliability Engineer (SRE) with strong experience in managing production systems, distributed architectures, and cloud-native environments. This role focuses on ensuring system reliability, scalability, and performance while driving SRE best practices across teams. You will work closely with engineering and product teams to improve system resilience, automate operations, and lead incident management, while mentoring junior engineers and owning reliability initiatives end-to-end.
Job Responsibilities
Lead troubleshooting and resolution of complex production issues in distributed systems
Drive reliability engineering practices, ensuring high availability and performance of systems
Manage and optimize messaging systems such as Apache Kafka, RabbitMQ, and Redis
Architect, manage, and optimize Kubernetes clusters for scalability and resilience
Manage CI/CD pipelines and drive deployment automation
Implement and maintain monitoring, alerting, and observability using Prometheus, Grafana, and the ELK stack
Lead incident management, root cause analysis (RCA), and post-mortem reviews
Mentor junior engineers and collaborate with cross-functional teams to improve system design and reliability
Requirements
5+ years of experience in SRE / DevOps / Production Engineering roles
Strong expertise in troubleshooting distributed systems and microservices architecture
Hands-on experience with Kafka, RabbitMQ, and Redis
Strong knowledge of Kubernetes and container orchestration
Experience with CI/CD pipelines and deployment automation
Solid understanding of Linux, networking, and cloud platforms (AWS / Azure / GCP)
Experience with Infrastructure as Code (Terraform, Ansible)
Strong scripting skills (Python, Bash, or similar)
Database experience: MySQL / Oracle / MongoDB
Strong problem-solving, ownership mindset, and ability to lead initiatives
What we offer
Impactful Work: Play a key role in ensuring the reliability and scalability of platforms that handle large-scale, real-time communication systems
Tremendous Growth Opportunities: Accelerate your career by leading critical reliability initiatives and working on high-scale distributed systems
Innovative Environment: Work in a fast-paced ecosystem that embraces automation, cloud-native technologies, and continuous improvement