We are looking for a Site Reliability Engineer who views 'manual effort' as a bug to be fixed. In this role, you won't just be keeping the lights on; you will be the architect of our system’s resilience. We need a proactive engineer who is obsessed with Kubernetes and Cloud infrastructure, but also has a visionary streak—someone eager to experiment with AI-driven operations (AIOps) to predict failures and automate responses. If you enjoy building self-healing systems and staying ahead of the tech curve, this is the place for you.
Job Responsibilities:
Designing and implementing self-healing infrastructure using Kubernetes to maintain high uptime and system integrity
Optimizing our cloud footprint (AWS/GCP/Azure) to ensure our platforms can handle rapid growth
Proactively identifying opportunities to integrate AI tools into our observability stack to automate incident detection and root-cause analysis
Writing clean, efficient code to automate repetitive operational tasks
Building advanced monitoring and alerting frameworks that provide deep insights into system health and performance
Requirements:
Extensive experience managing production-grade K8s environments, including ingress, service mesh, and container security
Deep understanding of cloud networking, storage, and compute services within a major provider (AWS, Azure, or GCP)
Proactive mindset
Active interest in the AI landscape and a desire to leverage LLMs or machine learning to improve SRE workflows
Experience with at least one programming language (such as Java, Python, Go, or Ruby) to bridge the gap between software engineering and operations