We are seeking an experienced Site Reliability Engineer II to help build, maintain, and scale our cloud‑native (Azure) environment. This role partners closely with development and operations teams to ensure high reliability, scalability, security, and efficiency. The ideal engineer is passionate about automation, observability, cloud infrastructure, and SRE best practices.
Job Responsibilities:
Design, implement, and manage Azure cloud infrastructure using Terraform and Terragrunt
Maintain, monitor, and optimize Kubernetes clusters (AKS)
Build and manage CI/CD pipelines using GitHub Actions/Workflows and ArgoCD in a GitOps model
Enhance reliability through monitoring, alerting, and observability using Grafana (experience with Prometheus, Loki, and Tempo is a plus)
Automate operational tasks to reduce manual toil
Participate in on-call rotations, incident response, and post-mortem reviews
Collaborate with development teams to improve application reliability, performance, and scalability
Implement and advocate for SRE practices including SLIs, SLOs, and error budgets
Continuously improve infrastructure performance, cost efficiency, and security posture
Requirements:
3+ years of experience in SRE, DevOps, or Cloud Infrastructure roles
Strong hands-on experience with Microsoft Azure services
Advanced experience with Terraform and Terragrunt
Proficiency with Kubernetes/AKS and container orchestration
Experience with CI/CD tools including GitHub Actions and ArgoCD
Solid understanding of observability tooling, especially Grafana
Hands-on experience with Java environments (for application debugging and support)