We are seeking a skilled Site Reliability Engineer (SRE) to join our team and help build, maintain, and scale cloud‑native infrastructure in Microsoft Azure. This role partners closely with development and operations teams to ensure systems are reliable, scalable, secure, and cost‑efficient. The ideal candidate is passionate about automation, infrastructure‑as‑code, GitOps, and observability, and thrives in a collaborative, fast‑paced environment. You will play a critical role in improving system resilience and establishing strong SRE practices from the ground up.
Job Responsibilities:
Design, implement, and manage Azure cloud infrastructure using Terraform and Terragrunt
Maintain, operate, and optimize Kubernetes clusters on Azure Kubernetes Service (AKS)
Build and manage CI/CD pipelines using GitHub Actions / GitHub Workflows
Implement GitOps-based deployments using ArgoCD
Enhance system reliability by implementing monitoring, alerting, and observability solutions using Grafana
Automate operational tasks to reduce toil and improve team efficiency
Participate in on-call rotations, incident response, root cause analysis, and post-mortems
Partner with development teams to improve application performance, scalability, and resilience
Implement and promote SRE best practices, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets
Continuously improve system performance, security posture, and cloud cost efficiency
Requirements:
3+ years of experience in an SRE, DevOps, or Cloud Infrastructure role
Strong hands-on experience with Microsoft Azure
Infrastructure-as-Code experience using Terraform and Terragrunt
Experience designing and managing cloud-native environments
Proficiency with Kubernetes (preferably AKS)
Experience supporting containerized workloads and orchestration patterns
Exposure to Databricks environments is required
Experience with GitHub Actions / GitHub Workflows
Hands-on experience with ArgoCD and GitOps-based deployment strategies
Solid understanding of Grafana
Hands-on experience with Java in a production or platform context
Nice to have:
Experience with Prometheus
Familiarity with Loki and Tempo
What we offer:
Medical, vision, dental, life, and disability insurance