We're hiring a Senior Site Reliability Engineer to lead reliability strategy and drive AI-powered automation at scale. This role involves owning complex systems, shaping architecture, and influencing cross-functional teams.
Job Responsibilities:
Define and evolve SLOs, SLIs, and resilience patterns
Build AI-driven automation for detection, remediation, and forecasting
Lead cloud infrastructure and Kubernetes platforms
Drive incident response and operational excellence
Mentor engineers and influence org-wide reliability practices
Requirements:
8+ years of hands-on experience in Site Reliability Engineering, DevOps, or large-scale production operations.
Advanced expertise in AWS, including architecture design across services such as EC2, EKS, VPC, IAM, RDS, S3, and CloudWatch.
Deep experience with Infrastructure-as-Code using Terraform, including complex modules, state management, and governance.
Strong programming and automation skills using Python and Shell, with experience building production-grade automation systems.
Expert-level Linux systems knowledge, including performance tuning, security hardening, and deep troubleshooting.
Proven experience operating distributed systems and data streaming platforms such as Kafka in high-throughput environments.
Demonstrated ability to work independently on complex, ambiguous problems with broad organizational impact.
Proven technical leadership experience driving large, cross-team reliability or infrastructure initiatives, including setting technical direction, influencing design decisions, and mentoring engineers to deliver measurable outcomes at scale.
Practical experience designing or implementing AI/ML-driven automation in operations, reliability, or platform engineering.
Experience integrating AI capabilities into monitoring, alerting, incident response, or workflow automation systems.
Strong understanding of how AI can be safely and effectively applied in production environments.
Nice to have:
Experience with advanced observability platforms (Prometheus, Grafana, ELK, or similar) enhanced with AI-driven insights.
Familiarity with predictive analytics, anomaly detection, or AIOps platforms.
Experience influencing architectural decisions at a platform or product level.
Prior experience operating in a 24/7, global, high-availability SaaS environment.
What we offer:
Competitive compensation, variable bonus and performance-based reward opportunities, and retirement programs
Medical, dental, and vision insurance
Generous, flexible time off, plus paid holidays, wellness days, and a company-wide year-end break
Paid parental leave (including fully paid leave for eligible ZEOs, subject to local policy)
Learning & development stipend to support ongoing growth
Opportunities to volunteer and give back, including charitable donation matching where available