We’re seeking a Staff Site Reliability Engineer to serve as a technical leader within our infrastructure organization. In this role, you’ll help shape the reliability strategy across our engineering teams, drive adoption of best practices, and tackle our most complex infrastructure challenges. You’ll be part of an international, highly engaged, and technical group that is well versed in building enterprise-ready, highly secure software systems.
Job Responsibilities:
Define and drive the technical vision for infrastructure reliability across the organization
Architect large-scale, fault-tolerant systems on AWS using Terraform
Lead cross-functional initiatives to improve system reliability, scalability, and efficiency
Establish standards for infrastructure-as-code, CI/CD, and deployment practices
Design and implement solutions for our most complex operational challenges
Lead incident response for critical outages and drive systemic improvements
Mentor senior engineers and help grow the SRE team’s capabilities
Evaluate and introduce new technologies that improve operational excellence
Influence engineering culture around reliability, observability, and operational maturity
Requirements:
5+ years of experience in SRE, DevOps, or systems engineering, with demonstrated technical leadership
Expert-level knowledge of Terraform, including module design, state management, and scaling IaC across teams
Deep expertise in AWS architecture and services at scale, with a strong focus on ECS
Proven experience designing and operating containerized workloads on ECS, including capacity planning, service scaling, and task placement strategies
Strong experience designing and implementing CI/CD systems with GitHub Actions or similar tools
Track record of leading complex, cross-team technical initiatives
Advanced proficiency in Python, Ruby, JavaScript, or similar languages
Strong understanding of distributed systems principles
Excellent written and verbal communication skills
Proven ability to balance long-term technical strategy with immediate operational needs
Nice to have:
Experience building internal developer platforms or self-service infrastructure tooling
Knowledge of FedRAMP
Background in cost optimization and FinOps practices
Contributions to open-source infrastructure projects
Experience scaling infrastructure organizations and processes
Experience defining and implementing SLO frameworks