We are seeking a Senior Site Reliability Engineer to drive the reliability, scalability, and operational excellence of Network commerce systems, including subscriptions and personalization services. You will collaborate closely with product and engineering teams to enhance system architecture, deployment safety, observability, and overall performance. As part of the centralized Technical Operations team, you will have horizontal responsibility for production reliability, participate in the on-call rotation for critical commerce services, and influence systems that serve millions of users globally.
Job Responsibilities:
Identify systemic reliability risks and implement preventative solutions
Define and maintain SLIs, SLOs, and error budgets aligned with business and user outcomes
Lead incident management, post-incident reviews, and remediation planning
Review and advise on system architecture to improve scalability, availability, and fault isolation
Design strategies for high availability, graceful degradation, and disaster recovery across multi-region environments
Quantify trade-offs between performance, cost, and operational risk
Enhance deployment pipelines and implement automation to reduce risk and accelerate delivery
Apply safe deployment patterns such as canary, blue/green, and progressive delivery
Ensure robust rollback and recovery mechanisms
Build and evolve monitoring, logging, and tracing solutions to provide actionable insights
Collaborate to reduce alert fatigue and improve signal quality
Diagnose performance bottlenecks across infrastructure and applications
Operate cloud-native and containerized workloads at scale
Use Infrastructure as Code tools to deploy and manage resilient platforms
Develop automation frameworks to reduce manual toil and operational risk
Mentor mid-level engineers and advocate SRE best practices across teams
Partner with engineering, product, and security teams to embed reliability into system design
Requirements:
Bachelor's degree in Computer Science, Engineering, or equivalent experience
7+ years in site reliability, production engineering, or systems engineering roles
Strong understanding of distributed systems, consistency models, failure modes, and fault isolation strategies
Hands-on experience with AWS, GCP, or Azure, including multi-region deployments
Proficiency in Kubernetes and large-scale container orchestration
Programming experience in Go, Python, or Java for building automation or reliability systems
Experience designing and operating CI/CD pipelines with deployment safety guardrails
Proven track record leading high-severity incidents and driving systemic remediation
Excellent interpersonal skills with experience influencing cross-team decisions
Nice to have:
Experience with multi-cloud or multi-region resilience architecture
Proficiency in monitoring and observability tools (Prometheus, Grafana, Datadog)
Prior mentorship or technical leadership experience
Familiarity with Infrastructure as Code tools (Terraform, CloudFormation)
Experience using AI-assisted tools for incident analysis, operational efficiency, or observability