We’re now on the lookout for a Cloud Site Reliability Engineer to strengthen our Technology team, which delivers robust, scalable, and reliable cloud infrastructure and services. As Cloud Site Reliability Engineer, you’ll work at the heart of our platform operations, ensuring high availability, reliability, and performance of our cloud-based systems. You’ll be responsible for automating infrastructure, implementing resilience strategies, and supporting our global client base with best-in-class reliability engineering. This London-based role collaborates with production support, development, cloud platform, and architecture teams to deliver operational excellence and continuous improvement.
Job Responsibilities:
Manage and optimise cloud infrastructure to ensure high availability and system reliability
Design, deploy, and maintain scalable infrastructure on AWS using Kubernetes, Docker, and Infrastructure as Code (Terraform, CloudFormation)
Implement and automate resilience testing strategies using chaos engineering tools (e.g., AWS Fault Injection Service, Gremlin, Chaos Monkey, LitmusChaos)
Monitor and observe systems using tools such as Prometheus, Grafana, Datadog, New Relic, and Elastic Stack
Automate operational processes using scripting languages (Python, Go, Shell, Ruby, Java)
Participate in incident response, triage, mitigation, and root cause analysis, ensuring minimal downtime and continuous improvement
Develop playbooks for common incidents, reducing Mean Time to Resolution (MTTR)
Design and test disaster recovery strategies, conduct DR drills, and implement multi-region failover and data replication
Define and manage Service Level Objectives (SLOs), Service Level Agreements (SLAs), and Service Level Indicators (SLIs)
Collaborate across teams to improve platform resilience and performance, and mentor others in SRE best practices
Ensure compliance with GBST policies, statutory requirements, and industry standards (e.g., PCI DSS, GDPR, ISO 27001)
Deliver 24/7 support via on-call rotation for after-hours issues
Requirements:
ITIL Foundation Certification
AWS Certified Cloud Practitioner (CCP)
Terraform Associate
Hands-on experience with AWS cloud administration and automation technologies