We are seeking a Lead Site Reliability Engineer to join our dynamic Capital Markets Platform Engineering team. This role focuses on designing, building, and maintaining a secure, scalable, and resilient platform infrastructure, with Kubernetes and Google Kubernetes Engine (GKE) as the core technologies. The ideal candidate will have expertise in container orchestration, cloud infrastructure, automation, and CI/CD pipelines.
Requirements:
Kubernetes & GKE Management: Design, deploy, and maintain Kubernetes clusters on GKE and on-premises infrastructure for mission-critical capital markets applications, ensuring high availability, scalability, and compliance with security standards
Platform Automation: Develop Infrastructure-as-Code (IaC) solutions using tools like Terraform, Helm, and Ansible to automate cluster provisioning, scaling, and maintenance
Cloud Infrastructure Management: Architect and manage cloud infrastructure within Google Cloud Platform (GCP) with a focus on cost optimization, network security, and performance
CI/CD Enablement: Implement and optimize CI/CD pipelines for containerized applications using tools like Jenkins, ArgoCD, and GitLab CI, ensuring rapid and secure software delivery
Observability & Monitoring: Implement monitoring, logging, and alerting solutions using tools such as Dynatrace, Prometheus, ELK Stack, and Google Cloud Operations Suite
Security & Compliance: Ensure platforms meet security, regulatory, and compliance requirements for capital markets, including RBAC, service mesh, encryption, and network policies
Collaboration & Mentoring: Work closely with development, security, and operations teams to promote a DevOps culture and mentor junior engineers on platform engineering best practices
Incident Management: Participate in L2/L3 on-call rotations and contribute to incident management, root cause analysis, and proactive risk mitigation