Our client is a forward-thinking technology company dedicated to providing robust internal cloud services.
Role & Responsibilities: You will be a key player responsible for the operational excellence and continuous improvement of our client's extensive Kubernetes Container Platform. This role involves ensuring the high availability, security, and performance of a complex infrastructure supporting over 100 clusters and 10,000+ nodes.
Job Responsibilities:
Lead the creation, approval, and execution of operational procedures for node and cluster management
Drive improvements in operational methods and documentation for enhanced efficiency and reliability
Manage and resolve alerts and incidents, ensuring minimal disruption to services
Implement and maintain security requirements through regular OS and middleware updates and patching
Participate in on-call rotations, including late-night releases and monitoring
Provide expert-level support for escalated user inquiries related to the container platform
Requirements:
Minimum of 3 years of experience operating production Kubernetes environments
Proven ability to create, review, and execute detailed operational procedures for large-scale infrastructure
Experience in handling alerts and incidents in a production environment
Willingness to perform on-call duties, including late-night operations
Proficiency in Linux system administration
Strong understanding of infrastructure provisioning and operation tools (e.g., Ansible, Helm)
Familiarity with monitoring tools such as Prometheus and Grafana
Experience with CI/CD pipelines and GitOps practices
A proactive approach to continuous improvement of operational processes