Perform routine maintenance: Kafka ecosystem upgrades (controllers, brokers, Connect, and Schema Registry), rolling restarts, etc.
Create and maintain runbooks, runbook automation, and post-incident reports
Optimize performance and resource utilization; benchmark and tune clusters
Support Kafka Connect and Schema Registry services and troubleshoot connector issues
Contribute to CI/CD pipeline improvements for infrastructure and deployment automation
Requirements:
Production-grade Apache Kafka operations experience: managing, maintaining, and upgrading Kafka clusters in production environments with a focus on high availability, disaster recovery, failover, and overall reliability
Proficiency in installing and configuring monitoring systems using Grafana (building dashboards), Prometheus, Splunk, and JMX metrics
Automation and orchestration experience: Terraform, Ansible, Helm, Kubernetes (EKS/AKS/GKE)
Strong Linux system administration experience, including troubleshooting, automation and scripting for efficient infrastructure management
Experience in production support (following ITIL processes), participating in 24x7 on-call rotations, and documenting incidents/postmortems
Experience in supporting JVM tuning, GC analysis, and network and disk I/O diagnostics
Experience in TCP/IP, routing, switching and firewall configurations relevant to Kafka operations
Nice to have:
Deep Kafka performance tuning and capacity planning experience
Knowledge of message delivery semantics and guarantees (at-least-once, exactly-once)