We are seeking a Senior Software Engineer to join our Security Product team, focused on improving the reliability and resilience of our platform across customer environments. You will be embedded within the engineering team, investigating system outages and failures, identifying recurring patterns, and driving fixes - either independently or in collaboration with service owners. You will work closely with production engineering and SRE teams to build playbooks, conduct post-incident reviews, and ensure problems are properly addressed at their root cause.
Job Responsibilities:
Investigate system outages and production failures across customer environments (SaaS and self-hosted), spanning RabbitMQ, Kubernetes, Docker, Postgres, and cloud infrastructure (AWS, Azure, GCP)
Identify recurring failure patterns and systemic weaknesses from incident data, and drive them to resolution - whether by writing Go code yourself (resilience features, infrastructure fixes, observability) or by collaborating with service owners to prioritize and address reliability gaps
Lead and participate in post-incident reviews - document root causes, corrective actions, and follow through to ensure issues are properly resolved
Collaborate with production engineering and SRE teams to develop and maintain operational playbooks and runbooks that reduce time-to-resolution
Diagnose root causes across the full stack - message queue failures, container lifecycle issues, cloud networking, disk and memory pressure, and deployment topology mismatches
Design and implement data migrations and lifecycle management for infrastructure components such as queue management and vhost operations
Emit and monitor operational metrics to proactively detect infrastructure degradation and measure service reliability
Requirements:
7+ years of experience in software engineering, with at least 3 years focused on debugging and solving infrastructure-level problems in distributed systems
Strong proficiency in Go; familiarity with Python and Helm is a plus
Deep hands-on experience with RabbitMQ or similar message brokers (Kafka, ActiveMQ) - including queue management, clustering, monitoring, and production troubleshooting
Solid working knowledge of Kubernetes (pod lifecycle, resource management, networking, debugging CrashLoopBackOff / OOMKilled scenarios) and Docker
Experience investigating production incidents and conducting post-incident reviews with clear root cause analysis and follow-through
Strong understanding of Linux systems, networking fundamentals, and cloud infrastructure (AWS, Azure, or GCP)
Ability to read and interpret logs, thread dumps, heap dumps, and system metrics to isolate root causes under time pressure
Excellent analytical and problem-solving skills with a methodical approach to debugging
Strong written and verbal communication skills - ability to produce clear incident reports, root cause analyses, and playbooks, and to communicate effectively across engineering, SRE, and customer-facing teams
Nice to have:
Experience with artifact management or software supply chain tools (e.g., JFrog Artifactory, JFrog Xray)
Experience with observability stacks (Prometheus, Grafana, ELK/OpenSearch, Coralogix)
Experience with infrastructure-as-code tools (Terraform, Helm, Ansible)
Prior experience in a customer-facing technical role (escalation engineering, support engineering, or field engineering)
Familiarity with AI-assisted development tools - experience with skills, rules, hooks, and setting up agents for developer workflows