We’re looking for an SRE with strong Kafka experience and a deep understanding of SRE best practices. You’ll make hands‑on technical improvements yourself and delegate work effectively to EventBus developers. You’ll collaborate closely with the EventBus, Kafka, Telemetry, and Incident Response teams, while also working independently to improve monitoring, reduce noise, strengthen alerting, and track remediation progress. The role sits at the centre of a global platform used by hundreds of developers, within a fast‑growing, experienced SRE group based in Edinburgh.
Job Responsibilities:
Staying informed on all EventBus incidents, including impact, root cause, detection, and ongoing remediation
Responding to incidents calmly and efficiently, communicating clearly with reporters and partner teams, and recommending remediations based on urgency and impact
Proposing improvements informed by prior incidents, potential risks, and industry standards—e.g., new metrics, SLOs, fallback mechanisms
Leading incident retrospectives and sharing insights with the wider team
Creating and distributing postmortems for high‑impact operational events
Collaborating with developers to write, maintain, and promote runbooks and playbooks
Improving alert quality and reducing alert fatigue by tuning signal‑to‑noise ratios
Designing and implementing automated recovery solutions for known issues
Building a roadmap toward 24/7 availability, rapid failover recovery, self‑detection, and automated resolution of common issues
Helping EventBus users diagnose issues with their own producers and consumers
Requirements:
3+ years in an SRE role, including experience with defining and managing SLOs