Planet DDS is seeking a Manager, Site Reliability Engineering and Incident Management, to lead both our Site Reliability Engineering function and our external incident response function for production operations. To be successful, the manager will need to be self-motivated, communicate clearly, and operate with a sense of urgency in a fast-paced environment.
Job Responsibilities:
Lead and mentor a team of SREs and Incident Managers
Foster a culture of reliability, accountability, and continuous improvement
Collaborate with engineering teams to design resilient platform architectures
Oversee the incident response process for outages and service disruptions
Ensure timely detection, escalation, and resolution of incidents
Drive post-incident reviews (PIRs) and root cause analysis
Implement improvements based on lessons learned to prevent recurrence
Mature and enforce best practices for incident response and runbooks
Automate operational tasks to reduce toil and improve efficiency