Manager, Site Reliability Engineering and Incident Management Jobs

1 Job Offer

Manager, Site Reliability Engineering and Incident Management
Lead our Site Reliability Engineering and Incident Management team in Atlanta. You will drive platform resilience, oversee critical incident response, and mentor a skilled team. This role requires deep cloud expertise and a passion for building reliable, scalable systems in a fast-paced SaaS envi...
Location
United States, Atlanta
Salary
118,000 - 160,000 USD / Year
Planet DDS
Expiration Date
Until further notice
A Manager of Site Reliability Engineering (SRE) and Incident Management is a critical leadership role at the intersection of technology, operations, and business continuity. This professional is responsible for building and leading teams that ensure the reliability, scalability, and performance of critical digital services and platforms. In today's always-on digital economy, the demand for skilled leaders in this field is high, making Manager, Site Reliability Engineering and Incident Management jobs a pivotal career path for those who thrive under pressure and are passionate about operational excellence.

Professionals in this role typically oversee two interconnected functions. The Site Reliability Engineering aspect involves applying software engineering principles to operational and infrastructure challenges. These managers lead SRE teams in designing, building, and maintaining highly available and scalable systems. This includes implementing robust monitoring, alerting, and logging solutions; automating manual operational tasks to reduce toil; and collaborating with development teams to instill reliability best practices from the initial design phase. They champion a culture that balances feature development with system stability, often using Service Level Objectives (SLOs) to guide engineering decisions.

The Incident Management facet focuses on preparing for and responding to service disruptions. The manager owns the incident response process, ensuring it is clear, efficient, and effective. They are responsible for assembling and mentoring incident response teams, establishing escalation procedures, and ensuring timely communication during outages. A core part of their responsibility is to drive the post-incident review process, facilitating blameless root cause analyses and ensuring that actionable improvements are implemented to prevent recurrence. They mature the organization's incident command structure and runbooks, turning chaos into a controlled, learning-oriented process.

Common responsibilities for this position include leading and mentoring a team of SREs and Incident Managers, fostering a culture of accountability and continuous improvement, overseeing capacity planning and disaster recovery testing, and driving initiatives like chaos engineering to proactively uncover system weaknesses. These managers act as a bridge between technical teams, executive leadership, and sometimes customers during major incidents.

Typical skills and requirements for these leadership jobs include extensive experience in SRE, DevOps, or infrastructure roles, with several years in a direct leadership or management capacity. A deep technical understanding of cloud platforms (like AWS, Azure, or GCP), networking fundamentals, and modern software architecture is essential. Equally important are strong soft skills: exceptional communication, collaboration, and the ability to remain calm and decisive during high-stress situations. A background in developing and enforcing operational best practices, along with a proven track record in improving system reliability and incident response times, is typically sought after.

For those seeking to guide organizations through the complexities of modern digital operations, Manager, Site Reliability Engineering and Incident Management jobs offer a challenging and highly impactful career.