Manager, Site Reliability Engineering and Incident Management Job at Planet DDS (Atlanta)

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
Minimum 2 years of experience managing or leading cloud operations teams
Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
Familiarity with modern CI/CD automation and tools
Excellent communication, stakeholder management, and team-building skills
Experience scaling SRE practices in high-growth or large-scale environments
Ability to balance long-term reliability initiatives with short-term delivery needs.

Job Responsibility

Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
Define and track key reliability metrics, and report on team performance and system health to leadership
Contribute to hiring, onboarding, and career development for SREs.

What we offer

Health & Wellbeing benefits for physical, financial, and emotional wellbeing
Personal & Professional Development programs
Unconditional inclusion in the workplace.

Fulltime

Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team responsible for Private and Public...

Location

Singapore, Singapore

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Bachelor’s degree or equivalent work experience
6+ years of relevant work experience
Highly motivated self-starter with excellent interpersonal and communication skills
Certification or formal training in site reliability engineering concepts and practices
Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
4+ years of experience in Python (preferred) or Java on large-scale systems, alongside Linux-based scripting languages
Experience working on observability, logging and metrics toolsets
Experience with Kubernetes (k8s) and container technologies such as Docker, OpenShift and EKS
Experience with public cloud technologies such as AWS, GCP or Azure
Experience with Secrets products such as HashiCorp Vault or CyberArk

Job Responsibility

Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
Architecting and building tools and platforms that provide capabilities for SRE
Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organisation
Actively owning production-level incidents through to resolution.

What we offer

Equal opportunity employer
Accessibility support for persons with disabilities.

Fulltime

Principal Site Reliability Engineer

Location

United States, Ft. Meade

Salary:

Not provided

CipherLogix

Expiration Date

Until further notice

Requirements

Fourteen (14) years experience in software development/engineering, including requirements analysis, software development, installation, integration, evaluation, enhancement, maintenance, testing, and problem diagnosis/resolution
Ten (10) years experience in system engineering/architecture
Ten (10) years experience working with products that support highly distributed, massively parallel computation needs such as HBase, Hadoop, CloudBase/Accumulo, Bigtable, Cassandra, Scality, etc.
At least ten (10) years experience writing software scripts using scripting languages such as Perl, Python, or Ruby for software automation
At least four (4) years experience managing and monitoring a large cloud system (>200 nodes). Cloud Systems Administrator or Developer Certification
Experience in performing and providing technical direction for the development, engineering, interfacing, integration, and testing of complete hardware/software systems to include monitoring technical health of a system, improving organizational processes, implementation of postmortem (failure) analysis and incident management
Ten (10) years experience in the cleared environment
Ten (10) years demonstrated experience developing software for one of the following: Windows, UNIX, or Linux OS
Knowledge and experience with developing distributed storage routing and querying algorithms
Experience in developing documentation required to support a program’s technical issues and training situations

Fulltime

Staff Site Reliability Engineer

We are looking for a Site Reliability Engineer to own our internal systems infra...

Location

United States, Sunnyvale

Salary:

175000.00 - 250000.00 USD / Year

Figure

Expiration Date

Until further notice

Requirements

Strong experience with Linux/Unix systems administration
Proficiency in programming/scripting
Extensive experience with cloud platforms (Azure, AWS, GCP) and on-prem hardware architectures
Experience designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
Mastery of infrastructure as code (Terraform, CloudFormation, Ansible…)
Familiarity with monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
Experience defining Service Level Objectives (SLO), developing runbooks/incident response plans, facilitating post-mortems and managing systems assets
Ability to work in cross-functional teams with developers, infra, and product teams
Excellent verbal and written communication skills

Job Responsibility

Be the go-to person for mission-critical infrastructure enabling critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing and more
Migrate SaaS to self-hosted solutions to enhance security and reliability
Implement monitoring and alerting systems, and define incident response plans and runbooks
Reduce human workload by automating deployment and scaling
Establish strong relationships with stakeholders to identify infrastructure needs and establish Service Level Objectives
Use a data-driven approach to demonstrate service robustness and track optimization work
Partner with the security team to ensure that security remediations and updates are applied in a timely manner

Fulltime

Site Reliability Engineering Manager

The Wikimedia Foundation is looking for an Engineering Manager to join our SRE t...

Location

United States of America

Salary:

132439.00 - 208378.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

Prior experience managing teams
Prior hands-on experience with software or reliability engineering (within the last 3 years preferred)
Ability to analyze complex systems, troubleshoot issues, and devise effective solutions under pressure
Proficiency in project management methodologies to effectively plan, execute, and track new and existing initiatives
Strong understanding of cloud computing, networking, Linux systems administration, containerization (e.g., Docker, Kubernetes), and infrastructure as code (e.g., Terraform, Ansible) to be able to provide technical support to the team
Aptitude for automation and streamlining of tasks
Communicate effectively in both spoken and written English
Ability to work independently, as an effective part of a globally distributed team
Ability to travel several times a year for occasional in-person meetings
B.S. or M.S. in Computer Science or the equivalent in related work experience

Job Responsibility

Managing one to two globally distributed teams within Wikimedia’s Site Reliability Engineering organization
Providing guidance, mentorship, and support to ensure the team's effectiveness and growth
Working with team members to set individual performance goals, and supporting them in meeting and evolving their goals and career path
Recruiting, hiring, and helping onboard new team members
Triaging incoming workload, maintaining focus on priorities, and setting realistic expectations for both peers and team members
Coordinating and communicating with other members of the Wikimedia product & engineering teams on relevant projects, executing complex projects and contributing to the organizational strategy
Continuously developing the roadmap of the team in alignment with other SRE and Product & Technology teams, and helping to draft and execute the team’s annual and quarterly plans
Project managing new and existing initiatives
Leading the definition, refinement, and execution of the processes through which the team manages and performs work
Leading incident response, diagnosis, and follow-up on system alerts and outages across Wikimedia’s production infrastructure

Fulltime

Senior Site Reliability Engineer

Affirm is reinventing credit to make it more honest and friendly, giving consume...

Location

Spain

Salary:

85000.00 - 115000.00 EUR / Year

Affirm

Expiration Date

Until further notice

Requirements

4+ years of experience designing, developing and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
A track record of developing highly available distributed systems using technologies like AWS, MySQL and Kubernetes
Meaningful experience contributing to or driving parts of the Incident Lifecycle process, enabling actionable insights that improve the quality culture, reliability, resilience, and system performance
4+ years working in a Site Reliability or Production Engineering team
Experience defining a technical plan for the delivery of a significant feature or system component with an elegant, simple and extensible design
Experience making impactful changes in a large code base, having developed a suite of tools and practices that enable you and your team to do so safely
Strong verbal and written communication skills that support effective collaboration with our global engineering team
On-Call Rotation - participation in an on-call rotation is a requirement for this role

Job Responsibility

You will be responsible for owning and delivering quarterly goals for your team, leading engineers on your team through ambiguity to solve open-ended problems, and ensuring that everyone is supported throughout delivery
You will support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience & analytics by participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs
You will proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis
You will support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” & on-call efforts
You will foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
You will help develop talent on your team by providing feedback and guidance, and leading by example

What we offer

Flexible Spending Wallets for tech, food and lifestyle
Away Days - wellness days to take off work and recharge
Learning & Development programs
Parental benefit
Employee Resource & Community Groups
Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount

Fulltime

Staff Site Reliability Engineer

At Ledger, we are looking for an experienced Reliability Engineer to join our SR...

Location

France, Paris

Salary:

Not provided

Ledger

Expiration Date

Until further notice

Requirements

8+ years in cloud engineering at scale, in organizations operating SaaS solutions
Proficiency in working in Unix/Linux environments, Git, Python, Terraform, Kubernetes, AWS cloud solutions and architectures, CI/CD tools, Argo CD, Ansible, configuration management, etc.
Strong knowledge of observability practices, with experience implementing and managing Logging, Monitoring and Alerting frameworks with solutions such as Datadog or Prometheus/Grafana/Loki.
Experience of cross-functional work and the ability to demonstrate a collaborative approach to building key relationships across the organization and defining project scope, goals, plans and deliverables
Customer-focused, with the ability to identify and understand both internal and external customers' needs
Creative problem-solving and analysis skills with an ability to identify, develop, and implement solutions to meet the needs of the business
Excellent presentation and written communication
Ability to deal with ambiguity, high levels of pressure and rapidly changing environments
Engineering degree.

Job Responsibility

Participate in building a DevOps / SRE culture and enable the transition to modern infrastructure management and deployment practices
Participate in building the SRE team roadmap (vision and delivery accountability). Anticipate stakeholder needs and the emergence of game-changing technologies, and challenge scope / deadlines
Perform integration of platform software components
Participate in designing and delivering solutions to improve the availability, scalability, latency, and efficiency of systems
Influence and create standards & best practices in support of service level objectives
Automate key SRE metrics including SLOs/SLAs and error budgets
Provide expert support to our level-2/application support team, to troubleshoot priority incidents, and conduct post-mortems
Apply analytics on past incidents and usage patterns to predict issues and take proactive actions
Ensure control of technical debt and promote quality practices
Apply SRE and chaos engineering approaches across all strategic systems to predict and prevent outages, in coordination with Service Design, and improve solution availability

What we offer

Equity: Employees are the foundation of our success, and we award stock options so you can share in that success as we grow
Flexibility: A hybrid work policy
Social: Annual company outing for Ledgerdary Days, plus frequent social events, snacks and drinks
Medical: Comprehensive health insurance policy offering extensive medical, dental and vision care coverage
Well-being: Personal development, coaching & fitness with our dedicated partners
Vacation: Five weeks of paid leave per year, in addition to national holidays and rest & relaxation (RTT) days
High tech: Access to high performance office equipment and gadgets, including Apple products
Transport: Ledger reimburses part of your preferred means of transportation
Discounts: Employee discount on all our products.

Fulltime

Staff Site Reliability Engineer

Affirm is reinventing credit to make it more honest and friendly, giving consume...

Location

Spain

Salary:

101000.00 - 131000.00 EUR / Year

Affirm

Expiration Date

Until further notice

Requirements

8+ years of experience designing, developing, and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin, and serving as a subject-matter point of reference for them
Extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark and Kubernetes
Track record of managing, driving and improving the Incident Lifecycle process, from live incident management through retrospective and post-incident analysis, to provide actionable insights that enhance overall system reliability, resilience, and performance
7+ years experience in Site Reliability or Production Engineering teams
Experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan
Ability to write high quality code that is easily understood and used by others
Strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization
Equivalent practical experience or a Bachelor’s degree in a related field
Based in Spain for the role

Job Responsibility

Set the technical strategy and vision for your team on a multi-year time scale, and help your team tie it together with critical, business-impacting projects
Collaborate across teams in the product development lifecycle, working with infrastructure, product management, developer experience & analytics to ensure technical sustainability, risks and trade-offs are well understood and managed
Act as a force-multiplier for your team through your definition and advocacy of technical solutions and operational processes
Take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing and alerting in place to support “keep the lights on” & on-call efforts
Foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
Help develop talent on your team by providing feedback and guidance, and leading by example
Participate in an on-call rotation

What we offer

Flexible Spending Wallets for tech, food and lifestyle
Away Days - wellness days to take off work and recharge
Learning & Development programs
Parental benefit
Employee Resource & Community Groups
Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
Visa sponsorship

Fulltime


Manager, Site Reliability Engineering and Incident Management

Planet DDS

Location:
United States, Atlanta

Category:
IT - Software Development

Contract Type:
Not provided

Job Posted:
December 11, 2025


