We are looking for a visionary Senior Manager of Site Reliability Engineering to lead our global SRE organization across the US and India. This isn't just a 'keep the lights on' role; you will be the primary architect of our AI-driven Autonomous SRE transformation at Palo Alto Networks. You will bridge the gap between infrastructure products and operational excellence, gathering complex requirements from product teams and translating them into automated, intelligent self-service platform capabilities to ensure our systems are not just reliable, but self-healing.
Job Responsibilities:
Directly manage and scale a high-performing, multi-geographical SRE team (US and India), fostering a culture of psychological safety, continuous learning, and 'operational pride'
Standardize SRE practices globally while respecting local nuances, ensuring 24/7 coverage models (Follow-the-Sun) are seamless and burnout-resistant
Manage the financial aspects of global headcount and cloud infrastructure spend
Drive the Autonomous SRE Roadmap: transition the organization from reactive monitoring to proactive, AI-driven observability and incident remediation, using machine learning to reduce Mean Time to Recovery (MTTR)
Act as the lead consultant for infrastructure product teams to define what 'reliability' looks like for next-gen AI services
Partner with the Platform Engineering team to build and internalize 'Golden Paths' that bake in SLOs, error budgets, and automated canary analysis
Work hand-in-hand with InfoSec and Compliance to automate guardrails (Policy-as-Code) and ensure global data sovereignty requirements are met
Influence R&D leadership to prioritize non-functional requirements and technical debt reduction
Requirements:
10+ years in SRE, Infrastructure or DevOps environments
5+ years managing global teams of 15+ engineers across multiple time zones
Deep understanding of Cloud Native ecosystems (Azure/AWS/GCP), Kubernetes and CI/CD pipelines
Proven track record of implementing ML-driven monitoring (e.g., anomaly detection, automated root cause analysis, event correlation)
Exceptional ability to translate 'deep tech' into business value for C-suite stakeholders
Experience using AI tools like Claude, Gemini or Copilot to build solutions is mandatory