CrawlJobs Logo

Principal Site Reliability Engineer (AIOps)

United States, Santa Clara Employment contract 151600.00 - 245300.00 USD / Year · Job Posted April 24, 2026
Apply Position
Job Link Share

Job Description

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest GCP customers. As a Site Reliability Engineer, you will be part of a team supporting the services running on this infrastructure. This includes automation, architecture, performance, metrics, troubleshooting, security, and reliability. Our stack includes Kubernetes, Docker, GCP, AWS, Ansible, Terraform, Vault, Gitlab, Spinnaker, Tensorflow, Datadog, Elasticsearch, Kafka, Hadoop, MySQL, Percona, MongoDB, Python, and Go. We don’t expect you to know all these, but we do expect you to learn the ones needed for this role.

Job Responsibility

  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate robust deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
  • Mentor and champion SRE culture
  • Participate in design reviews

Requirements

  • BS or MS in Computer Science, a related field, or equivalent professional experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in private or public cloud
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Familiarity with CI/CD pipelines, GitLab and GitHub preferred
  • Ability to diagnose and troubleshoot complex distributed systems handling high volume transactions
  • Excellent written and verbal communication, able to collaborate and rally support
  • Self-disciplined, self-managed, self-motivated and strong sense of ownership, urgency, and drive
  • Passion for infrastructure and monitoring as code
  • Ready to understand and dissect new technology stacks quickly

Nice to have

  • GitLab
  • GitHub

What we offer

  • restricted stock units
  • bonus
  • employee benefits

Looking for more opportunities?

Search for other job offers that match your skills and interests.

Similar Jobs for

Principal Site Reliability Engineer (AIOps)

8 matching positions

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer for the ADEM (Autonomous Digital Experi...
Location
Location
United States , Santa Clara
Salary
Salary:
Not provided
paloaltonetworks.com Logo
Palo Alto Networks
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 7+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
  • The candidate must be familiar with and demonstrate proficiency in using code assist and AI productivity tools such as Claude code, Cursor, Windsurf, or GitHub Copilot to accelerate development and troubleshooting
  • Expertise in building high-availability, scalable cloud-native applications on GCP (preferred) or AWS
  • Expertise in configuration management and IaC (Terraform, Helm, Ansible)
  • Strong proficiency in programming languages like Python, Go, or Java
  • Deep experience in Kubernetes (GKE/EKS), container networking, and Linux internals
  • Experience with GitOps principles and tools like GitLab CI and ArgoCD
  • Familiarity with compliance and security frameworks (FedRAMP, SOC2) and automating policy-as-code
  • Excellent communication skills, with a "rally support" mindset to collaborate across multi-functional teams
  • BS or MS in Computer Science, a related field, or equivalent professional/military experience
Job Responsibility
Job Responsibility
  • Drive the success of SRE and DevOps through expert contributions in CI/CD and AIOps initiatives, moving the organization toward self-healing infrastructure
  • Architect "Golden Paths" for service delivery, ensuring that SLOs, error budgets, and automated canary analysis are integrated by default
  • Design, build, and operate reliable, secure Cloud infrastructure that supports high-scale synthetic monitoring and Real User Monitoring (RUM)
  • Ensure applications are production-ready, scalable, and resilient, collaborating closely with developers, researchers, and data scientists
  • Develop tools and automation frameworks that champion Infrastructure as Code (IaC) and Monitoring as Code (MaC)
  • Lead root cause analysis (RCA) of critical business and production issues, driving improvements that prevent recurrence
  • Fulltime
Read More
Arrow Right

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Location
Salary
Salary:
Not provided
groupon.com Logo
Groupon
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems
Read More
Arrow Right

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Location
Peru
Salary
Salary:
Not provided
groupon.com Logo
Groupon
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems
Read More
Arrow Right

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Location
Colombia
Salary
Salary:
Not provided
groupon.com Logo
Groupon
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years in software/systems engineering
  • 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems
Read More
Arrow Right

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Location
Ecuador
Salary
Salary:
Not provided
groupon.com Logo
Groupon
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems
Read More
Arrow Right

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Location
Salary
Salary:
Not provided
groupon.com Logo
Groupon
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems
Read More
Arrow Right

Principal Solutions Consultant

As the Principal Solutions Consultant for APAC, you will be a key strategic lead...
Location
Location
Australia , Sydney
Salary
Salary:
Not provided
https://www.pagerduty.com Logo
PagerDuty
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 15+ years of experience in technical leadership, solutions architecture, or product management within the SaaS, DevOps, or Cloud infrastructure space
  • Proven track record of engaging with and influencing C-level executives at ASX 100 or equivalent global enterprises, navigating complex organizations
  • Proven ability to support and influence sales cycles with solid grasp of sales processes and customer buying journeys
  • Deep domain expertise in at least three of the following: Incident Management, SRE (Site Reliability Engineering), AIOps, Cloud Native Architectures, ITSM/ITIL, or Cybersecurity Incident Response
  • Exceptional public speaking and written communication skills
  • Ability to travel across APAC and occasionally overseas
Job Responsibility
Job Responsibility
  • Act as a peer and advisor to customer CTOs and CIOs, helping them navigate complex digital transformations and operationalizing the PagerDuty Operations Cloud within their organizations
  • Represent PagerDuty as a thought leader at major industry events and through digital content, white-papers, and local media
  • Partner closely with our Global Product and Engineering leadership in San Francisco to ensure the unique requirements of the APAC markets are represented in our long-term roadmap
  • Provide architectural oversight and executive presence for our most complex, high-stakes enterprise opportunities in the ANZ region
  • Foster a local community of DevOps and SRE leaders, sharing best practices on operational excellence, incident response, and the ethical application of AI in operations
  • Stay ahead of the curve on the APAC competitive landscape, identifying market shifts and technical trends that PagerDuty can capitalize on
What we offer
What we offer
  • Competitive salary
  • Comprehensive benefits package
  • Flexible work arrangements
  • Company equity
  • ESPP (Employee Stock Purchase Program)
  • Retirement or pension plan
  • Generous paid vacation time
  • Paid holidays and sick leave
  • Dutonian Wellness Days & HibernationDuty
  • Paid parental leave
  • Fulltime
Read More
Arrow Right
New

Environmental Specialist

Part-time position, 20 hours per week with benefits. Performs technical inspecti...
Location
Location
United States , Canton
Salary
Salary:
Not provided
mma.org Logo
Massachusetts Municipal Association
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in environmental science, soil science, wetland science, or related field and one to three years of experience in a related field
  • or an equivalent combination of education and experience
  • Driver’s license required.
Job Responsibility
Job Responsibility
  • Performs technical inspection work including field visits, attending Commission meetings as required and ensuring compliance with applicable federal, state and local wetlands regulations, codes and bylaws
  • Parttime
Read More
Arrow Right