Principal Site Reliability Engineer (AIOps)

Palo Alto Networks

Location:
United States, Santa Clara

Contract Type:
Employment contract

Salary:

151,600.00 - 245,300.00 USD / Year

Job Description:

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest GCP customers. As a Site Reliability Engineer, you will be part of a team supporting the services running on this infrastructure. This includes automation, architecture, performance, metrics, troubleshooting, security, and reliability. Our stack includes Kubernetes, Docker, GCP, AWS, Ansible, Terraform, Vault, GitLab, Spinnaker, TensorFlow, Datadog, Elasticsearch, Kafka, Hadoop, MySQL, Percona, MongoDB, Python, and Go. We don’t expect you to know all of these, but we do expect you to learn the ones needed for this role.

Job Responsibility:

  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate the deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
  • Mentor and champion SRE culture
  • Participate in design reviews

Requirements:

  • BS or MS in Computer Science, a related field, or equivalent professional experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, or Helm
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in private or public cloud
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Familiarity with CI/CD pipelines; GitLab and GitHub preferred
  • Ability to diagnose and troubleshoot complex distributed systems handling high volume transactions
  • Excellent written and verbal communication, able to collaborate and rally support
  • Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive
  • Passion for infrastructure and monitoring as code
  • Ready to understand and dissect new technology stacks quickly

Nice to have:

  • GitLab
  • GitHub

What we offer:

  • restricted stock units
  • bonus
  • employee benefits

Additional Information:

Job Posted:
April 24, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Principal Site Reliability Engineer (AIOps)

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Peru
Salary:
Not provided
Groupon
Expiration Date
Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Colombia
Salary:
Not provided
Groupon
Expiration Date
Until further notice
Requirements
  • 10+ years in software/systems engineering
  • 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Ecuador
Salary:
Not provided
Groupon
Expiration Date
Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Salary:
Not provided
Groupon
Expiration Date
Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer for the ADEM (Autonomous Digital Experi...
Location
United States, Santa Clara
Salary:
Not provided
Palo Alto Networks
Expiration Date
Until further notice
Requirements
  • 7+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
  • Familiarity with, and demonstrated proficiency in, code assist and AI productivity tools such as Claude Code, Cursor, Windsurf, or GitHub Copilot to accelerate development and troubleshooting
  • Expertise in building high-availability, scalable cloud-native applications on GCP (preferred) or AWS
  • Expertise in configuration management and IaC (Terraform, Helm, Ansible)
  • Strong proficiency in programming languages like Python, Go, or Java
  • Deep experience in Kubernetes (GKE/EKS), container networking, and Linux internals
  • Experience with GitOps principles and tools like GitLab CI and ArgoCD
  • Familiarity with compliance and security frameworks (FedRAMP, SOC2) and automating policy-as-code
  • Excellent communication skills, with a "rally support" mindset to collaborate across multi-functional teams
  • BS or MS in Computer Science, a related field, or equivalent professional/military experience
Job Responsibility
  • Drive the success of SRE and DevOps through expert contributions in CI/CD and AIOps initiatives, moving the organization toward self-healing infrastructure
  • Architect "Golden Paths" for service delivery, ensuring that SLOs, error budgets, and automated canary analysis are integrated by default
  • Design, build, and operate reliable, secure Cloud infrastructure that supports high-scale synthetic monitoring and Real User Monitoring (RUM)
  • Ensure applications are production-ready, scalable, and resilient, collaborating closely with developers, researchers, and data scientists
  • Develop tools and automation frameworks that champion Infrastructure as Code (IaC) and Monitoring as Code (MaC)
  • Lead root cause analysis (RCA) of critical business and production issues, driving improvements that prevent recurrence
Employment Type:
Full-time

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location
Salary:
Not provided
Groupon
Expiration Date
Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response and incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Audit Interns

Coffra group is one of the first multidisciplinary firms in France deploying suc...
Location
France, Paris
Salary:
Not provided
Coffra Group
Expiration Date
Until further notice
Requirements
  • You are preparing a Master I / Master II, taking a gap year in business school, or enrolled in a CCA / DSCG course, and are ideally looking for a final-year internship
  • You have completed initial internships in Finance/Accounting/Management Control
  • You are looking for a 6-month internship in Audit from October 2026 to March 2027 or from January to June 2027
  • You are available for frequent travel in France
  • You speak English, ideally with some knowledge of German
Job Responsibility
  • Under the supervision of our seniors or managers, you will carry out statutory or contractual audit assignments for an international clientele
  • You will discover the audit profession quickly and completely: audit of simple cycles (fixed assets, purchases/suppliers, sales/customers, bank/financing), circularisations/inventories, analysis of legal documents, assistance in auditing complex cycles, verification of appendices and management reports, interviews with clients, etc.
Employment Type:
Full-time

Onsite Endoscopic Specialist

At KARL STORZ, we are driven by a mission to enhance global health through innov...
Location
United States, Arlington
Salary:
Not provided
KARL STORZ
Expiration Date
Until further notice
Requirements
  • A minimum of high school diploma or equivalent
  • Experience in Sterile Processing, Sales, or other Surgical Technology/Medical role
  • Our successful candidate will have excellent written and spoken English language business communication skills. They will also have demonstrated success working in a collaborative, service-oriented team environment.
  • Effective communicator and collaborator with strong time management
  • Possess exceptional organizational skills and the ability to multi-task
  • Proficient MS Office user, as the role involves working with Excel spreadsheets and reporting
  • Role requires the completion of a drug screening for safety-sensitive positions
  • Must be able to lift/push/pull up to 25 lbs
Job Responsibility
  • Face-to-face customer support, including OR, SPD and Biomed
  • Video tower/system set-up and support
  • Inspection, repair, troubleshooting and replacement of KARL STORZ devices
  • Monitoring, reporting, and facilitating repair/exchange transactions
  • Transporting, cleaning/sterilization and packaging of instruments after use
  • Troubleshoot video and instrument issues in the O.R.
  • Instrument/equipment repair management
What we offer
  • Relocation Support
  • Professional Growth & Development
  • Collaborative & Dynamic Work Environment
  • Access to Cutting-Edge Medical Technologies
  • Medical / Dental / Vision including a state-of-the-art wellness program and pet insurance, too
  • 3 weeks vacation, 11 holidays plus paid sick time
  • Up to 8 weeks of 100% paid company parental leave
  • 401(k) retirement savings plan providing a match of 60% of the employee’s first 6% contribution (up to IRS limits)
  • Section 125 Flexible Spending Accounts
  • Life, STD, LTD & LTC Insurance
Employment Type:
Full-time