Your Career
Palo Alto Networks runs a large infrastructure and is one of the largest GCP customers. As a Principal Site Reliability Engineer, you will be part of a team supporting the services running on this infrastructure for Sovereign Cloud. This includes automation, architecture, performance, observability, troubleshooting, security, and reliability. Our Infrastructure Platform stack includes Terraform, Kubernetes, GitLab CI, ArgoCD, Prometheus, Grafana, Loki, Docker, GCP, AWS, Vault, Kafka, MySQL, Python, Bash, and Go.
Job Responsibilities
Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build, and operate reliable, secure cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate the deployment of robust services
Orchestrate end-to-end monitoring and alerting
Participate in on-call rotations to support critical business and production systems
Lead root cause analysis of critical business and production issues
Requirements
7+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
7+ years building highly available, scalable, cloud-native applications on AWS or GCP
BS or MS in Computer Science or a related field, or equivalent professional or military experience, required
Expertise in configuration management with frameworks such as Ansible, Terraform, or Helm
Expertise in automating infrastructure tasks using Python and shell scripting
Experience in Site Reliability Engineering, Production Engineering, or DevOps
Expertise in public or private cloud environments
Solid experience with Kubernetes and containers
Experience with Linux administration and internals, and with network troubleshooting
Proficiency in programming languages such as Python, Java, and Go, and in shell scripting, to automate tasks
Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
Excellent written and verbal communication skills, with the ability to collaborate and rally support
Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive
Passion for infrastructure and monitoring as code
Able to quickly understand and dissect new technology stacks
Nice to have
Experience with CI/CD pipelines; GitLab and ArgoCD preferred