Principal Site Reliability Engineer

United States, Santa Clara · Employment contract · 151600.00 - 245300.00 USD / Year · Job Posted May 27, 2026

Job Description

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest GCP customers. As a Site Reliability Engineer, you will be part of a team supporting the services running on this infrastructure. This includes automation, architecture, performance, metrics, troubleshooting, security, and reliability. Our stack includes Kubernetes, Docker, GCP, AWS, Ansible, Terraform, Vault, GitLab, Spinnaker, Pub/Sub, Bigtable, Memorystore, BigQuery, RabbitMQ, Kafka, MySQL, Python, and Go. We don’t expect you to know all of these, but we do expect you to learn the ones needed for this role.

Job Responsibility

  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build, and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate the robust deployment of services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
  • Mentor and champion SRE culture
  • Participate in design reviews

Requirements

  • BS or MS in Computer Science, a related field, or equivalent professional or military experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
  • Proficient in Python and/or Go
  • Expertise in managing applications in a Kubernetes cluster with autoscaling enabled
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in the public cloud (GCP or AWS), especially in GCP
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Experience with CI/CD pipelines, GitLab, and GitHub preferred
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
  • Excellent written and verbal communication, able to collaborate and rally support
  • Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive
  • Passion for infrastructure and monitoring as code
  • Ready to understand and dissect new technology stacks quickly

Similar Jobs for Principal Site Reliability Engineer

8 matching positions

Principal Site Reliability Engineer

We are looking for a Principal Engineer to join our SDWAN engineering team. You ...
Location: Bulgaria, Sofia
Salary: Not provided
Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 6+ years as DevOps engineer with a passion for technology, strong motivation and responsibility
  • Proficiency in DevOps and Platform Engineering with expertise in AWS, GCP, Terraform, ArgoCD, Kubernetes, and related tools
  • Experience in developing and maintaining CI/CD pipelines for continuous delivery in agile environments
  • Skilled in managing cloud infrastructure, particularly with AWS and GCP, and adept in infrastructure as code practices using Terraform/Terragrunt
  • Demonstrated capability in supporting high-scale SaaS applications, focusing on scalability, reliability, and performance
  • Excellent written and verbal communication, able to collaborate and rally support
  • Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
  • Passion for infrastructure and monitoring as code
  • Ready to understand and dissect new technology stacks quickly
Job Responsibility
  • Implement and optimize CI/CD pipelines and cloud infrastructure using our technology stack, ensuring efficient and reliable deployment to production
  • Participate in the deployment of monitoring and alerting systems to maintain high system performance and reliability
  • Collaborate with software development and other cross-functional teams to streamline and enhance processes, aiming for efficiency and alignment with business goals
  • Contribute to the management of the cloud infrastructure, utilizing Infrastructure as Code principles
  • Participate in on-call rotations to support critical business and production systems
  • Fulltime

Principal Site Reliability Engineer

We're looking for a site reliability engineer at the intersection of software de...
Location: EMEA (Europe, Middle East and Africa; country not specified by employer)
Salary: Not provided
Copper.co
Expiration Date: Until further notice
Requirements
  • Experience in designing, analyzing, and troubleshooting distributed systems or micro-services architectures
  • Established expertise in observability and incident management
  • Proven experience in driving organizational change
  • Excellent communication skills, with a systematic problem-solving approach
Job Responsibility
  • Shape SRE: Define how we think about reliability, observability, and operational excellence
  • Drive the adoption of SRE principles across the organization
  • Scale Through Automation: Champion architectural improvements that enhance both system reliability and deployment velocity
  • Drive Technical Excellence: Engage in and improve the lifecycle of microservices
  • Lead Through Influence: Partner with engineering and product leadership to embed reliability into our product development lifecycle
  • Conduct blameless postmortems
  • Mentor engineers across the organisation on SRE practices
What we offer
  • 35 days of paid time off per annum, inclusive of annual leave and public holidays
  • Employees also receive one additional day of annual leave for each year of service
  • Private Health Insurance
  • Fulltime

Principal Site Reliability Engineer

Substrate powers Microsoft 365. Keeping it up, resilient, and continuously impro...
Location: United States, Multiple Locations
Salary: 142800.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
Job Responsibility
  • Incident management excellence: Lead high-severity incident response, debug complex issues, and drive incidents to resolution with clear communication and ownership. Ensure high-quality postmortem reports are created and enforce repair-item SLAs
  • Improve observability: Enhance telemetry, alerting, and dashboards using One Microsoft tooling to provide actionable insights and reduce detection time
  • Define and measure reliability: Partner with engineering teams to establish and track SLIs/SLOs for critical scenarios
  • Live site health reviews: Lead and facilitate live site health review meetings, translating business requirements into metrics and action
  • Engineering for prevention: Translate learnings into proactive tests, product fixes, rollout guardrails, and automation that reduce risk and improve service health
  • Reliability drills: Design and execute drills to simulate product failures, validate resilience and recovery, and develop resilience strategies
  • Define Policy: Draft process and policy documentation for how the organization prepares for, responds to, and prevents incidents
  • Fulltime

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...
Location: United States, Santa Clara
Salary: 151600.00 - 245300.00 USD / Year
Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • BS or MS in Computer Science, a related field, or equivalent professional or military experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
  • Proficient in Python and/or Go
  • Expertise in managing applications in a Kubernetes cluster with autoscaling enabled
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in the public cloud (GCP or AWS), especially in GCP
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
  • Excellent written and verbal communication, able to collaborate and rally support
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build, and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate the robust deployment of services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
What we offer
  • restricted stock units and a bonus
  • Fulltime

Principal Site Reliability Engineer

Our Site Reliability Engineering group within Enterprise Infrastructure combines...
Location: United States, Westlake; Merrimack
Salary: Not provided
Fidelity Investments
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree or higher in a technology related field (e.g. Engineering, Computer Science, etc.) required, master’s degree a plus
  • 5+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale
  • Strong experience in Cloud development (preferably AWS) and migration skills
  • Experience with building and operating highly resilient platforms in cloud environments
  • 2-4 years of experience in software development with Python, NodeJS, or Java with a focus on SDLC and automation
  • Hands-on experience with container orchestration, preferably with Kubernetes
  • Experience operating and implementing distributed & highly concurrent service-based
  • Ability to automate with various scripting languages (Python, Shell scripting, etc)
  • Experience managing systems using infrastructure as code tools (IAM, ARM, Terraform, Chef)
  • Solid understanding of Cloud Computing and DevOps concepts including CI/CD pipelines
Job Responsibility
  • Help define and execute a comprehensive reliability and observability strategy, ensuring that Fidelity’s systems are always available when our customers need them
  • Bring together technical, procedural, and financial data to reduce toil and increase efficiency
  • Execute plans for technical standardization and process refinement within the engineering organization, especially for Site Reliability Engineers
  • Coach peer SREs and development teams on how to build highly available systems
  • Fulltime

Principal Site Reliability Engineer

Are you looking for an exciting chance to boost performance? Come aboard Fidelit...
Location: United States, Merrimack, NH
Salary: Not provided
Fidelity Investments
Expiration Date: Until further notice
Requirements
  • Demonstrated expertise in site reliability with a strong background in managing large-scale systems
  • Experience in automation, monitoring, and incident management
  • Understand cloud infrastructure
  • Excel in Python scripting
  • Drive projects end-to-end
  • Extensive knowledge of production on-call support for cloud infrastructure running on the EKS platform
  • Experience with load balancer traffic distribution (especially Akamai and VMware Avi)
  • Extensive experience in Change, Incident, and Problem Management and on-call support
  • Extensive knowledge of observability tools (preferably Datadog) and Grafana
  • Experience in monitoring logs, metrics, APM, events, and infrastructure, including dashboard creation
Job Responsibility
  • Guarantee the reliability and efficiency of our systems
  • Spearhead projects to better FFIO Market Data Services
  • Work closely with a team of committed experts
  • Create a substantial influence on our operations
  • Fulltime

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer you will lead crucial initiatives in the...
Location: United States, Multiple Locations
Salary: 139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • 7+ years technical experience working with large-scale cloud or distributed systems
  • Experience building or scaling incident response programs at organizational or enterprise scope
  • Background in SRE, production engineering, or platform reliability roles
  • Track record of reducing customer impact through improved incident handling, tooling, or prevention
Job Responsibility
  • Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
  • Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Coach and help develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Help hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer impacting events
  • Fulltime

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer for the ADEM (Autonomous Digital Experi...
Location: United States, Santa Clara
Salary: Not provided
Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 7+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
  • The candidate must be familiar with and demonstrate proficiency in using code assist and AI productivity tools such as Claude Code, Cursor, Windsurf, or GitHub Copilot to accelerate development and troubleshooting
  • Expertise in building high-availability, scalable cloud-native applications on GCP (preferred) or AWS
  • Expertise in configuration management and IaC (Terraform, Helm, Ansible)
  • Strong proficiency in programming languages like Python, Go, or Java
  • Deep experience in Kubernetes (GKE/EKS), container networking, and Linux internals
  • Experience with GitOps principles and tools like GitLab CI and ArgoCD
  • Familiarity with compliance and security frameworks (FedRAMP, SOC2) and automating policy-as-code
  • Excellent communication skills, with a "rally support" mindset to collaborate across multi-functional teams
  • BS or MS in Computer Science, a related field, or equivalent professional/military experience
Job Responsibility
  • Drive the success of SRE and DevOps through expert contributions in CI/CD and AIOps initiatives, moving the organization toward self-healing infrastructure
  • Architect "Golden Paths" for service delivery, ensuring that SLOs, error budgets, and automated canary analysis are integrated by default
  • Design, build, and operate reliable, secure Cloud infrastructure that supports high-scale synthetic monitoring and Real User Monitoring (RUM)
  • Ensure applications are production-ready, scalable, and resilient, collaborating closely with developers, researchers, and data scientists
  • Develop tools and automation frameworks that champion Infrastructure as Code (IaC) and Monitoring as Code (MaC)
  • Lead root cause analysis (RCA) of critical business and production issues, driving improvements that prevent recurrence
  • Fulltime