Principal Site Reliability Engineer Job at Palo Alto Networks (Santa Clara)

Principal Site Reliability Engineer

We are looking for a Principal Engineer to join our SDWAN engineering team. You ...

Location

Bulgaria, Sofia

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

6+ years as a DevOps engineer with a passion for technology, strong motivation, and a sense of responsibility
Proficiency in DevOps and Platform Engineering with expertise in AWS, GCP, Terraform, ArgoCD, Kubernetes, and related tools
Experience in developing and maintaining CI/CD pipelines for continuous delivery in agile environments
Skilled in managing cloud infrastructure, particularly with AWS and GCP, and adept in infrastructure as code practices using Terraform/Terragrunt
Demonstrated capability in supporting high-scale SaaS applications, focusing on scalability, reliability, and performance
Excellent written and verbal communication, able to collaborate and rally support
Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
Passion for infrastructure and monitoring as code
Ready to understand and dissect new technology stacks quickly

Job Responsibility

Implement and optimize CI/CD pipelines and cloud infrastructure using our technology stack, ensuring efficient and reliable deployment to production
Participate in the deployment of monitoring and alerting systems to maintain high system performance and reliability
Collaborate with software development and other cross-functional teams to streamline and enhance processes, aiming for efficiency and alignment with business goals
Contribute to the management of the cloud infrastructure, utilizing Infrastructure as Code principles
Participate in on-call rotations to support critical business and production systems

Fulltime

Principal Site Reliability Engineer

We're looking for a site reliability engineer at the intersection of software de...

Location

EMEA (Europe, Middle East, and Africa; country not specified by the employer)

Salary:

Not provided

Copper.co

Expiration Date

Until further notice

Requirements

Experience in designing, analyzing, and troubleshooting distributed systems or micro-services architectures
Established expertise in observability and incident management
Proven experience in driving organizational change
Excellent communication skills, with a systematic problem-solving approach

Job Responsibility

Shape SRE: Define how we think about reliability, observability, and operational excellence
Drive the adoption of SRE principles across the organization
Scale Through Automation: Champion architectural improvements that enhance both system reliability and deployment velocity
Drive Technical Excellence: Engage in and improve the lifecycle of microservices
Lead Through Influence: Partner with engineering and product leadership to embed reliability into our product development lifecycle
Conduct blameless postmortems
Mentor engineers across the organisation on SRE practices

What we offer

35 days of paid time off per annum, inclusive of annual leave and public holidays
Employees also receive one additional day of annual leave for each year of service
Private Health Insurance

Fulltime

Principal Site Reliability Engineer

Substrate powers Microsoft 365. Keeping it up, resilient, and continuously impro...

Location

United States, Multiple Locations

Salary:

142800.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience

Job Responsibility

Incident management excellence: Lead high-severity incident response, debug complex issues, and drive incidents to resolution with clear communication and ownership. Ensure high-quality postmortem reports are created and enforce repair-item SLAs
Improve observability: Enhance telemetry, alerting, and dashboards using One Microsoft tooling to provide actionable insights and reduce detection time
Define and measure reliability: Partner with engineering teams to establish and track SLIs/SLOs for critical scenarios
Live site health reviews: Lead and facilitate live site health review meetings, translating business requirements into metrics and action
Engineering for prevention: Translate learnings into proactive tests, product fixes, rollout guardrails, and automation that reduce risk and improve service health
Reliability drills: Design and execute drills to simulate product failures, validate resilience and recovery, and develop resilience strategies
Define Policy: Draft process and policy documentation for how the organization prepares for, responds to, and prevents incidents

Fulltime

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...

Location

United States, Santa Clara

Salary:

151600.00 - 245300.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
Proficient in Python and/or Go
Expertise in managing applications in a Kubernetes cluster with autoscaling enabled
Experience in Production Engineering, DevOps, or Site Reliability
Expertise in the public cloud (GCP or AWS), especially in GCP
Strong Linux administration, internals, and network troubleshooting
Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
Excellent written and verbal communication, able to collaborate and rally support

Job Responsibility

Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build, and operate reliable, secure Cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate the robust deployment of reliable services
Orchestrate end-to-end monitoring and alerting
Participate with SRE and Dev teams in the on-call rotation
Lead root cause analysis of critical business and production issues

What we offer

Restricted stock units and a bonus

Fulltime

Principal Site Reliability Engineer

Our Site Reliability Engineering group within Enterprise Infrastructure combines...

Location

United States, Westlake; Merrimack

Salary:

Not provided

Fidelity Investments

Expiration Date

Until further notice

Requirements

Bachelor’s degree or higher in a technology-related field (e.g. Engineering, Computer Science) required; master’s degree a plus
5+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale
Strong experience in Cloud development (preferably AWS) and migration skills
Experience with building and operating highly resilient platforms in cloud environments
2-4 years of experience in software development with Python, NodeJS, or Java with a focus on SDLC and automation
Hands-on experience with container orchestration, preferably with Kubernetes
Experience operating and implementing distributed & highly concurrent service-based
Ability to automate with various scripting languages (Python, Shell scripting, etc)
Experience managing systems using infrastructure as code tools (IAM, ARM, Terraform, Chef)
Solid understanding of Cloud Computing and DevOps concepts including CI/CD pipelines

Job Responsibility

Help define and execute a comprehensive reliability and observability strategy, ensuring that Fidelity’s systems are always available when our customers need them
Bring together technical, procedural, and financial data to reduce toil and increase efficiency
Execute plans for technical standardization and process refinement within the engineering organization, especially for Site Reliability Engineers
Coach peer SREs and development teams on how to build highly available systems

Fulltime

Principal Site Reliability Engineer

Are you looking for an exciting chance to boost performance? Come aboard Fidelit...

Location

United States, Merrimack, NH

Salary:

Not provided

Fidelity Investments

Expiration Date

Until further notice

Requirements

Demonstrated expertise in site reliability with a strong background in managing large-scale systems
Experience in automation, monitoring, and incident management
Understanding of cloud infrastructure
Excellent Python scripting skills
Ability to drive projects end-to-end
Extensive knowledge of production on-call support for cloud infrastructure running on the EKS platform
Experience with load balancer traffic distribution (especially Akamai and VMware Avi)
Extensive experience in Change, Incident, and Problem Management and on-call support
Extensive knowledge of observability tools (preferably Datadog) and Grafana
Experience monitoring logs, metrics, APM, events, and infrastructure, including dashboard creation

Job Responsibility

Guarantee the reliability and efficiency of our systems
Spearhead projects to improve FFIO Market Data Services
Work closely with a team of committed experts
Create a substantial influence on our operations

Fulltime

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer you will lead crucial initiatives in the...

Location

United States, Multiple Locations

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
7+ years technical experience working with large-scale cloud or distributed systems
Experience building or scaling incident response programs at organizational or enterprise scope
Background in SRE, production engineering, or platform reliability roles
Track record of reducing customer impact through improved incident handling, tooling, or prevention

Job Responsibility

Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
Coach and help develop a team of Site Reliability Engineers serving as incident responders
Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
Help hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
Communicate clearly and credibly with senior leadership during customer impacting events

Fulltime

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer for the ADEM (Autonomous Digital Experi...

Location

United States, Santa Clara

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

7+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
Familiarity with, and demonstrated proficiency in, code assist and AI productivity tools such as Claude Code, Cursor, Windsurf, or GitHub Copilot to accelerate development and troubleshooting
Expertise in building high-availability, scalable cloud-native applications on GCP (preferred) or AWS
Expertise in configuration management and IaC (Terraform, Helm, Ansible)
Strong proficiency in programming languages like Python, Go, or Java
Deep experience in Kubernetes (GKE/EKS), container networking, and Linux internals
Experience with GitOps principles and tools like GitLab CI and ArgoCD
Familiarity with compliance and security frameworks (FedRAMP, SOC2) and automating policy-as-code
Excellent communication skills, with a "rally support" mindset to collaborate across multi-functional teams
BS or MS in Computer Science, a related field, or equivalent professional/military experience

Job Responsibility

Drive the success of SRE and DevOps through expert contributions in CI/CD and AIOps initiatives, moving the organization toward self-healing infrastructure
Architect "Golden Paths" for service delivery, ensuring that SLOs, error budgets, and automated canary analysis are integrated by default
Design, build, and operate reliable, secure Cloud infrastructure that supports high-scale synthetic monitoring and Real User Monitoring (RUM)
Ensure applications are production-ready, scalable, and resilient, collaborating closely with developers, researchers, and data scientists
Develop tools and automation frameworks that champion Infrastructure as Code (IaC) and Monitoring as Code (MaC)
Lead root cause analysis (RCA) of critical business and production issues, driving improvements that prevent recurrence

Fulltime
