Principal Site Reliability Engineer Job at Groupon

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...

Location

United States, Santa Clara

Salary:

151600.00 - 245300.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
Proficient in Python and/or Go
Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
Experience in Production Engineering, DevOps, or Site Reliability
Expertise in the public cloud (GCP or AWS), especially in GCP
Strong Linux administration, internals, and network troubleshooting
Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
Experience with CI/CD pipelines, GitLab, and GitHub preferred
Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions

Job Responsibility

Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build, and operate reliable, secure Cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate robust deployment of robust services
Orchestrate end-to-end monitoring and alerting
Participate with SRE and Dev teams in the on-call rotation
Lead root cause analysis of critical business and production issues

Fulltime

Principal Site Reliability Engineer

Microsoft Substrate is the foundational cloud platform that powers many of Micro...

Location

United States, Redmond

Salary:

142800.00 - 304200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience
Candidates must be able to meet Microsoft, customer and/or government security screening requirements required for this role
This role requires access to Microsoft Government cloud environments, including GCC Moderate (GCCM), GCC High (GCCH), and Department of Defense (DoD) environments
The successful candidate must be able to obtain and maintain the appropriate background investigations and customer screenings required for access to these environments
For access to GCCH and DoD environments, this role requires the ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
For access to GCCM environments, this role requires the ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
For manager-level roles, a Tier 5 (T5) background investigation is preferred

Job Responsibility

Define and drive reliability strategy, SLO frameworks, and operational best practices across Substrate workloads in highly regulated environments
Serve as an actively engaged senior on-call engineer (OCE), participating in on-call rotations and leading incident response for Substrate services in regulated environments
Provide hands-on leadership during the most complex or high-impact incidents, setting technical direction and response strategy
Drive high-quality post-incident reviews that result in durable, systemic engineering improvements across teams
Architect and deliver large-scale automation, observability, and self-healing solutions
Drive architectural decisions and define software engineering standards that make reliability, security, and compliance intrinsic to Substrate services
Influence service design and engineering decisions across organizational boundaries
Mentor senior and principal engineers and shape the long-term technical direction of the SRE discipline
Represent Substrate SRE perspectives with senior leadership and cross-functional partners

Fulltime

Principal Site Reliability Engineer

We're looking for a site reliability engineer at the intersection of software de...

Location

EMEA (Europe, Middle East, and Africa; country not specified by the employer)

Salary:

Not provided

Copper.co

Expiration Date

Until further notice

Requirements

Experience in designing, analyzing, and troubleshooting distributed systems or microservices architectures
Established expertise in observability and incident management
Proven experience in driving organizational change
Excellent communication skills, with a systematic problem-solving approach

Job Responsibility

Shape SRE: Define how we think about reliability, observability, and operational excellence
Drive the adoption of SRE principles across the organization
Scale Through Automation: Champion architectural improvements that enhance both system reliability and deployment velocity
Drive Technical Excellence: Engage in and improve the lifecycle of microservices
Lead Through Influence: Partner with engineering and product leadership to embed reliability into our product development lifecycle
Conduct blameless postmortems
Mentor engineers across the organisation on SRE practices

What we offer

35 days of paid time off per annum, inclusive of annual leave and public holidays
Employees also receive one additional day of annual leave for each year of service
Private Health Insurance

Fulltime

Principal Site Reliability Engineer

Substrate powers Microsoft 365. Keeping it up, resilient, and continuously impro...

Location

United States, Multiple Locations

Salary:

142800.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience

Job Responsibility

Incident management excellence: Lead high-severity incident response, debug complex issues, and drive incidents to resolution with clear communication and ownership. Ensure high-quality postmortem reports are created, and enforce repair-item SLAs
Improve observability: Enhance telemetry, alerting, and dashboards using One Microsoft tooling to provide actionable insights and reduce detection time
Define and measure reliability: Partner with engineering teams to establish and track SLIs/SLOs for critical scenarios
Live site health reviews: Lead and facilitate live site health review meetings, translating business requirements into metrics and action
Engineering for prevention: Translate learnings into proactive tests, product fixes, rollout guardrails, and automation that reduce risk and improve service health
Reliability drills: Design and execute drills to simulate product failures, validate resilience and recovery, and develop resilience strategies
Define Policy: Draft process and policy documentation for how the organization prepares for, responds to, and prevents incidents

Fulltime

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...

Location

United States, Santa Clara

Salary:

151600.00 - 245300.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
Proficient in Python and/or Go
Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
Experience in Production Engineering, DevOps, or Site Reliability
Expertise in the public cloud (GCP or AWS), especially in GCP
Strong Linux administration, internals, and network troubleshooting
Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
Excellent written and verbal communication, able to collaborate and rally support

Job Responsibility

Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build, and operate reliable, secure Cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate robust deployment of robust services
Orchestrate end-to-end monitoring and alerting
Participate with SRE and Dev teams in the on-call rotation
Lead root cause analysis of critical business and production issues

What we offer

Restricted stock units and a bonus

Fulltime

Principal Site Reliability Engineer

Our Site Reliability Engineering group within Enterprise Infrastructure combines...

Location

United States, Westlake; Merrimack

Salary:

Not provided

Fidelity Investments

Expiration Date

Until further notice

Requirements

Bachelor’s degree or higher in a technology related field (e.g. Engineering, Computer Science, etc.) required, master’s degree a plus
5+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale
Strong experience in Cloud development (preferably AWS) and migration skills
Experience with building and operating highly resilient platforms in cloud environments
2-4 years of experience in software development with Python, NodeJS, or Java with a focus on SDLC and automation
Hands-on experience with container orchestration, preferably with Kubernetes
Experience operating and implementing distributed & highly concurrent service-based
Ability to automate with various scripting languages (Python, Shell scripting, etc)
Experience managing systems using infrastructure as code tools (IAM, ARM, Terraform, Chef)
Solid understanding of Cloud Computing and DevOps concepts including CI/CD pipelines

Job Responsibility

Help define and execute a comprehensive reliability and observability strategy, ensuring that Fidelity’s systems are always available when our customers need them
Bring together technical, procedural, and financial data to reduce toil and increase efficiency
Execute plans for technical standardization and process refinement within the engineering organization, especially for Site Reliability Engineers
Coach peer SREs and development teams on how to build highly available systems

Fulltime

Principal Site Reliability Engineer

Are you looking for an exciting chance to boost performance? Come aboard Fidelit...

Location

United States, Merrimack, NH

Salary:

Not provided

Fidelity Investments

Expiration Date

Until further notice

Requirements

Demonstrated expertise in site reliability with a strong background in managing large-scale systems
Experience in automation, monitoring, and incident management
Understand cloud infrastructure
Excel in Python scripting
Drive projects end-to-end
Extensive knowledge of production on-call support for cloud infrastructure running on the EKS platform
Experience with load balancer traffic distribution (especially Akamai and VMware Avi)
Extensive experience in Change, Incident, and Problem Management and on-call support
Extensive knowledge of observability tools (preferably DataDog) and Grafana
Experience monitoring logs, metrics, APM, events, and infrastructure, including dashboard creation

Job Responsibility

Guarantee the reliability and efficiency of our systems
Spearhead projects to improve FFIO Market Data Services
Work closely with a team of committed experts
Have a substantial influence on our operations

Fulltime

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer you will lead crucial initiatives in the...

Location

United States, Multiple Locations

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
7+ years technical experience working with large-scale cloud or distributed systems
Experience building or scaling incident response programs at organizational or enterprise scope
Background in SRE, production engineering, or platform reliability roles
Track record of reducing customer impact through improved incident handling, tooling, or prevention

Job Responsibility

Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
Coach and help develop a team of Site Reliability Engineers serving as incident responders
Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
Help hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
Communicate clearly and credibly with senior leadership during customer impacting events

Fulltime
