Principal Site Reliability Engineer

Job Posted April 02, 2026

Job Description

Arcadia’s customers rely on us to securely process and deliver high-value healthcare insights. Reliability, availability, performance, and security are foundational to trust—especially when systems support critical workflows and handle PHI. As a Principal Site Reliability Engineer, you’ll set reliability strategy across teams, drive cross-cutting platform improvements, and ensure we can scale delivery without scaling operational burden.

Job Responsibility

  • Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
  • Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
  • Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
  • Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
  • Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
  • Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
  • Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services
  • Champion infrastructure security best practices for environments that handle PHI (least privilege, secrets management, auditability, and defense-in-depth)
  • Mentor Staff and Senior engineers through design reviews, code reviews, pairing, and documentation; raise reliability standards across teams

Requirements

  • 8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
  • Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
  • Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
  • Proficiency in Python for building automation, tooling, and reliability improvements
  • Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)
  • Excellent communication skills: can translate technical risk and reliability tradeoffs to engineering leadership, product, and stakeholders; produces high-quality docs/runbooks

Nice to have

  • Experience with ScyllaDB or similar distributed databases (e.g., Cassandra) and their reliability/performance characteristics
  • Experience with Spark or data processing platforms, including reliability and cost considerations for large-scale workloads
  • Familiarity with agentic coding practices and principles (safe automation, reviewable changes, guardrail-first workflows)
  • Strong infrastructure security knowledge: threat modeling for cloud/Kubernetes, RBAC/IAM design, secrets management, supply chain security, and security observability

What we offer

  • Pet Insurance
  • Health Insurance
  • Dental Insurance
  • Vision Insurance
  • FSA
  • HSA
  • HSA With Employer Contribution
  • Life Insurance
  • Short-Term Disability
  • Long-Term Disability
  • Fitness Subsidies
  • Mental Health Benefits
  • Family Support Resources
  • Non-Birth Parent Or Paternity Leave
  • Adoption Leave
  • Fertility Benefits
  • Birth Parent Or Maternity Leave
  • Hybrid Work Opportunities
  • Flexible Work Hours
  • Remote Work Opportunities
  • Casual Dress
  • Pet-Friendly Office
  • Snacks
  • Company Outings
  • Commuter Benefits Program
  • Paid Vacation
  • Unlimited Paid Time Off
  • Paid Holidays
  • Personal/Sick Days
  • Leave Of Absence
  • 401(K) With Company Matching
  • 401(K)
  • Performance Bonus
  • Work Visa Sponsorship
  • Promote From Within
  • Access To Online Courses
  • Lunch And Learns
  • Diversity, Equity, And Inclusion Program
  • Employee Resource Groups (ERG)

Similar Jobs for Principal Site Reliability Engineer

8 matching positions

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...
Location
United States, Santa Clara
Salary
151600.00 - 245300.00 USD / Year
Palo Alto Networks
Expiration Date
Until further notice
Requirements
  • BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
  • Proficient in Python and/or Go
  • Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in the public cloud (GCP or AWS), especially in GCP
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Experience with CI/CD pipelines, GitLab, and GitHub preferred
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build, and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate robust deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
  • Fulltime

Principal Site Reliability Engineer

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location
United States, Redmond
Salary
142800.00 - 304200.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Candidates must be able to meet Microsoft, customer and/or government security screening requirements required for this role
  • This role requires access to Microsoft Government cloud environments, including GCC Moderate (GCCM), GCC High (GCCH), and Department of Defense (DoD) environments
  • The successful candidate must be able to obtain and maintain the appropriate background investigations and customer screenings required for access to these environments
  • For access to GCCH and DoD environments, this role requires the ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
  • For access to GCCM environments, this role requires the ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • For manager-level roles, a Tier 5 (T5) background investigation is preferred
Job Responsibility
  • Define and drive reliability strategy, SLO frameworks, and operational best practices across Substrate workloads in highly regulated environments
  • Serve as an actively engaged senior on-call engineer (OCE), participating in on-call rotations and leading incident response for Substrate services in regulated environments
  • Provide hands-on leadership during the most complex or high-impact incidents, setting technical direction and response strategy
  • Drive high-quality post-incident reviews that result in durable, systemic engineering improvements across teams
  • Architect and deliver large-scale automation, observability, and self-healing solutions
  • Drive architectural decisions and define software engineering standards that make reliability, security, and compliance intrinsic to Substrate services
  • Influence service design and engineering decisions across organizational boundaries
  • Mentor senior and principal engineers and shape the long-term technical direction of the SRE discipline
  • Represent Substrate SRE perspectives with senior leadership and cross-functional partners
  • Fulltime

Principal Site Reliability Engineer

We're looking for a site reliability engineer at the intersection of software de...
Location
EMEA (Europe, Middle East and Africa region, employer unspecified country)
Salary
Not provided
Copper.co
Expiration Date
Until further notice
Requirements
  • Experience in designing, analyzing, and troubleshooting distributed systems or micro-services architectures
  • Established expertise in observability and incident management
  • Proven experience in driving organizational change
  • Excellent communication skills, with a systematic problem-solving approach
Job Responsibility
  • Shape SRE: Define how we think about reliability, observability, and operational excellence
  • Drive the adoption of SRE principles across the organization
  • Scale Through Automation: Champion architectural improvements that enhance both system reliability and deployment velocity
  • Drive Technical Excellence: Engage in and improve the lifecycle of microservices
  • Lead Through Influence: Partner with engineering and product leadership to embed reliability into our product development lifecycle
  • Conduct blameless postmortems
  • Mentor engineers across the organisation on SRE practices
What we offer
  • 35 Days paid time off per annum, inclusive of annual leave and public holidays
  • Employees also receive one additional day of annual leave for each year of service
  • Private Health Insurance
  • Fulltime

Principal Site Reliability Engineer

Substrate powers Microsoft 365. Keeping it up, resilient, and continuously impro...
Location
United States, Multiple Locations
Salary
142800.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
Job Responsibility
  • Incident management excellence: Lead high-severity incident response, debug complex issues, and drive incidents to resolution with clear communication and ownership. Ensure high-quality postmortem reports are created and enforce repair-item SLAs
  • Improve observability: Enhance telemetry, alerting, and dashboards using One Microsoft tooling to provide actionable insights and reduce detection time
  • Define and measure reliability: Partner with engineering teams to establish and track SLIs/SLOs for critical scenarios
  • Live site health reviews: Lead and facilitate live site health review meetings, translating business requirements into metrics and action
  • Engineering for prevention: Translate learnings into proactive tests, product fixes, rollout guardrails, and automation that reduce risk and improve service health
  • Reliability drills: Design and execute drills to simulate product failures, validate resilience and recovery, and develop resilience strategies
  • Define Policy: Draft process and policy documentation for how the organization prepares for, responds to, and prevents incidents
  • Fulltime

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...
Location
United States, Santa Clara
Salary
151600.00 - 245300.00 USD / Year
Palo Alto Networks
Expiration Date
Until further notice
Requirements
  • BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
  • Proficient in Python and/or Go
  • Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in the public cloud (GCP or AWS), especially in GCP
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
  • Excellent written and verbal communication, able to collaborate and rally support
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build, and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate robust deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
What we offer
  • Restricted stock units and a bonus
  • Fulltime

Principal Site Reliability Engineer

Our Site Reliability Engineering group within Enterprise Infrastructure combines...
Location
United States, Westlake; Merrimack
Salary
Not provided
Fidelity Investments
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree or higher in a technology related field (e.g. Engineering, Computer Science, etc.) required, master’s degree a plus
  • 5+ years of hands-on experience deploying and/or supporting highly distributed multi-tiered systems at scale
  • Strong experience in Cloud development (preferably AWS) and migration skills
  • Experience with building and operating highly resilient platforms in cloud environments
  • 2-4 years of experience in software development with Python, NodeJS, or Java with a focus on SDLC and automation
  • Hands-on experience with container orchestration, preferably with Kubernetes
  • Experience operating and implementing distributed & highly concurrent service-based
  • Ability to automate with various scripting languages (Python, Shell scripting, etc)
  • Experience managing systems using infrastructure as code tools (IAM, ARM, Terraform, Chef)
  • Solid understanding of Cloud Computing and DevOps concepts including CI/CD pipelines
Job Responsibility
  • Help define and execute a comprehensive reliability and observability strategy, ensuring that Fidelity’s systems are always available when our customers need them
  • Bring together technical, procedural, and financial data to reduce toil and increase efficiency
  • Execute plans for technical standardization and process refinement within the engineering organization, especially for Site Reliability Engineers
  • Coach peer SREs and development teams on how to build highly available systems
  • Fulltime

Principal Site Reliability Engineer

Are you looking for an exciting chance to boost performance? Come aboard Fidelit...
Location
United States, Merrimack, NH
Salary
Not provided
Fidelity Investments
Expiration Date
Until further notice
Requirements
  • Demonstrated expertise in site reliability with a strong background in managing large-scale systems
  • Experience in automation, monitoring, and incident management
  • Understand cloud infrastructure
  • Excel in Python scripting
  • Drive projects end-to-end
  • Extensive knowledge of production on-call support for cloud infrastructure running on the EKS platform
  • Experience with load balancer traffic distribution (especially Akamai and VMware Avi)
  • Extensive experience in Change, Incident, and Problem Management and on-call support
  • Extensive knowledge of observability tools (preferably Datadog) and Grafana
  • Experience monitoring logs, metrics, APM, events, and infrastructure, including dashboard creation
Job Responsibility
  • Guarantee the reliability and efficiency of our systems
  • Spearhead projects to better FFIO Market Data Services
  • Work closely with a team of committed experts
  • Create a substantial influence on our operations
  • Fulltime

Principal Site Reliability Engineer

As a Principal Site Reliability Engineer you will lead crucial initiatives in the...
Location
United States, Multiple Locations
Salary
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • 7+ years technical experience working with large-scale cloud or distributed systems
  • Experience building or scaling incident response programs at organizational or enterprise scope
  • Background in SRE, production engineering, or platform reliability roles
  • Track record of reducing customer impact through improved incident handling, tooling, or prevention
Job Responsibility
  • Own execution quality for Substrate high severity incidents, ensuring clear command, decisive leadership, and forward momentum during high impact events
  • Act as the senior incident leader or sponsor for long running, high stakes, or cross service incidents, ensuring alignment on impact, risk, and recovery priorities
  • Partner closely with Incident Managers, Subject Matter Experts, and service leaders to ensure effective diagnosis, escalation, and mitigation when ownership is unclear or action is blocked
  • Ensure high quality post incident reviews and drive accountability for repair items that reduce recurrence and systemic risk
  • Ensure consistent application of severity and priority models, outage declaration criteria, and executive escalation paths
  • Coach and help develop a team of Site Reliability Engineers serving as incident responders
  • Build a culture of calm execution, accountability, psychological safety, and continuous learning during and after incidents
  • Help hire and grow senior talent capable of operating as trusted leaders in high pressure, executive visible situations
  • Serve as a trusted advisor to engineering leaders and executives on live site risk, readiness, and incident response maturity
  • Communicate clearly and credibly with senior leadership during customer impacting events
  • Fulltime