Principal Site Reliability Engineer Job at The Muse

Job Description

Arcadia’s customers rely on us to securely process and deliver high-value healthcare insights. Reliability, availability, performance, and security are foundational to trust—especially when systems support critical workflows and handle PHI. As a Principal Site Reliability Engineer, you’ll set reliability strategy across teams, drive cross-cutting platform improvements, and ensure we can scale delivery without scaling operational burden.

Job Responsibility

Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services
Champion infrastructure security best practices for environments that handle PHI (least privilege, secrets management, auditability, and defense-in-depth)
Mentor Staff and Senior engineers through design reviews, code reviews, pairing, and documentation; raise reliability standards across teams

Requirements

8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
Proficiency in Python for building automation, tooling, and reliability improvements
Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)
Excellent communication skills: can translate technical risk and reliability tradeoffs to engineering leadership, product, and stakeholders; produces high-quality docs/runbooks

Nice to have

Experience with ScyllaDB or similar distributed databases (e.g., Cassandra) and their reliability/performance characteristics
Experience with Spark or data processing platforms, including reliability and cost considerations for large-scale workloads
Familiarity with agentic coding practices and principles (safe automation, reviewable changes, guardrail-first workflows)
Strong infrastructure security knowledge: threat modeling for cloud/Kubernetes, RBAC/IAM design, secrets management, supply chain security, and security observability

What we offer

Pet Insurance
Health Insurance
Dental Insurance
Vision Insurance
FSA
HSA
HSA With Employer Contribution
Life Insurance
Short-Term Disability
Long-Term Disability
Fitness Subsidies
Mental Health Benefits
Family Support Resources
Non-Birth Parent Or Paternity Leave
Adoption Leave
Fertility Benefits
Birth Parent Or Maternity Leave
Hybrid Work Opportunities
Flexible Work Hours
Remote Work Opportunities
Casual Dress
Pet-Friendly Office
Snacks
Company Outings
Commuter Benefits Program
Paid Vacation
Unlimited Paid Time Off
Paid Holidays
Personal/Sick Days
Leave Of Absence
401(K) With Company Matching
401(K)
Performance Bonus
Work Visa Sponsorship
Promote From Within
Access To Online Courses
Lunch And Learns
Diversity, Equity, And Inclusion Program
Employee Resource Groups (ERG)
