Principal Architect, Site Reliability Engineering

United States, Southlake, TX · 221000.00 - 252000.00 USD / Year · Job Posted May 28, 2026

Job Description

At Schwab, you’re empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us “challenge the status quo” and transform the finance industry together.

We believe in the importance of in-office collaboration and fully intend for the selected candidate for this role to work on site in the specified location(s).

Workplace Services Engineering (WSE) is an organization within Schwab Technology Services that is embarking on a major transformation. We support Workplace Services, and we’re shaping the future of how people experience financial well-being at work. We partner with leading employers to deliver innovative retirement, equity, and workplace financial solutions that help millions of participants build stronger financial futures. This is a fast-growing, high-impact business where scale meets purpose, and where your work directly influences how people plan, save, invest, and succeed.

As a key growth engine for the firm, we’re investing more than ever to expand our capabilities, modernize platforms, and elevate the experiences we deliver to employers and their employees. Our teams work at the intersection of technology, service, and financial expertise, supporting workplace clients with solutions that scale, adapt, and deliver meaningful outcomes. Here, your ideas help shape what’s next for workplace financial services. If you’re energized by solving complex problems, collaborating across disciplines, and making a real difference in the workplace services industry, you’ll find your place here.

As a Principal Architect, Site Reliability Engineering for Schwab's Technology Solutions organization, you will be responsible for building a purposeful, proactive, and sustainable approach to reliability on a foundation of SRE principles. You will partner with multiple support teams, architects, developers, and other stakeholders to develop common tools and guidance and drive adoption of key reliability engineering practices in support of large-scale and mission-critical services. Through your deep SRE knowledge and history of implementation, you will have open, candid conversations with senior leaders and engineers and play a pivotal role in establishing a foundational SRE practice at Schwab.

Job Responsibility

  • Evangelize SRE mindset and practice across the Schwab Technology Solutions organization
  • Partner with support, development, and business stakeholders to develop, measure, and leverage service level objectives
  • Design and develop solutions to eliminate toil and manual effort from day-to-day support responsibilities
  • Identify and implement improvements to logging, metrics, and tracing telemetry and triaging capabilities across a diverse technology stack
  • Lead complex triage and postmortem activities for critical issues and drive prioritization/resolution of remediation items
  • Perform chaos engineering experiments to improve application resilience to known and unknown failures
  • Document reliability guidance and best practices, and advocate for and drive their adoption
  • Foster a culture of learning through coaching, mentoring, and knowledge sharing around reliability practices, processes, and tools
  • Develop tools, frameworks, and instrumentation to validate and increase release success for applications

Requirements

  • 5+ years in an SRE role, including 3+ years in an architect or leadership position, with a hands-on track record of operating mission-critical systems at scale
  • 3+ years of experience designing and implementing highly scalable and fault-tolerant systems
  • Deep practical expertise across observability, incident management, resilience engineering, and capacity planning: not just familiarity, but proven delivery in production environments
  • Demonstrated experience using AI tools to solve real reliability problems (anomaly detection, incident triage, noise reduction, postmortem acceleration, capacity forecasting, or auto-remediation) and to reduce repetitive operational toil
  • Proven ability to define and enforce technical standards across multiple engineering teams or business units without direct managerial authority
  • In-depth knowledge of resilience patterns (e.g., circuit breakers, timeouts, retries) and how to design and implement them
  • In-depth knowledge of CI/CD processes and tools to ensure software is delivered safely using known deployment strategies (e.g., blue/green, canary deployments, feature toggles)
  • Authored technical postmortems with root cause analyses and documented action items that resulted in measurable resiliency improvements
  • Contributed to the SLO strategy for at least 5 teams, ensuring alignment with business and client objectives
  • 3+ years of hands-on experience with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk), with a proven track record of setting up dashboards and alerts
  • Led or participated in cross-functional SRE-focused initiatives that included key stakeholders from both technical and business units
  • Participated in resilience or chaos engineering exercises, with documentation showing a reduction in unplanned downtime
  • Presented findings or led training sessions to share SRE practices, enhancing team performance or adoption rates for reliability engineering methods
  • Mentored SRE engineers and engineering teams in SRE best practices, with improvements in incident resolution speed and reliability metrics
  • Authored and maintained comprehensive SRE documentation for critical systems or workflows, including incident response guides, runbooks, operational playbooks, SLO implementation, and observability

What we offer

  • 401(k) with company match and employee stock purchase plan
  • Paid time for vacation, volunteering, and 28-day sabbatical after every 5 years of service for eligible positions
  • Paid parental leave and adoption/family building benefits
  • Tuition reimbursement to keep developing your career
  • Health, dental, and vision insurance

Similar Jobs for Principal Architect, Site Reliability Engineering

8 matching positions

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...
Location: Peru
Salary: Not provided
Groupon
Expiration Date: Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location: Colombia
Salary: Not provided
Groupon
Expiration Date: Until further notice
Requirements
  • 10+ years in software/systems engineering
  • 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Location: Ecuador
Salary: Not provided
Groupon
Expiration Date: Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

We are looking for a Principal Site Reliability Engineer to join the CVML Platfo...
Location: United States
Salary: 166000.00 - 293000.00 USD / Year
Blue River Technology
Expiration Date: Until further notice
Requirements
  • 8+ years of experience building infrastructure with K8S, AWS, and bare metal
  • 8+ years of experience working with Python and Go (with production experience)
  • 8+ years of experience working with infra automation tools: Terraform / Terragrunt (or Pulumi / CDK)
  • 8+ years of experience with Linux-based systems and networks, and a deep understanding of internal components, networking, and security aspects
  • Has a track record of building and maintaining scalable systems in production environments
  • Experience in building CI/CD pipelines using GitHub Actions (or GitLab / Jenkins) for application release and deployment
  • Experience in using AWS ECS, EKS, IAM, EC2, and RDS at production scale
  • Deep understanding of Kubernetes and its internals (kubelet, CRDs, etc) and experience with building and extending clusters from scratch
  • Strong problem-solving skills and ability to troubleshoot complex infrastructure and networking issues
  • Excellent communication skills to collaborate effectively with technical and non-technical stakeholders
Job Responsibility
  • System Design: Architect and implement various cloud and on-premise applications, systems, and infrastructure
  • Hybrid system integration: Integrate extremely diverse systems, configure stable integration, uptime, and monitoring
  • Edge device integration: work with edge devices of various formats and integrate them with on-prem and cloud workflows, including networking, low-level OS, and electrical/control integration
  • Low-level performance optimization: optimize the performance and throughput of the system at the filesystem, networking, and software levels
  • High-level optimization of cost and stability: optimize cost, operational stability, and supportability of highly diverse platforms and tech stack
  • Product Mindset: Collaborate with cross-functional teams to design, develop, and maintain robust, scalable, and user-friendly web and mobile data-intensive applications
  • System Integration: Build tools that enable users to easily move between different applications and platforms to utilize the strengths of each in a coherent ecosystem
  • Collaboration: Work closely with cross-functional teams, including data scientists, analysts, software engineers, and product managers, to understand data requirements and deliver data solutions that align with business goals
  • Documentation: Create and maintain technical documentation, including data flow diagrams, architecture designs, and standard operating procedures
  • Technology Evaluation: Stay up-to-date with industry trends and emerging technologies related to data engineering, recommending and implementing new tools and frameworks as appropriate
What we offer
  • Eligibility for Blue River’s bonus and benefit programs
  • Full-time

Principal Software Engineer, Trusted Data Platform

As a Principal Software Engineer, you will be a technical leader and hands-on co...
Location: India, Bangalore
Salary: Not provided
Atlassian
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related technical field
  • 10+ years of experience in backend software development, focusing on distributed systems and storage solutions
  • 5+ years of experience working with AWS storage services (S3, DynamoDB, EBS, EFS, FSx, Glacier)
  • Strong expertise in system design, architecture, and scalability for large-scale storage solutions
  • Proficiency in at least one major backend programming language (Kotlin, Java, Go, Rust, or Python)
  • Experience designing and implementing highly available, fault-tolerant, and cost-efficient storage architectures
  • Deep understanding of distributed systems, replication strategies, sharding, and caching
  • Knowledge of data security, encryption best practices, and compliance requirements (SOC2, GDPR, HIPAA)
  • Experience leading engineering teams, mentoring senior engineers, and driving technical roadmaps
  • Proficiency with observability tools, performance monitoring, and troubleshooting at scale
Job Responsibility
  • Designing and optimizing high-scale, distributed storage systems built on AWS storage technologies
  • Shaping the architecture, performance, and reliability of backend storage solutions that power critical applications at scale
  • Designing, implementing, and optimizing backend storage services that support high throughput, low latency, and fault tolerance
  • Working closely with senior engineers, architects, and cross-functional teams to drive scalability, availability, and efficiency improvements in large-scale storage solutions
  • Leading technical deep dives, architecture reviews, and root cause analyses to resolve complex production issues related to storage performance, consistency, and durability
  • Driving best practices in distributed system design, security, and cloud cost optimization
  • Mentoring senior engineers, contributing to technical roadmaps, and helping shape the long-term storage strategy
  • Collaborating with Site Reliability Engineers (SREs) to implement observability, monitoring, and disaster recovery strategies, ensuring high availability and compliance with industry standards
  • Advocating for automation, Infrastructure-as-Code (IaC), and DevOps best practices, leveraging tools like Terraform, AWS CloudFormation, Kubernetes (EKS), and CI/CD pipelines to enable scalable deployments and operational excellence
What we offer
  • Atlassians can choose where they work – whether in an office, from home, or a combination of the two
  • Atlassians have more control over supporting their family, personal goals, and other priorities
  • We can hire people in any country where we have a legal entity
  • Interviews and onboarding are conducted virtually
  • Whatever your preference - working from home, an office, or in between - you can choose the place that's best for your work and your lifestyle

Principal Site Reliability Engineering Manager

Are you a Principal Site Reliability Engineering Manager interested in improving...
Location: United States, Redmond
Salary: 139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • 3+ years of people management experience
  • 5+ years of experience planning, designing, implementing, and delivering large initiatives spanning multiple engineers as the primary owner, including operating and improving production services at scale
  • Experience leading reliability engineering for developer-facing or platform services, including incident response, automation/toil reduction, and observability (metrics/logs/tracing) built on top of mature observability platforms and practices
  • Experience working across disciplines, groups, and teams to align reliability priorities and delivery plans
  • Experience architecting, deploying, and operating enterprise scale distributed cloud services (Azure preferred), including containerization and orchestration
  • Experience operating engineering systems outer loop processes (CI/CD, build, and release platforms) with reliability, safety, and governance practices
Job Responsibility
  • Partner with engineers, product managers, and partner teams to design, operate, and maintain reliable and resilient services, with clear operational requirements (monitoring, alerting, runbooks, capacity, and failure modes)
  • Drive cross-org alignment through partnerships and co-development following the “One Microsoft” philosophy, including shared reliability standards and operational tooling
  • Build, grow, and retain a team of Site Reliability Engineers
  • Provide mentorship and coaching on reliability engineering, incident response, and pragmatic automation—within and beyond your team
  • Define, implement, and operate SLOs/SLIs and error budgets for critical engineering systems services; use them to guide prioritization and continuous improvement
  • Lead incident management for your services, including on-call health, escalation paths, and blameless post-incident reviews, modeling follow-through on corrective and preventive actions
  • Drive automation to reduce toil and improve operational efficiency across build, validation, and deployment systems (e.g., self-healing, safe rollouts, and automated remediation)
  • Establish observability (metrics, logs, traces), capacity planning, and performance management to meet reliability and latency goals at scale
  • Foster a diverse and inclusive culture where everyone can bring their full and authentic self, while holding a high bar for customer impact and reliability
Employment Type: Full-time

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...
Salary: Not provided
Groupon
Expiration Date: Until further notice
Requirements
  • 10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
  • Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
  • Proficiency in Python or Go for automation and tooling
  • Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
  • Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
  • Strong communication and influencing skills — data over hierarchy
Job Responsibility
  • Architect and maintain self-healing systems with 99.9%+ availability targets
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data
  • Build AIOps-based observability and auto-remediation pipelines
  • Apply predictive modeling to forecast failures before they impact users
  • Lead chaos, performance, and resilience testing programs
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
  • Mentor engineers and drive reliability standards across teams
  • Partner with platform, data, and product teams to ensure stability aligns with business goals
  • Support major incident response, incident review, and participate in on-call rotations
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Arcadia’s customers rely on us to securely process and deliver high-value health...
Salary: Not provided
The Muse
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in SRE, platform engineering, systems engineering, or related roles operating production services at scale
  • Demonstrated principal-level impact: leading cross-team initiatives, influencing architecture decisions, and driving sustained improvements in reliability and operations
  • Expertise in Kubernetes operations and troubleshooting, including safe rollout/rollback patterns, workload debugging, and operational guardrails
  • Strong GitOps experience with Argo CD; experience building delivery workflows and automation using Argo Workflows
  • Strong infrastructure orchestration and provisioning experience with Crossplane and Terraform; ability to define reusable platform patterns and controls
  • Deep AWS experience (IAM, networking/VPC, compute, storage, managed services, observability) and strong understanding of reliability and failure modes in cloud systems
  • Proficiency in Python for building automation, tooling, and reliability improvements
  • Strong incident management and on-call leadership experience, including measurable improvements (availability, MTTR, alert quality, cost, or operational maturity)
Job Responsibility
  • Act as the technical leader for reliability for one or more domains; set direction and standards while remaining hands-on where it matters most
  • Drive reliability strategy across critical services: define SLOs/SLIs, error budgets, and reliability KPIs aligned to customer journeys and outcomes
  • Own incident response maturity: lead complex incidents, improve incident command practices, and ensure high-quality RCAs with prioritized, tracked remediation
  • Architect and implement automation to reduce toil and risk: runbook automation, self-service tools, and safe operational workflows (Python + Argo Workflows)
  • Advance GitOps delivery practices using Argo CD: promotion strategies, progressive delivery/canaries, and guardrails that reduce deploy risk
  • Scale infrastructure management with Crossplane and Terraform: reusable patterns, policy controls, and paved roads for teams
  • Lead operational readiness and reliability reviews for new features/architectural changes; reinforce non-functional requirements (availability, latency, security, cost)
  • Improve performance and cost efficiency through capacity planning, load testing, right-sizing, and architecture recommendations across AWS services
What we offer
  • Pet Insurance
  • Health Insurance
  • Dental Insurance
  • Vision Insurance
  • FSA
  • HSA
  • HSA With Employer Contribution
  • Life Insurance
  • Short-Term Disability
  • Long-Term Disability