At Schwab, you’re empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us “challenge the status quo” and transform the finance industry together. In this role, you’ll lead the technical vision and architecture for our Site Reliability Engineering (SRE) and AIOps function, shaping how reliability, automation, and intelligent operations scale across the enterprise. This is not a traditional production support role: it requires engineering and coding experience. You’ll work at the intersection of cloud-native platforms, distributed systems, and AI-driven operations, partnering closely with Engineering, Product, Security, and Infrastructure leaders to build resilient, self-healing systems that support millions of clients. This is a highly visible leadership role where your expertise influences both technology strategy and how teams operate day to day.
Job Responsibilities:
SRE Architecture & Reliability Strategy — Define and own the end-to-end reliability architecture, including SLO/SLI frameworks, error budget policies, observability standards, and resilience patterns across distributed microservices environments
AIOps Platform Architecture — Design and architect the AIOps platform encompassing ML-driven anomaly detection, predictive alerting, automated root cause analysis, event correlation, and intelligent remediation workflows
Infrastructure & Platform Design — Lead architecture decisions for cloud-native infrastructure (GCP/AWS/Azure), Kubernetes orchestration, service mesh (Istio/Envoy), infrastructure-as-code (Terraform/Pulumi), and multi-region disaster recovery strategies
Observability & Monitoring Architecture — Architect the unified observability stack integrating metrics, logs, traces, and events using technologies such as OpenTelemetry, Grafana, Datadog, and custom ML pipelines for intelligent alerting
Automation & Self-Healing Systems — Drive the architecture of automated remediation frameworks, self-healing infrastructure, chaos engineering pipelines, and progressive deployment strategies (canary, blue-green, feature flags) to achieve zero-touch operations
Lead technical due diligence and drive consistency across SRE and platform teams
Team Development & Mentorship — Build, mentor, and grow a team of senior SRE architects and engineers
Foster a culture of engineering excellence, continuous learning, and innovation in reliability and AI-driven operations
Stakeholder & Executive Engagement — Partner with Engineering, Product, Security, and Infrastructure leadership to align reliability and AIOps investments with business priorities
Present technical strategies to executive stakeholders
Requirements:
12+ years of experience in software development and engineering, infrastructure, or SRE
5+ years in a senior architecture or technical leadership role
Deep expertise in distributed systems, cloud-native architectures, and large-scale production environments
Hands-on experience with Kubernetes, Docker, service mesh, CI/CD pipelines, and infrastructure-as-code tools
Strong understanding of ML/AI concepts and their application to operational intelligence
Proven experience designing observability platforms using OpenTelemetry, Prometheus, Grafana, Datadog, Splunk, or equivalent
Expertise in incident management frameworks, chaos engineering, and SLO-driven reliability practices
Experience with major cloud platforms (AWS, GCP, Azure) at scale
Strong communication and executive presence with the ability to translate complex technical concepts for non-technical stakeholders
What we offer:
401(k) with company match and employee stock purchase plan
Paid time for vacation and volunteering, and a 28-day sabbatical after every 5 years of service for eligible positions