Senior Technical Architect – Site Reliability Engineering & AIOps

Charles Schwab

Location:
United States, Austin

Contract Type:
Not provided

Salary:
210,000 - 240,000 USD / Year

Job Description:

At Schwab, you’re empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us “challenge the status quo” and transform the finance industry together.

In this role, you’ll lead the technical vision and architecture for our Site Reliability Engineering (SRE) and AIOps function, shaping how reliability, automation, and intelligent operations scale across the enterprise. This is not a traditional production support role; it requires engineering and coding experience. You’ll work at the intersection of cloud-native platforms, distributed systems, and AI-driven operations, partnering closely with Engineering, Product, Security, and Infrastructure leaders to build resilient, self-healing systems that support millions of clients. This is a highly visible leadership role in which your expertise influences both technology strategy and how teams operate day to day.

Job Responsibility:

  • SRE Architecture & Reliability Strategy: Define and own the end-to-end reliability architecture, including SLO/SLI frameworks, error budget policies, observability standards, and resilience patterns across distributed microservices environments
  • AIOps Platform Architecture: Design and architect the AIOps platform encompassing ML-driven anomaly detection, predictive alerting, automated root cause analysis, event correlation, and intelligent remediation workflows
  • Infrastructure & Platform Design: Lead architecture decisions for cloud-native infrastructure (GCP/AWS/Azure), Kubernetes orchestration, service mesh (Istio/Envoy), infrastructure-as-code (Terraform/Pulumi), and multi-region disaster recovery strategies
  • Observability & Monitoring Architecture: Architect the unified observability stack integrating metrics, logs, traces, and events using technologies such as OpenTelemetry, Grafana, Datadog, and custom ML pipelines for intelligent alerting
  • Automation & Self-Healing Systems: Drive the architecture of automated remediation frameworks, self-healing infrastructure, chaos engineering pipelines, and progressive deployment strategies (canary, blue-green, feature flags) to achieve zero-touch operations
  • Technical Leadership & Governance: Establish architecture review boards, technical standards, design patterns, and reference architectures; lead technical due diligence and drive consistency across SRE and platform teams
  • Team Development & Mentorship: Build, mentor, and grow a team of senior SRE architects and engineers; foster a culture of engineering excellence, continuous learning, and innovation in reliability and AI-driven operations
  • Stakeholder & Executive Engagement: Partner with Engineering, Product, Security, and Infrastructure leadership to align reliability and AIOps investments with business priorities; present technical strategies to executive stakeholders

Requirements:

  • 12+ years of experience in software development and engineering, infrastructure, or SRE
  • 5+ years in a senior architecture or technical leadership role
  • Deep expertise in distributed systems, cloud-native architectures, and large-scale production environments
  • Hands-on experience with Kubernetes, Docker, service mesh, CI/CD pipelines, and infrastructure-as-code tools
  • Strong understanding of ML/AI concepts and their application to operational intelligence
  • Proven experience designing observability platforms using OpenTelemetry, Prometheus, Grafana, Datadog, Splunk, or equivalent
  • Expertise in incident management frameworks, chaos engineering, and SLO-driven reliability practices
  • Experience with major cloud platforms (AWS, GCP, Azure) at scale
  • Strong communication and executive presence with the ability to translate complex technical concepts for non-technical stakeholders

What we offer:

  • 401(k) with company match and employee stock purchase plan
  • Paid time off for vacation and volunteering, and a 28-day sabbatical after every 5 years of service for eligible positions
  • Paid parental leave and family building benefits
  • Tuition reimbursement
  • Health, dental, and vision insurance

Additional Information:

Job Posted:
March 01, 2026

Expiration:
March 07, 2026

Employment Type:
Full-time

Work Type:
On-site work
