As a member of the CET SAvE organization, you will join the Production Operations team for Schwab’s Mobile Application while driving the adoption of Site Reliability Engineering (SRE) best practices. In this critical role, you will shape automation, tooling, observability, and reliability strategies across engineering teams to enhance service health and performance.
Job Responsibilities:
Lead and Optimize Reliability – Drive tactical and strategic initiatives to improve service health, performance, and availability for Schwab’s Mobile Application
Champion SRE Best Practices – Implement key operational methodologies, including SLIs, SLOs, error budgets, blameless postmortems, and capacity planning
Enhance Observability & Automation – Develop and improve monitoring, telemetry, and alerting systems to proactively detect and resolve issues, reducing MTTD and MTTR
Drive Tooling & DevOps Innovation – Design and implement automation solutions that reduce toil, streamline deployments, and improve overall system resilience
Collaborate Cross-Functionally – Partner closely with Mobile Engineering, DevOps, and Infrastructure teams to enhance scalability, security, and reliability
Provide On-Call Support – Participate in an on-call rotation to ensure the reliability of Schwab’s Retail Web and Mobile applications
Requirements:
Bachelor of Science or equivalent in Computer Science or a related field
5+ years of experience in software development and site reliability engineering (SRE), with a strong focus on cloud technologies
5+ years in DevOps engineering, with expertise in automating production operations and developing self-healing systems
5+ years hands-on experience with CI/CD tools, logging, observability, and telemetry solutions such as Bitbucket, Bamboo, GitHub, Jenkins, AppDynamics, Splunk, Prometheus, and Grafana
3+ years of proven ability to implement SRE principles, including SLIs, SLOs, error budgets, monitoring, blameless postmortems, and toil reduction
Nice to have:
Strong proficiency in programming and automation using Python, Java, CloudFormation, or Terraform for Infrastructure-as-Code (IaC) solutions
Familiarity with cloud infrastructure platforms (AWS, GCP, or Azure)
Deep understanding of Compute, Storage, Networking, Load Balancing, CDN, DNS, and Security stacks in cloud environments
Ability to work independently in a fast-paced, high-impact environment while collaborating effectively across teams
Excellent verbal and written communication skills, with the ability to convey complex technical concepts to both technical and non-technical stakeholders
What we offer:
401(k) with company match and employee stock purchase plan
Paid time off for vacation and volunteering, plus a 28-day sabbatical after every 5 years of service for eligible positions