The Production Support SRE Engineer is responsible for ensuring the reliability, stability, and operational excellence of Workplace Services applications and platforms. This role blends hands‑on production support with proactive reliability engineering—partnering with product, delivery, SRE, and infrastructure teams to reduce toil, improve observability, and strengthen service health. Each engineer will own one or more service areas, acting as the primary SRE contact and driving operational readiness, incident response, and continuous improvement initiatives. Success includes measurable improvements in availability, alert quality, automation adoption, and adherence to Schwab’s SRE standards.
Job Responsibilities:
Serve as the primary production support engineer for assigned Workplace Services applications, ensuring high availability, rapid incident response, and effective participation in both market‑hour and after‑hours on‑call rotations
Lead root‑cause analysis, support SLO breach investigations, and partner with product and delivery teams to restore and maintain service health
Champion Schwab’s SRE principles by improving observability, structured ELI logging, meaningful alerting, automation, and standardized dashboard/reporting patterns
Ensure new features, releases, and operational changes meet reliability, monitoring, and readiness expectations
Develop and maintain runbooks, operational guides, incident playbooks, and service documentation
Identify sources of operational toil, drive automation efforts, rationalize alerts, and deliver data‑driven insights and trends to product and engineering teams for proactive reliability improvements
Act as the embedded SRE partner for your service area—attending key ceremonies, advising teams on operational risks, and promoting best practices in reliability engineering
Foster a culture of blameless postmortems, continuous learning, and cross‑team enablement
Requirements:
2+ years of experience in production support, incident management, and real‑time troubleshooting for high‑availability systems
Solid understanding of SRE principles, including SLIs, SLOs, error budgets, and incident response frameworks
Hands-on experience with observability and monitoring tools such as Splunk, Grafana, Moogsoft, or xMatters
Proficiency with structured logging, log analysis, and alert tuning
Ability to create and maintain runbooks, operational guides, and incident playbooks
Familiarity with automation concepts and ability to identify and reduce operational toil through scripts, tooling, or process improvements
Strong communication skills with the ability to translate complex technical issues into clear, business-friendly language
Ability to partner with product, engineering, and delivery teams to embed reliability into the development lifecycle
Experience participating in on-call rotations, including market‑hours support and after‑hours escalations
Bachelor’s degree in Computer Science, Engineering, Information Systems, or equivalent practical experience
Prior background in production support, site reliability engineering, systems engineering, or operations
What we offer:
401(k) with company match and employee stock purchase plan
Paid time for vacation and volunteering, and a 28-day sabbatical after every 5 years of service for eligible positions