As a Manager – SRE & Operations in Client Data Technology (CDT), you will lead System Availability Engineering (SAvE) teams for CAT & O2 applications, playing a critical role in ensuring the availability of CDT ecosystems and guiding the development, automation, tooling, and realization of SRE best practices.
Job Responsibilities:
Identifying tactical and strategic opportunities to improve service health, performance, reliability, and telemetry across the CDT platform
Leading the team with a data-driven mindset, focusing on key performance metrics such as MTTD, MTTR, and availability, in close collaboration with Trading development and IT Operations teams
Leading the design, architecture, and implementation of an availability and resiliency roadmap that delivers modernized tooling and metrics
Working closely with the development team to define a sustainable operating model for CDT applications and their databases, focusing on platform scale, availability, fault tolerance, and performance
Leading automation and Infrastructure as Code (IaC) practices so that teams follow patterns that ensure repeatability, consistency, and portability
Identifying toil and technical debt, developing a comprehensive plan, and leading the team through its execution
Driving a shift-left mindset and influencing architectural decisions to ensure resiliency and scale from the outset of the software development process
Being a hands-on technical leader who leads the team from the front and inspires thought leadership in the team
Providing on-call support by participating in an on-call rotation to ensure the reliability of CDT applications
Requirements:
9+ years of software development and site reliability engineering experience supporting production applications on-premises and in any public cloud environment, PCF, and IaaS
7+ years of experience in software development and site reliability engineering (SRE), with a strong focus on cloud technologies
5+ years in DevOps engineering, with expertise in automating production operations and developing self-healing systems
5+ years hands-on experience with CI/CD tools, logging, observability, and telemetry solutions such as Bitbucket, Bamboo, GitHub, Jenkins, AppDynamics, Splunk, Prometheus, and Grafana
3+ years of proven ability to implement SRE principles, including SLIs, SLOs, error budgets, monitoring, blameless postmortems, and toil reduction
1+ year of Schwab technology domain experience gained as a current or recent contractor
Strong proficiency in programming and automation using Python, Java, CloudFormation, or Terraform for Infrastructure-as-Code (IaC) solutions
Familiarity with Cloud Infrastructure platforms (AWS, GCP, and Azure)
Deep understanding of Compute, Storage, Networking, Load Balancing, CDN, DNS, and Security stacks in cloud environments
Ability to work independently in a fast-paced, high-impact environment while collaborating effectively across teams
Excellent verbal and written communication skills, with the ability to convey complex technical concepts to both technical and non-technical stakeholders
Proficiency in programming languages for automating repeatable processes and building IaC solutions (Python, CloudFormation, Terraform)
Knowledge of databases (SQL, Aerospike, Postgres preferred)
Knowledge of IBM MQ, RabbitMQ, and Kafka
What we offer:
401(k) with company match and an employee stock purchase plan
Paid time for vacation and volunteering, and a 28-day sabbatical after every 5 years of service for eligible positions