Senior Site Reliability Engineer, HSBC

HSBC

Location:
China, Shanghai

Category:
IT - Software Development

Contract Type:
Not provided

Salary:

Not provided

Job Description:

Digital Business Services (DBS)

Our GCIO organisation plays a critical role for the bank. This team partners with the businesses to build the platforms, systems, and products that our customers use every day. We keep people’s money and data safe, and are at the forefront of driving innovation for our businesses, customers, and colleagues. We are currently seeking an experienced professional to join our team.

Job Responsibility:

Design, develop, and implement automation tools and scripts to reduce manual operational tasks ('toil') and enhance system resilience
Ensure high availability (e.g., 99.99% uptime) of critical banking applications, including core banking, payment systems, global platforms, and local systems
Conduct capacity planning and chaos engineering to test and improve system resilience under failure conditions
Participate in on-call rotations to respond to production incidents, troubleshoot issues, and conduct post-mortems to prevent recurrence
Collaborate with production support teams for rapid incident resolution and escalate complex issues to application teams or vendors as needed
Work closely with production support teams to streamline incident handling and integrate automated solutions into support processes
Partner with application development teams to embed reliability practices into the software development lifecycle (SDLC)
Engage with the bank's operational resilience project team to align on initiatives for regulatory compliance, disaster recovery, and system robustness
Coordinate with global and regional SRE and DevOps teams to ensure consistency in tools, processes, and standards across distributed banking systems
Implement and maintain monitoring solutions to track service-level indicators (SLIs) and ensure service-level objectives (SLOs) are met
Analyze system performance metrics and proactively address potential issues to maintain operational stability
Drive continuous improvement in reliability practices, including automation, incident response, and problem management processes
Contribute to error budget discussions to balance reliability with innovation in banking systems
Ensure systems adhere to China's regulatory requirements (e.g., Cybersecurity Law, data localization) and global banking standards
Implement secure coding practices and collaborate with security teams to protect sensitive financial data

Requirements:

Bachelor's degree in Computer Science, Information Technology, or a related field. Advanced degrees or certifications (e.g., ITIL, AWS Certified Solutions Architect, Google SRE) are a plus
Minimum of 5 years of experience in site reliability engineering, software development, or systems engineering, preferably in a financial services environment
Proven experience in automating operational processes and managing high-availability systems
Experience collaborating with production support, application development, and global teams in a distributed environment
Programming: Proficiency in Python, Go, Java, or Ruby for automation and tool development
Systems: Deep knowledge of Linux/Unix systems for administration, performance tuning, and debugging
Cloud and Infrastructure: Expertise in AWS, Azure, or GCP, and Infrastructure as Code (IaC) tools like Terraform or Ansible
Containerization: Experience with Docker and Kubernetes for managing containerized banking applications
Monitoring: Proficiency in Prometheus, Grafana, Splunk, or Datadog for observability and performance monitoring
CI/CD: Familiarity with Jenkins, GitLab CI, or GitHub Actions for integrating reliability into deployment pipelines
Networking: Knowledge of TCP/IP, DNS, and load balancing for diagnosing connectivity issues
Chaos Engineering: Experience with tools like Chaos Monkey or Gremlin to test system resilience
Excellent verbal and written communication skills in English and Mandarin to engage with local teams, global/regional SRE and DevOps teams, and the operational resilience project team
Ability to explain complex technical concepts to non-technical stakeholders, including bank operations and compliance teams
Strong problem-solving skills and the ability to remain calm under pressure during critical incidents
Collaborative mindset with a focus on fostering teamwork across production support, application, and resilience project teams
Proactive approach to identifying and mitigating risks to system reliability
Willingness to participate in on-call rotations for incident response and support
Ability to work across time zones to collaborate with global and regional SRE/DevOps teams
Strong understanding of banking systems (e.g., core banking, payment platforms) and compliance with local and global regulations

Additional Information:

Job Posted:
August 05, 2025

Expiration:
September 02, 2025

Employment Type:

Full-time

Work Type:

Hybrid work
