Senior Site Reliability Engineer

HSBC

Location:
China, Shanghai

Category:
IT - Software Development

Contract Type:
Not provided

Salary:
Not provided

Job Description:

Digital Business Services (DBS): Our GCIO organisation plays a critical role for the bank. The team partners with the businesses to build the platforms, systems, and products that our customers use every day. We keep people’s money and data safe and are at the forefront of driving innovation for our businesses, customers, and colleagues. We are currently seeking an experienced professional to join our team.

Job Responsibilities:

  • Design, develop, and implement automation tools and scripts to reduce manual operational tasks ('toil') and enhance system resilience
  • Ensure high availability (e.g., 99.99% uptime) of critical banking applications, including core banking, payment systems, global platforms, and local systems
  • Conduct capacity planning and chaos engineering to test and improve system resilience under failure conditions
  • Participate in on-call rotations to respond to production incidents, troubleshoot issues, and conduct post-mortems to prevent recurrence
  • Collaborate with production support teams for rapid incident resolution and escalate complex issues to application teams or vendors as needed
  • Work closely with production support teams to streamline incident handling and integrate automated solutions into support processes
  • Partner with application development teams to embed reliability practices into the software development lifecycle (SDLC)
  • Engage with the bank's operational resilience project team to align on initiatives for regulatory compliance, disaster recovery, and system robustness
  • Coordinate with global and regional SRE and DevOps teams to ensure consistency in tools, processes, and standards across distributed banking systems
  • Implement and maintain monitoring solutions to track service-level indicators (SLIs) and ensure service-level objectives (SLOs) are met
  • Analyze system performance metrics and proactively address potential issues to maintain operational stability
  • Drive continuous improvement in reliability practices, including automation, incident response, and problem management processes
  • Contribute to error budget discussions to balance reliability with innovation in banking systems
  • Ensure systems adhere to China's regulatory requirements (e.g., Cybersecurity Law, data localization) and global banking standards
  • Implement secure coding practices and collaborate with security teams to protect sensitive financial data

Requirements:

  • Bachelor's degree in Computer Science, Information Technology, or a related field. Advanced degrees or certifications (e.g., ITIL, AWS Certified Solutions Architect, Google SRE) are a plus
  • Minimum of 5 years of experience in site reliability engineering, software development, or systems engineering, preferably in a financial services environment
  • Proven experience in automating operational processes and managing high-availability systems
  • Experience collaborating with production support, application development, and global teams in a distributed environment
  • Programming: Proficiency in Python, Go, Java, or Ruby for automation and tool development
  • Systems: Deep knowledge of Linux/Unix systems for administration, performance tuning, and debugging
  • Cloud and Infrastructure: Expertise in AWS, Azure, or GCP, and Infrastructure as Code (IaC) tools like Terraform or Ansible
  • Containerization: Experience with Docker and Kubernetes for managing containerized banking applications
  • Monitoring: Proficiency in Prometheus, Grafana, Splunk, or Datadog for observability and performance monitoring
  • CI/CD: Familiarity with Jenkins, GitLab CI, or GitHub Actions for integrating reliability into deployment pipelines
  • Networking: Knowledge of TCP/IP, DNS, and load balancing for diagnosing connectivity issues
  • Chaos Engineering: Experience with tools like Chaos Monkey or Gremlin to test system resilience
  • Excellent verbal and written communication skills in English and Mandarin to engage with local teams, global/regional SRE and DevOps teams, and the operational resilience project team
  • Ability to explain complex technical concepts to non-technical stakeholders, including bank operations and compliance teams
  • Strong problem-solving skills and the ability to remain calm under pressure during critical incidents
  • Collaborative mindset with a focus on fostering teamwork across production support, application, and resilience project teams
  • Proactive approach to identifying and mitigating risks to system reliability
  • Willingness to participate in on-call rotations for incident response and support
  • Ability to work across time zones to collaborate with global and regional SRE/DevOps teams
  • Strong understanding of banking systems (e.g., core banking, payment platforms) and of compliance requirements under local and global regulations

Additional Information:

Job Posted:
August 05, 2025

Expiration:
September 02, 2025

Employment Type:
Full-time
Work Type:
Hybrid work