Head of Production Management Resiliency

Citi

Location:
United Kingdom, London

Category:
IT - Software Development

Contract Type:
Not provided

Salary:
Not provided

Job Description:

Ensure the resilience of critical applications by adhering to enhanced testing and recovery standards, and proactively identifying and mitigating vulnerabilities. This role is crucial for maintaining the stability and operational resilience of Citi's critical business services, ensuring they remain within defined impact tolerances and minimizing client impact duration.

Job Responsibility:

  • Implement Enhanced Testing and Recovery: Oversee the implementation and execution of Production Swing testing for critical applications, ensuring applications run from their alternate site for a minimum of 5 days
  • Implement and oversee Data Recovery testing, ensuring applications can recover critical data from backup solutions within the defined Impact Tolerance (ITOL)
  • Drive the onboarding of critical applications to the One-Touch Recovery orchestration solution
  • Minimize the Recovery Time Actual (TRTA) for critical applications
  • Design and Architecture: Champion resilient application design by advocating for and integrating resiliency principles into architectures, and promoting the use of established resiliency patterns
  • Leverage cloud-native services and features to enhance application resiliency. This includes services for auto-scaling, load balancing, and disaster recovery
  • Explore and implement chaos engineering practices to proactively identify and address system weaknesses under stress
  • Proactive Vulnerability Management: Proactively identify vulnerabilities through regular architecture reviews, comprehensive scenario testing, and foundational testing
  • Document and demonstrate mitigation efforts for all discovered vulnerabilities. This includes developing remediation plans, implementing necessary changes, and validating the effectiveness of mitigations
  • Ensure that all identified vulnerabilities have remediation plans scheduled
  • Operational Resilience Adherence: Ensure that all critical applications adhere to operational resilience testing and recovery requirements
  • Collaborate with relevant stakeholders to define and maintain appropriate impact tolerances for critical business services
  • Performance Measurement and Reporting: Monitor and report on key resilience metrics, including the number of applications executing Production Swing tests, the number of applications on One-Touch Recovery, recovery times, and adherence to operational resilience requirements
  • Provide regular updates to senior management on the status of resilience initiatives and key performance indicators

Requirements:

  • Relevant professional software engineering experience, particularly in SRE roles
  • Expertise in analyzing complex application, database, network, and OS issues across distributed, large-scale, customer-facing systems
  • Strong communication skills and the ability to work effectively across multiple business and technical teams
  • Experience in Java, .NET, Maven, Gradle, Jenkins, Helm, Puppet, Chef, Ansible, Kubernetes, AWS, Splunk, Prometheus
  • BS degree in computer science or equivalent field

What we offer:

  • 27 days annual leave (plus bank holidays)
  • A discretionary annual performance-related bonus
  • Private Medical Care & Life Insurance
  • Employee Assistance Program
  • Pension Plan
  • Paid Parental Leave
  • Special discounts for employees, family, and friends
  • Access to an array of learning and development resources

Additional Information:

Job Posted:
May 14, 2025

Employment Type:
Full-time
Work Type:
Hybrid work