Ensure the resilience of critical applications by adhering to enhanced testing and recovery standards, and proactively identifying and mitigating vulnerabilities. This role is crucial for maintaining the stability and operational resilience of Citi's critical business services, ensuring they remain within defined impact tolerances and minimizing client impact duration.
Job Responsibilities:
Implement Enhanced Testing and Recovery: Oversee the implementation and execution of Production Swing testing for critical applications, ensuring applications run from their alternate site for a minimum of 5 days
Implement and oversee Data Recovery testing, ensuring applications can recover critical data from backup solutions within the defined Impact Tolerance (ITOL)
Drive the onboarding of critical applications to the One-Touch Recovery orchestration solution
Minimize the Recovery Time Actual (TRTA) for critical applications
Design and Architecture: Champion resilient application design by advocating for and integrating resiliency principles into architectures, and promoting the use of established resiliency patterns
Leverage cloud-native services and features to enhance application resiliency. This includes services for auto-scaling, load balancing, and disaster recovery
Explore and implement chaos engineering practices to proactively identify and address system weaknesses under stress
Proactive Vulnerability Management: Proactively identify vulnerabilities through regular architecture reviews, comprehensive scenario testing, and foundational testing
Document and demonstrate mitigation efforts for all discovered vulnerabilities. This includes developing remediation plans, implementing necessary changes, and validating the effectiveness of mitigations
Ensure that all identified vulnerabilities have remediation plans scheduled
Operational Resilience Adherence: Ensure that all critical applications adhere to operational resilience testing and recovery requirements
Collaborate with relevant stakeholders to define and maintain appropriate impact tolerances for critical business services
Performance Measurement and Reporting: Monitor and report on key resilience metrics, including the number of applications executing Production Swing tests, the number of applications on One-Touch Recovery, recovery times, and adherence to operational resilience requirements
Provide regular updates to senior management on the status of resilience initiatives and key performance indicators
Requirements:
Relevant professional software engineering experience, particularly in SRE roles
Expertise in analyzing complex application, database, network, and OS issues across distributed, large-scale, customer-facing systems
Strong communication skills and the ability to work effectively across multiple business and technical teams