Engineer the future of global finance. At Citi, our Tech team doesn’t just support finance – we are helping to redefine it. Every day, $5 trillion crosses through our network. We do business in 180+ countries, operating at a scale few can match. From deploying advanced AI to helping shape global markets, we build systems that matter. Look to join a team where your work helps influence economies, your ideas can drive innovation and outcomes, and your growth is backed by mentorship, continuous learning, and flexibility with potential hybrid work opportunities. Help solve real-world challenges that touch millions and get the opportunity to build the future of finance with Citi Tech.
We are seeking an experienced and motivated leader for our AI and DevOps Platform Support team in North America. This role is responsible for ensuring the stability, reliability, and performance of our critical AI and DevOps platforms. The team supports a wide range of services, including multiple AI applications, developer tools, and CI/CD pipeline technologies used by teams across the organization.
The ideal candidate will lead a team of SRE and Support engineers, manage incident and problem resolution, and collaborate with engineering and development teams to improve platform services and supportability. The role also involves short- to medium‑term planning of actions and resources for the area.
Job Responsibilities:
Demonstrate an in-depth understanding of how application support integrates within the overall technology function to achieve objectives; this requires a good understanding of the industry
Manage vendor relationships, including oversight of all offshore managed services
Improve the service level the team provides to our end users, including maximizing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
Guide development teams on application stability and supportability improvements
Formulate and implement a framework for managing capacity, throughput, and latency
Define and implement application onboarding guidelines and standards
Work with various team members, coaching them on how to maximize their potential, work better in a highly integrated team environment, and focus on bringing out their strengths
Drive continued cost reductions and efficiencies across the portfolios supported through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
Participate in business review meetings, relating technology tools and strategies to business requirements
Ensure adherence to all support processes and tool standards, and work with management to create new processes and/or enhance existing ones to ensure consistent, high-quality best practices across the overall support program
Perform other duties and functions as assigned
Act as the primary point of contact for platform matters, defining the vision and roadmap in partnership with engineering leaders and business stakeholders
Champion the platform's resilience strategy by planning and executing wargaming scenarios, chaos engineering tests, and disaster recovery drills
Drive a comprehensive automation strategy to reduce manual toil, improve deployment velocity, and identify opportunities to leverage AI for operational intelligence
Define and drive the enterprise-wide observability strategy, ensuring the team has the tools and insights needed to guarantee platform health, performance, and cost‑effectiveness. This includes overseeing monitoring, logging, tracing, and alerting
Remain hands‑on and maintain a deep technical understanding of the platform architecture and services
Oversee the operational health of all production platforms (including OpenShift, ECS, CI/CD), ensuring SLAs are met and a robust incident management process is in place
Implement and manage comprehensive monitoring and observability strategies to ensure proactive issue detection, performance analysis, and system health checks across all supported platforms
Requirements:
10 years of relevant experience in a hands‑on technical leadership role
Experience leading architecture decision‑making for platform services, ensuring alignment with enterprise standards, long‑term scalability, and operational resilience
Experience with senior stakeholder management
Project management experience with demonstrable results in improving IT services
Exceptional communication and presentation skills, with the ability to articulate a technical vision and report on key metrics to senior leadership
A strong track record of developing and executing a strategic roadmap for a technical platform, balancing new features with a dedicated “book of work” for stability
Demonstrable experience leading resilience initiatives such as wargaming, disaster recovery planning, and incident response simulations
Ability to effectively share information with other support team members and other technology teams
Ability to plan and organize workload
Clear and concise written and verbal communication skills, demonstrated consistently
Ability to communicate appropriately with relevant stakeholders
Bachelor’s/University degree
Master’s degree preferred
Nice to have:
Working knowledge of Generative AI with LLMs
Experience with CI/CD and configuration management
Experience with Red Hat OpenShift or similar Kubernetes technologies
Experience working with databases such as Postgres, Oracle, MongoDB, and Redis
Experience writing code in Java, Python, Go, or similar, and a desire to build on these skills
Hands‑on experience with modern observability and monitoring tools (e.g., Prometheus, Grafana, Splunk, ELK)