The SRE Observability Specialist is a hands-on expert who delivers the future of Observability across Services Technology. This role is part of a central SRE enablement team within Services Production, working closely with SREs, developers, and platform teams to embed telemetry, implement SLOs, and build meaningful visualizations for key production flows, particularly in the critical Payments business. The ideal candidate will have deep technical knowledge, a collaborative mindset, and the ability to translate strategy into scalable engineering outcomes. You will also act as a bridge between Services Technology teams and central infrastructure/CTO teams, prioritising observability needs from line-of-business teams and driving improvements. A strong understanding of observability tooling, evolving AI/ML capabilities, and enterprise tooling ecosystems will be essential. The role provides technology support for a function called Project Orion, which delivers end-to-end payment monitoring: building an end-to-end payments dashboard, reducing toil, and transforming legacy monitoring into an observability-based monitoring solution. This requires a good understanding of the payments taxonomy (ACH, Wires, Instant Payments, etc.). Strong commercial awareness, technical credibility, and excellent communication skills are essential to negotiate internally, influence peers, and drive change. Some external communication may be necessary.
Job Responsibilities
Define the roadmap of engineering enablers for the Project Orion team, aligned with enterprise reliability and SRE Services organization goals
Translate organization strategy into an actionable delivery plan in partnership with the Services Products, Operations, and Engineering functions, delivering incremental, high-value milestones
Understand the functional scope of Critical Business Services and translate it into end-to-end monitoring solutions
Deliver against the observability roadmap for Services Technology by building scalable, reusable telemetry solutions
Periodically review and analyze application monitoring toil, and collaborate with stakeholders to remediate it in line with organization goals
Create and maintain dashboards and visualizations for critical client journeys, including real-time flows across Payments
Guide line-of-business teams in implementing SLIs/SLOs, golden signals, and effective alerting to support operational excellence
Support integration and adoption of observability tooling across on-prem, public cloud (AWS/GCP), and containerized environments (ECS, Kubernetes)
Customize shared dashboards and observability components in partnership with CTI and other central Engineering functions, ensuring usability and flexibility
Provide technical support and implementation guidance to SREs and developers facing integration or tooling challenges
Effectively manage the observability book of work for Services Technology and drive initiatives to reduce MTTD and improve recovery outcomes
Serve as a key connection point between line-of-business SREs and central infrastructure functions by gathering tooling feedback, surfacing systemic issues, and influencing platform enhancements via the Services Observability Forum
Stay current with observability trends, including AI/ML-driven insights, anomaly detection, and emerging OSS practices, and assess their applicability
Maintain strong knowledge of observability platform features and vendor offerings to advise teams and maximize the value of tooling investments
Foster AI adoption by building use cases for tasks performed by Orion L1 functions, and for remediation, using the Citi AI tech stack
Requirements
7+ years of experience in SRE, Observability Engineering, or platform infrastructure roles focused on operational telemetry
Hands-on experience with observability tools and stacks such as Grafana, Prometheus, OpenTelemetry, ELK, Splunk, and similar platforms
Deep understanding of SLIs, SLOs, Error Budgets, and telemetry best practices in high-availability environments
Proven ability to troubleshoot integration issues and support observability across hybrid platforms (on-prem, cloud, containers)
Experience building dashboards aligned to business outcomes and incident workflows, especially in critical flows like payments
Familiarity with modern observability tooling ecosystems, including AI/ML capabilities, trace correlation, baselining, and alert tuning
Strong interpersonal and collaboration skills
Experience in enablement or platform teams with a track record of scaling best practices across diverse business units
Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience