Site Reliability Engineer, Citi

Citi

Location:
India, Pune

Category:
IT - Software Development

Contract Type:
Not provided

Salary:
Not provided

Job Description:

The SRE Observability Specialist is a hands-on expert responsible for delivering the future of observability across Services Technology. The role is part of a central SRE enablement team within Services Production and works closely with SREs, developers, and platform teams to embed telemetry, implement SLOs, and build meaningful visualizations for key production flows, particularly in the critical Payments business.

Job Responsibilities:

Deliver against the observability roadmap for Services Technology by building scalable, reusable telemetry solutions
Create and maintain dashboards and visualizations for critical client journeys, including real-time flows across Payments
Guide line-of-business teams in implementing SLIs/SLOs, golden signals, and effective alerting to support operational excellence
Support integration and adoption of observability tooling across on-prem, public cloud (AWS/GCP), and containerized environments (ECS, Kubernetes)
Customize shared dashboards and observability components in partnership with CTI and other central Engineering functions, ensuring usability and flexibility
Provide technical support and implementation guidance to SREs and developers facing integration or tooling challenges
Manage the observability book of work for Services Technology and drive initiatives to reduce mean time to detect (MTTD) and improve recovery outcomes
Serve as a key connection point between line-of-business SREs and central infrastructure functions by gathering tooling feedback, surfacing systemic issues, and influencing platform enhancements via the Services Observability Forum
Stay current with observability trends, including AI/ML-driven insights, anomaly detection, and emerging OSS practices, and assess their applicability
Maintain strong knowledge of observability platform features and vendor offerings to advise teams and maximize the value of tooling investments

Requirements:

10+ years of experience in SRE, Observability Engineering, or platform infrastructure roles focused on operational telemetry
Hands-on experience with observability tools and stacks such as Grafana, Prometheus, OpenTelemetry, ELK, Splunk, and similar platforms
Deep understanding of SLIs, SLOs, Error Budgets, and telemetry best practices in high-availability environments
Proven ability to troubleshoot integration issues and support observability across hybrid platforms (on-prem, cloud, containers)
Experience building dashboards aligned to business outcomes and incident workflows, especially in critical flows like payments
Familiarity with modern observability tooling ecosystems, including AI/ML capabilities, trace correlation, baselining, and alert tuning
Strong interpersonal and collaboration skills, with the ability to operate across federated engineering teams and central infrastructure groups
Experience in enablement or platform teams with a track record of scaling best practices across diverse business units
Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience

Additional Information:

Job Posted:
May 03, 2025

Employment Type:
Full-time

Work Type:
On-site work
