Site Reliability Engineer Lead Job at Citi (New York)

Lead Site Reliability Engineer

As a Lead Site Reliability Engineer (SRE), you will ensure the stability, perfor...

Location

United States

Salary:

184000.00 - 229000.00 USD / Year

Corelight

Expiration Date

Until further notice

Requirements

8+ years of experience building and operating FedRAMP environments or similarly regulated systems
Expertise in AWS services (e.g., EC2, S3, RDS, Lambda, ECS/EKS, Glue, EMR, Redshift, OpenSearch, VPC)
Deep understanding of the FedRAMP framework, controls, and compliance requirements
Proficiency in programming languages such as Python, Go, or Java
Experience with big data technologies (Hadoop, Spark, Kafka)
Strong skills in Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible
Knowledge of containerization and orchestration tools like Docker and Kubernetes
Experience with CI/CD tools such as Jenkins, GitLab CI, or CircleCI
Proven track record in building and scaling platforms with high availability, resilience, and strict SLOs
Strong experience with Unix/Linux systems and cloud providers, ideally AWS

Job Responsibility

Collaborate with software engineering teams to ensure the reliability, performance, and security of the Federal region’s infrastructure
Design, implement, and manage FedRAMP-compliant infrastructure and systems
Establish continuous monitoring, logging, and auditing processes to ensure compliance with FedRAMP controls
Partner with security teams to conduct security assessments and implement necessary controls
Design and implement scalable infrastructure solutions that support multi-region growth
Drive automation efforts, enabling infrastructure and platforms to scale efficiently with a focus on compliance
Stay up-to-date on best practices, evolving security threats, and FedRAMP guidelines to maintain a strong security posture
Deploy and maintain cloud-native services in AWS that are resilient and elastic
Participate in 24x7 incident response and on-call rotations
Plan for capacity and work with teams to prepare for platform growth

What we offer

Equity and additional benefits will also be awarded

Fulltime

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...

Location

Ireland, Dublin

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Solid SRE process experience
5+ years of leading a high-performance, 24x7 DevOps or SysOps team
Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, Backup, Networking and Infrastructure
Experience with Microsoft Windows and Windows Server operating systems
Experience tracking tickets and resolving them on time
Hands-on experience with ticketing tools (ServiceNow)
Excellent verbal, written, presentation and interpersonal communication skills
Ability to make complex technical matters easy to comprehend for non-technical people.

Job Responsibility

Taking end-to-end ownership of application support and issue resolution for production systems
Implementing, monitoring, and maintaining CI/CD frameworks
Developing new capabilities, coordinating implementation across a large number of teams including infrastructure, developer tools and information security
Influencing a culture of Site Reliability Engineering. Engaging in training and mentoring to help other engineers develop an SRE mindset
Providing the first line of post-deployment technical support at the L1 and L2 levels for applications and/or associated production systems diagnostics, and network health monitoring
Coordinating and/or deploying hands-on fixes, patches, and software updates at the application level and, as appropriate, at the network level
Managing a team of technical support engineers who provide technical support to users
Escalating complex problems to the L3 level of expertise within the organization, along with observations from investigative and diagnostic assessments
Coordinating the investigation of recurring technical issues affecting user systems and seeing them through to resolution
Escalating and resolving production incidents, guiding the team, and tracking incidents to closure

What we offer

Competitive base salary (which is annually reviewed)
Hybrid working model (up to 2 days working from home per week)
Additional benefits to support you and your family to be well, live well and save well.

Fulltime

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
Minimum 2 years of experience managing or leading cloud operations teams
Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
Familiarity with modern CI/CD automation and tools
Excellent communication, stakeholder management, and team-building skills
Experience scaling SRE practices in high-growth or large-scale environments
Ability to balance long-term reliability initiatives with short-term delivery needs.

Job Responsibility

Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
Define and track key reliability metrics, and report on team performance and system health to leadership
Contribute to hiring, onboarding, and career development for SREs.

What we offer

Health & Wellbeing benefits for physical, financial, and emotional wellbeing
Personal & Professional Development programs
Unconditional inclusion in the workplace.

Fulltime

Site Reliability Engineer Application Development Technical Lead Analyst

The Applications Development Technology Lead Analyst is a senior level position ...

Location

Canada, Mississauga

Salary:

120800.00 - 170800.00 USD / Year

Citi

Expiration Date

Until further notice

Requirements

6+ years of relevant experience in an Apps Development or systems analysis role
5+ years of extensive experience in systems analysis and in programming software applications with Python and RHEL
5+ years of experience with site reliability and CI/CD pipelines
Previous experience with containerization and orchestration
Experience in managing and implementing successful projects
Subject Matter Expert (SME) in at least one area of Applications Development
Ability to adjust priorities quickly as circumstances dictate
Demonstrated leadership and project management skills
Consistently demonstrates clear and concise written and verbal communication
Bachelor's degree/University degree or equivalent experience

Job Responsibility

Partner with multiple management teams to ensure appropriate integration of functions to meet goals
Identify and define necessary system enhancements to deploy new products and process improvements
Resolve a variety of high-impact problems/projects through in-depth evaluation of complex business processes, system processes, and industry standards
Provide expertise in area and advanced knowledge of applications programming and ensure application design adheres to the overall architecture blueprint
Utilize advanced knowledge of system flow and develop standards for coding, testing, debugging, and implementation
Develop comprehensive knowledge of how areas of business integrate to accomplish business goals
Provide in-depth analysis with interpretive thinking to define issues and develop innovative solutions
Serve as advisor or coach to mid-level developers and analysts, allocating work as necessary
Appropriately assess risk when business decisions are made

Fulltime

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...

Location

India, Bangalore

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

10+ years in systems engineering
At least 5 years in SRE or DevOps roles
Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
Proficiency in programming and scripting languages like Python, Go, and Bash
Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
Deep understanding of networking, DNS, load balancing, and security principles
Proven track record of managing high-availability systems in demanding environments
Exceptional analytical and problem-solving skills

Job Responsibility

Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
Create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
Build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
Lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
Proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
Mentor junior engineers, fostering a collaborative and growth-oriented team environment
Guide architectural decisions that drive innovation and enhance system reliability

What we offer

The opportunity to work with cutting-edge technologies in a transformative environment
A collaborative and innovative work environment that values your expertise and contributions
Professional growth and leadership development pathways tailored to your aspirations
A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Senior Site Reliability Engineer

Affirm is reinventing credit to make it more honest and friendly, giving consume...

Location

Spain

Salary:

85000.00 - 115000.00 EUR / Year

Affirm

Expiration Date

Until further notice

Requirements

4+ years of experience designing, developing and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
A track record of developing highly available distributed systems using technologies like AWS, MySQL and Kubernetes
Meaningful experience contributing to or driving parts of the Incident Lifecycle process, enabling actionable insights that improve the quality culture, reliability, resilience, and system performance
4+ years working in a Site Reliability or Production Engineering team
Experience defining a technical plan for the delivery of a significant feature or system component with an elegant, simple and extensible design
Experience making impactful changes in a large code base, having developed a suite of tools and practices that enable you and your team to do so safely
Strong verbal and written communication skills that support effective collaboration with our global engineering team
On-call rotation: participation in an on-call rotation is a requirement for this role

Job Responsibility

You will be responsible for owning and delivering quarterly goals for your team, leading engineers on your team through ambiguity to solve open-ended problems, and ensuring that everyone is supported throughout delivery
You will support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience & analytics by participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs
You will proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis
You will support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” & on-call efforts
You will foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
You will help develop talent on your team by providing feedback and guidance, and leading by example

What we offer

Flexible Spending Wallets for tech, food and lifestyle
Away Days - wellness days to take off work and recharge
Learning & Development programs
Parental benefit
Employee Resource & Community Groups
Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount

Fulltime