CrawlJobs Logo

SRE Team Lead

venzotechnologies.com Logo

Venzo Technologies

Location Icon

Location:
India , Chennai

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

Not provided

Job Description:

We are looking for a hands-on SRE Team Lead to own the reliability, scalability, and operational excellence of a cloud-native fintech platform built on microservices. This role combines technical leadership, architecture ownership, and deep hands-on execution. You will lead a small SRE team while remaining actively involved in design, coding, incident response, and reliability engineering.

Job Responsibility:

  • Own platform availability, latency, scalability, and resilience across environments
  • Define and enforce SLOs, SLIs, error budgets, and operational KPIs
  • Design and review resilience patterns: circuit breakers, retries, rate limiting, graceful degradation
  • Drive chaos engineering, fault-injection, and disaster-recovery readiness
  • Actively contribute code (Java / Node) for reliability tooling
  • Platform automation
  • Observability integrations
  • Review microservice architecture with engineering teams to eliminate single points of failure
  • Own AWS architecture (VPCs, IAM, EKS, RDS, ALB/NLB, autoscaling)
  • Drive Kubernetes best practices (resource tuning, HPA, pod disruption budgets)
  • Improve CI/CD pipelines for reliability, speed, and safety
  • Lead production incident response, root cause analysis (RCA), and postmortems
  • Establish blameless postmortem culture
  • Reduce MTTR through automation and better observability
  • Participate in escalation/on-call strategy (not firefighting 24×7)
  • Mentor SRE DevOps and SRE Full-Stack engineers
  • Define operational standards, runbooks, and SRE practices
  • Work closely with product, security, and engineering leaders

Requirements:

  • 8+ years of experience in SRE / Platform / DevOps engineering
  • Strong hands-on experience with AWS (EKS, EC2, RDS, IAM, CloudWatch, ALB)
  • Kubernetes & Docker
  • Microservices architectures
  • Strong programming background in Java and/or Node.js
  • Deep understanding of distributed systems, production debugging, and capacity planning
  • Experience in fintech or regulated environments is a strong plus

Nice to have:

  • Experience with chaos engineering tools
  • Security & compliance exposure (PCI-DSS, SOC2, ISO)
  • Prior experience building or scaling SRE teams

Additional Information:

Job Posted:
February 20, 2026

Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for SRE Team Lead

Lead SRE

We are looking for a Lead SRE to join our Inetum Team and be part of a work cult...
Location
Location
Portugal , Lisbon
Salary
Salary:
Not provided
https://www.inetum.com Logo
Inetum
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • SRE IT production processes
  • Agile / DevOps Mindset Problem Solving
  • Scripting: Python, YML, Shell
  • Monitoring: Dynatrace, Nagios
  • Linux
  • Admin Network (DNS, Firewall, Switch)
  • DevOps stack: Git & Git Flow, Artifactory, Jenkins or Gitlab CI, Ansible Tower, Digital ai Release
  • Cloud: Kubernetes, Docker, Argo CD, ArgoCD, Vault, Helm
  • End-to-end IT organization and processes (from development to run / operate)
  • Technical Architecture
Job Responsibility
Job Responsibility
  • Train SREs and their managers on SRE practices
  • Co-construct the transformation strategy and the support plan by participating in workshops, brainstorming with the transformation team and producing training content
  • Coach and support
  • Fulltime
Read More
Arrow Right

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...
Location
Location
Ireland , Dublin
Salary
Salary:
Not provided
https://www.citi.com/ Logo
Citi
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Solid SRE process experience
  • 5+ years of Leading high-performance, 24x7, DevOps or SysOps team
  • Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, Backup, Networking and Infrastructure
  • Experience with Microsoft OS Windows & Server
  • Experience in ticket tracking and resolving on time
  • Hands-on experience on ticketing tools (ServiceNow)
  • Excellent verbal, written, presentation and interpersonal communication skills
  • Ability to make complex technical matters easy-to-comprehend for non-technical persons.
Job Responsibility
Job Responsibility
  • Taking end-to-end Ownership of Application Support for Production Systems Issues resolution
  • Implementing, monitoring, and maintaining CI/CD frameworks
  • Developing new capabilities, coordinating implementation across a large number of teams including infrastructure, developer tools and information security
  • Influencing a culture of Site Reliability Engineering. Engaging in training and mentoring to help develop other engineers with SRE mind set
  • Providing the first line of after-deployment technical support at L1 and L2 level for applications and and/or associated production systems diagnostics, and network health monitoring
  • Coordination and/or for deploying hands-on fixes, patches and software updates at the application level, and as appropriate at the network level
  • Managing a team of technical support engineers who provide technical support to users
  • Escalating complex problems to the L3 level of expertise within organization, along with observations from investigative and diagnostic assessments
  • Co-ordinating in the investigation of repeated technical issues affecting user system and seeing through to resolution
  • Escalating, resolving, guiding team, and tracking production incidents to closure
What we offer
What we offer
  • Competitive base salary (which is annually reviewed)
  • Hybrid working model (up to 2 days working at home per week)
  • Additional benefits to support you and your family to be well, live well and save well.
  • Fulltime
Read More
Arrow Right

SRE Networking Team Lead

Coralogix is a modern, full-stack observability platform transforming how busine...
Location
Location
Germany , Berlin
Salary
Salary:
Not provided
coralogix.com Logo
Coralogix
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience leading DevOps, SRE, or Platform teams
  • Strong background in Kubernetes and knowledgeable on Networking standards (CNI, ingress/egress, network policies)
  • Hands-on experience with service mesh technologies (Istio, Envoy, or similar)
  • Solid understanding of cloud networking (AWS: VPCs, PrivateLink, NAT, multi-AZ)
  • Experience operating systems through incidents, outages, and shifting priorities
  • Strong communication skills - able to explain complex networking topics clearly
Job Responsibility
Job Responsibility
  • Lead and develop a distributed team of Engineers
  • End-to-end reliability of the network layer (ingress, egress, service mesh, traffic flows)
  • Final accountability for network-related failure modes and incidents
  • Traffic-related SLOs and reliability outcomes
  • Identifying and addressing network-level cost anomalies
  • Defining and enforcing networking standards and guardrails
  • Fulltime
Read More
Arrow Right

SRE Networking Team Lead

Coralogix is a modern, full-stack observability platform transforming how busine...
Location
Location
Germany , Berlin
Salary
Salary:
Not provided
coralogix.com Logo
Coralogix
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience leading DevOps, SRE, or Platform teams
  • Strong background in Kubernetes and knowledgeable on Networking standards (CNI, ingress/egress, network policies)
  • Hands-on experience with service mesh technologies (Istio, Envoy, or similar)
  • Solid understanding of cloud networking (AWS: VPCs, PrivateLink, NAT, multi-AZ)
  • Experience operating systems through incidents, outages, and shifting priorities
  • Strong communication skills - able to explain complex networking topics clearly
Job Responsibility
Job Responsibility
  • Lead and develop a distributed team of Engineers
  • End-to-end reliability of the network layer (ingress, egress, service mesh, traffic flows)
  • Final accountability for network-related failure modes and incidents
  • Traffic-related SLOs and reliability outcomes
  • Identifying and addressing network-level cost anomalies
  • Defining and enforcing networking standards and guardrails
  • Fulltime
Read More
Arrow Right

Director, Service Reliability Engineering

As Director of SRE, you will lead the team responsible for accelerating and auto...
Location
Location
United States , Bethesda
Salary
Salary:
125600.00 - 203700.00 USD / Year
https://www.marriott.com Logo
Marriott Bonvoy
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Undergraduate degree in computer science, software engineering, or a related field (or equivalent experience)
  • 10+ years of experience in SRE, devsecops or IT operations
  • At least 5 years’ experience in a previous leadership role within SRE, devsecops or IT Operations
  • At least five years of experience in the following technologies - Presentation Management: HTML, CSS, JS, Backbone, Node JS, Android, iOS, Application Platforms: NGINX, Java, Akana, Play Framework, Tomcat, Docker, Openshift, Application Data: PostgreSQL, Couchbase, Cassandra, Integration Services: Apache Kafka, Apache Spark, Akana, Analytics Platforms: Hadoop, dashDB, Cognos, Tableau, Security: Forgerock, OpenID, OAUTH, Ping Identity, Public Cloud: Azure, Google Cloud, AliCloud, Amazon Web Services, CI/CD: Harness
  • Experience with test automation
  • Working knowledge and proven track record of implementing disaster indifferent architecture
  • Experience with CDN and Akamai tools
  • Linux/Unix system administration experience
  • Proficient in scripting and programming languages (like Python, Go, Bash, Shell)
  • Hands on experience with infrastructure as code (like Terraform), container orchestration (like Kubernetes), and reliability automation
Job Responsibility
Job Responsibility
  • Define and execute Marriott’s SRE vision, aligning with business objectives and technology roadmaps
  • Build, mentor and lead a high-performing SRE team, fostering a culture of collaboration and innovation
  • Establish reliability, observability and automation goals to improve system uptime, performance and scalability
  • Partner with engineering, operations and security teams to drive best practices and continuous improvement
  • Implement reliability-focused engineering practices, including SLAs, SLOs/SLIs and error budgets
  • Design and maintain resilient, scalable and fault-tolerant architectures across cloud and hybrid environments
  • Develop strategies to proactively identify and mitigate risks to system performance and availability
  • Drive root cause analysis (RCA) and post-mortem processes to prevent recurring incidents
  • Champion automation in monitoring, deployment and incident resolution to reduce toil and enhance efficiency
  • Lead and optimize incident response processes, ensuring rapid detection, diagnosis, and resolution of system failures
What we offer
What we offer
  • Bonus program
  • comprehensive health care benefits
  • 401(k) plan with up to 5% company match
  • employee stock purchase plan at 15% discount
  • accrued paid time off (including sick leave where applicable)
  • life insurance
  • group disability insurance
  • travel discounts
  • adoption assistance
  • paid parental leave
  • Fulltime
Read More
Arrow Right

Lead SRE

We have a 6 month contract to hire for a senior, hands-on Site Reliability Engin...
Location
Location
United States , St Louis
Salary
Salary:
Not provided
zeektek.com Logo
Zeektek
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree
  • AWS Certified DevOps Engineer – Professional
  • Dynatrace Professional
  • One SaaS tool certifications (Prometheus Certified Associate (PCA), Datadog, New Relic)
  • 7+ years in SRE/Production Engineering/Platform roles
  • 2+ years leading initiatives or teams
  • Strong in Linux, networking fundamentals (HTTP, TLS, DNS, TCP), and distributed systems concepts
  • Proficiency with Go, Python, Shell Scripting, SQL, Java or JVM, JavaScript/TypeScript, YAML/HCL/JSON
  • Hands-on with IaC (Terraform) and CI/CD (GitLab CI, GitHub Actions, AWS/Azure DevOps)
  • Deep experience in AWS Cloud infrastructure
Job Responsibility
Job Responsibility
  • Lead SRE to drive reliability, scalability, observability (monitoring & alerts) and performance across the production platforms
  • Own the SLO/SLI strategy, modernize observability and incident response, and partner with application teams to deliver resilient systems
  • Define and govern SLOs/SLIs/Error Budgets for critical services
  • enforce guardrails and drive reliability roadmaps
  • Lead performance tuning collaboration with application teams to ensure high availability and low latency
  • Define and own infrastructure tuning to ensure scalability leading to high availability
  • Lead Metrics and automation driven Reliability
  • Dedug systems across layers
  • Architect and evolve CI/CD, infrastructure-as-code (IaC- Terraform)
  • Design and build serverless APIs (Lambda, API Gateway, SQS, SNS, DynamoDB, etc.)
What we offer
What we offer
  • Weekly Direct Deposit
  • 401K Matching
  • Competitive medical, dental and vision insurance
  • Consistent communication throughout your project
  • ZeekTek Referral Program
  • Fulltime
Read More
Arrow Right

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Location
Location
India , Bangalore
Salary
Salary:
Not provided
groupon.com Logo
Groupon
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years in systems engineering
  • at least 5+ years in SRE or DevOps roles
  • expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • proficiency in programming and scripting languages like Python, Go, and Bash
  • advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • deep understanding of networking, DNS, load balancing, and security principles
  • proven track record of managing high-availability systems in demanding environments
  • exceptional analytical and problem-solving skills
Job Responsibility
Job Responsibility
  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • guide architectural decisions that drive innovation and enhance system reliability
What we offer
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • a collaborative and innovative work values alignment that values your expertise and contributions
  • professional growth and leadership development pathways tailored to your aspirations
  • a chance to leave a lasting impact by shaping the future of reliable and scalable systems
Read More
Arrow Right

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...
Location
Location
Mexico , Miguel Hidalgo
Salary
Salary:
Not provided
pepsico.com Logo
Pepsico
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years of work experience evolving to a SRE engineer
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
  • Highly quantitative, have great judgment, able to connect dots across ecosytems, and efficiently work cross-functionally across teams
  • A strong expertise of SRE (Software Reliability Engineering) and IT Service Management (ITSM) processes
  • Hands on experience in Python, SQL /No-SQl( MySQL, Mongo DB, Cassandra, Postgress), AppDynamics, ELK Stack Grafana, Splunk, Dynatrace, Kafka and any SRE Ops toolsets
  • A firm understanding of cloud archticture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
  • Back-end technologies: Server-side languages (Java, Spring Boot, and related technologies that build the server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase)
Job Responsibility
Job Responsibility
  • Drive new shift left activities critical to apply Site Reliability Engineering (SRE) and quality assurance principles within the application design / Project roadmap that enablees resilient outcomes
  • Apply pre-emptive approach into production minimizing business impact, via SRE-driven orchestration of connecting all components of the ecosystem diagnosing anomalies prior to user & remediating through automation
  • Ensure ecosystem availability and performance in production environments, Pro-actively preventing P1, P2, potential P3s
  • Engage & influence product and engineering teams during the design and development phases to embed reliability and operability into new services defining & enforce events, logging, monitoring, and observability standards across applications
  • Accountable to institute non-functional requirements (NFRs) are embedded early including SLA/SLO/SLI and error budgets into the product’s offerings as part of the engineering solution
  • Leads the team diagnosing any anomalies prior to any user and driving the necessary remediations across the teams involved in end-to-end ecosystem availability, performance and consumption of the cloud architected application ecosystem leveraging SRE Orchestration solutions
  • Collaborates with Engineering & support teams, including participation in escalations, and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose & improve the end-2-end ecosystem performance of the Modern architected application portfolio i.e. technical “understanding of interactions" of a full stack application alongside with peer SRE team member
  • Continuously optimize the L2/support operations work via SRE workflow automation
What we offer
What we offer
  • Opportunities to learn and develop every day through a wide range of programs
  • Internal digital platforms that promote self-learning
  • Development programs according to Leadership skills
  • Specialized training according to the role
  • Learning experiences with internal and external providers
  • Recognition programs for seniority, behavior, leadership, moments of life, among others
  • Financial wellness programs that will help you reach your goals in all stages of life
  • A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
  • Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life
Read More
Arrow Right