CrawlJobs Logo

SRE Team Lead

venzotechnologies.com Logo

Venzo Technologies

Location Icon

Location:
India , Chennai

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

Not provided

Job Description:

We are looking for a hands-on SRE Team Lead to own the reliability, scalability, and operational excellence of a cloud-native fintech platform built on microservices. This role combines technical leadership, architecture ownership, and deep hands-on execution. You will lead a small SRE team while remaining actively involved in design, coding, incident response, and reliability engineering.

Job Responsibility:

  • Own platform availability, latency, scalability, and resilience across environments
  • Define and enforce SLOs, SLIs, error budgets, and operational KPIs
  • Design and review resilience patterns: circuit breakers, retries, rate limiting, graceful degradation
  • Drive chaos engineering, fault-injection, and disaster-recovery readiness
  • Actively contribute code (Java / Node) for reliability tooling
  • Platform automation
  • Observability integrations
  • Review microservice architecture with engineering teams to eliminate single points of failure
  • Own AWS architecture (VPCs, IAM, EKS, RDS, ALB/NLB, autoscaling)
  • Drive Kubernetes best practices (resource tuning, HPA, pod disruption budgets)
  • Improve CI/CD pipelines for reliability, speed, and safety
  • Lead production incident response, root cause analysis (RCA), and postmortems
  • Establish blameless postmortem culture
  • Reduce MTTR through automation and better observability
  • Participate in escalation/on-call strategy (not firefighting 24×7)
  • Mentor SRE DevOps and SRE Full-Stack engineers
  • Define operational standards, runbooks, and SRE practices
  • Work closely with product, security, and engineering leaders

Requirements:

  • 8+ years of experience in SRE / Platform / DevOps engineering
  • Strong hands-on experience with AWS (EKS, EC2, RDS, IAM, CloudWatch, ALB)
  • Kubernetes & Docker
  • Microservices architectures
  • Strong programming background in Java and/or Node.js
  • Deep understanding of distributed systems, production debugging, and capacity planning
  • Experience in fintech or regulated environments is a strong plus

Nice to have:

  • Experience with chaos engineering tools
  • Security & compliance exposure (PCI-DSS, SOC2, ISO)
  • Prior experience building or scaling SRE teams

Additional Information:

Job Posted:
February 20, 2026

Job Link Share:
PREMIUM
More languages and countries
+ Unlock 31698 hidden job offers
Languages
English Čeština Deutsch Ελληνικά Español Français +15
Countries
United States United Kingdom India Canada Australia +
See plans
Plans from $2.99 / month

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for SRE Team Lead

Lead SRE

We are looking for a Lead SRE to join our Inetum Team and be part of a work cult...
Location
Location
Portugal , Lisbon
Salary
Salary:
Not provided
https://www.inetum.com Logo
Inetum
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • SRE IT production processes
  • Agile / DevOps Mindset Problem Solving
  • Scripting: Python, YML, Shell
  • Monitoring: Dynatrace, Nagios
  • Linux
  • Admin Network (DNS, Firewall, Switch)
  • DevOps stack: Git & Git Flow, Artifactory, Jenkins or Gitlab CI, Ansible Tower, Digital ai Release
  • Cloud: Kubernetes, Docker, Argo CD, ArgoCD, Vault, Helm
  • End-to-end IT organization and processes (from development to run / operate)
  • Technical Architecture
Job Responsibility
Job Responsibility
  • Train SREs and their managers on SRE practices
  • Co-construct the transformation strategy and the support plan by participating in workshops, brainstorming with the transformation team and producing training content
  • Coach and support
  • Fulltime
Read More
Arrow Right

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...
Location
Location
Ireland , Dublin
Salary
Salary:
Not provided
https://www.citi.com/ Logo
Citi
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Solid SRE process experience
  • 5+ years of Leading high-performance, 24x7, DevOps or SysOps team
  • Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, Backup, Networking and Infrastructure
  • Experience with Microsoft OS Windows & Server
  • Experience in ticket tracking and resolving on time
  • Hands-on experience on ticketing tools (ServiceNow)
  • Excellent verbal, written, presentation and interpersonal communication skills
  • Ability to make complex technical matters easy-to-comprehend for non-technical persons.
Job Responsibility
Job Responsibility
  • Taking end-to-end Ownership of Application Support for Production Systems Issues resolution
  • Implementing, monitoring, and maintaining CI/CD frameworks
  • Developing new capabilities, coordinating implementation across a large number of teams including infrastructure, developer tools and information security
  • Influencing a culture of Site Reliability Engineering. Engaging in training and mentoring to help develop other engineers with SRE mind set
  • Providing the first line of after-deployment technical support at L1 and L2 level for applications and and/or associated production systems diagnostics, and network health monitoring
  • Coordination and/or for deploying hands-on fixes, patches and software updates at the application level, and as appropriate at the network level
  • Managing a team of technical support engineers who provide technical support to users
  • Escalating complex problems to the L3 level of expertise within organization, along with observations from investigative and diagnostic assessments
  • Co-ordinating in the investigation of repeated technical issues affecting user system and seeing through to resolution
  • Escalating, resolving, guiding team, and tracking production incidents to closure
What we offer
What we offer
  • Competitive base salary (which is annually reviewed)
  • Hybrid working model (up to 2 days working at home per week)
  • Additional benefits to support you and your family to be well, live well and save well.
  • Fulltime
Read More
Arrow Right

SRE Networking Team Lead

Coralogix is a modern, full-stack observability platform transforming how busine...
Location
Location
Germany , Berlin
Salary
Salary:
Not provided
coralogix.com Logo
Coralogix
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience leading DevOps, SRE, or Platform teams
  • Strong background in Kubernetes and knowledgeable on Networking standards (CNI, ingress/egress, network policies)
  • Hands-on experience with service mesh technologies (Istio, Envoy, or similar)
  • Solid understanding of cloud networking (AWS: VPCs, PrivateLink, NAT, multi-AZ)
  • Experience operating systems through incidents, outages, and shifting priorities
  • Strong communication skills - able to explain complex networking topics clearly
Job Responsibility
Job Responsibility
  • Lead and develop a distributed team of Engineers
  • End-to-end reliability of the network layer (ingress, egress, service mesh, traffic flows)
  • Final accountability for network-related failure modes and incidents
  • Traffic-related SLOs and reliability outcomes
  • Identifying and addressing network-level cost anomalies
  • Defining and enforcing networking standards and guardrails
  • Fulltime
Read More
Arrow Right

SRE Networking Team Lead

Coralogix is a modern, full-stack observability platform transforming how busine...
Location
Location
Germany , Berlin
Salary
Salary:
Not provided
coralogix.com Logo
Coralogix
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience leading DevOps, SRE, or Platform teams
  • Strong background in Kubernetes and knowledgeable on Networking standards (CNI, ingress/egress, network policies)
  • Hands-on experience with service mesh technologies (Istio, Envoy, or similar)
  • Solid understanding of cloud networking (AWS: VPCs, PrivateLink, NAT, multi-AZ)
  • Experience operating systems through incidents, outages, and shifting priorities
  • Strong communication skills - able to explain complex networking topics clearly
Job Responsibility
Job Responsibility
  • Lead and develop a distributed team of Engineers
  • End-to-end reliability of the network layer (ingress, egress, service mesh, traffic flows)
  • Final accountability for network-related failure modes and incidents
  • Traffic-related SLOs and reliability outcomes
  • Identifying and addressing network-level cost anomalies
  • Defining and enforcing networking standards and guardrails
  • Fulltime
Read More
Arrow Right

Director, Service Reliability Engineering

As Director of SRE, you will lead the team responsible for accelerating and auto...
Location
Location
United States , Bethesda
Salary
Salary:
125600.00 - 203700.00 USD / Year
https://www.marriott.com Logo
Marriott Bonvoy
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Undergraduate degree in computer science, software engineering, or a related field (or equivalent experience)
  • 10+ years of experience in SRE, devsecops or IT operations
  • At least 5 years’ experience in a previous leadership role within SRE, devsecops or IT Operations
  • At least five years of experience in the following technologies - Presentation Management: HTML, CSS, JS, Backbone, Node JS, Android, iOS, Application Platforms: NGINX, Java, Akana, Play Framework, Tomcat, Docker, Openshift, Application Data: PostgreSQL, Couchbase, Cassandra, Integration Services: Apache Kafka, Apache Spark, Akana, Analytics Platforms: Hadoop, dashDB, Cognos, Tableau, Security: Forgerock, OpenID, OAUTH, Ping Identity, Public Cloud: Azure, Google Cloud, AliCloud, Amazon Web Services, CI/CD: Harness
  • Experience with test automation
  • Working knowledge and proven track record of implementing disaster indifferent architecture
  • Experience with CDN and Akamai tools
  • Linux/Unix system administration experience
  • Proficient in scripting and programming languages (like Python, Go, Bash, Shell)
  • Hands on experience with infrastructure as code (like Terraform), container orchestration (like Kubernetes), and reliability automation
Job Responsibility
Job Responsibility
  • Define and execute Marriott’s SRE vision, aligning with business objectives and technology roadmaps
  • Build, mentor and lead a high-performing SRE team, fostering a culture of collaboration and innovation
  • Establish reliability, observability and automation goals to improve system uptime, performance and scalability
  • Partner with engineering, operations and security teams to drive best practices and continuous improvement
  • Implement reliability-focused engineering practices, including SLAs, SLOs/SLIs and error budgets
  • Design and maintain resilient, scalable and fault-tolerant architectures across cloud and hybrid environments
  • Develop strategies to proactively identify and mitigate risks to system performance and availability
  • Drive root cause analysis (RCA) and post-mortem processes to prevent recurring incidents
  • Champion automation in monitoring, deployment and incident resolution to reduce toil and enhance efficiency
  • Lead and optimize incident response processes, ensuring rapid detection, diagnosis, and resolution of system failures
What we offer
What we offer
  • Bonus program
  • comprehensive health care benefits
  • 401(k) plan with up to 5% company match
  • employee stock purchase plan at 15% discount
  • accrued paid time off (including sick leave where applicable)
  • life insurance
  • group disability insurance
  • travel discounts
  • adoption assistance
  • paid parental leave
  • Fulltime
Read More
Arrow Right

Lead SRE

We have a 6 month contract to hire for a senior, hands-on Site Reliability Engin...
Location
Location
United States , St Louis
Salary
Salary:
Not provided
zeektek.com Logo
Zeektek
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree
  • AWS Certified DevOps Engineer – Professional
  • Dynatrace Professional
  • One SaaS tool certifications (Prometheus Certified Associate (PCA), Datadog, New Relic)
  • 7+ years in SRE/Production Engineering/Platform roles
  • 2+ years leading initiatives or teams
  • Strong in Linux, networking fundamentals (HTTP, TLS, DNS, TCP), and distributed systems concepts
  • Proficiency with Go, Python, Shell Scripting, SQL, Java or JVM, JavaScript/TypeScript, YAML/HCL/JSON
  • Hands-on with IaC (Terraform) and CI/CD (GitLab CI, GitHub Actions, AWS/Azure DevOps)
  • Deep experience in AWS Cloud infrastructure
Job Responsibility
Job Responsibility
  • Lead SRE to drive reliability, scalability, observability (monitoring & alerts) and performance across the production platforms
  • Own the SLO/SLI strategy, modernize observability and incident response, and partner with application teams to deliver resilient systems
  • Define and govern SLOs/SLIs/Error Budgets for critical services
  • enforce guardrails and drive reliability roadmaps
  • Lead performance tuning collaboration with application teams to ensure high availability and low latency
  • Define and own infrastructure tuning to ensure scalability leading to high availability
  • Lead Metrics and automation driven Reliability
  • Dedug systems across layers
  • Architect and evolve CI/CD, infrastructure-as-code (IaC- Terraform)
  • Design and build serverless APIs (Lambda, API Gateway, SQS, SNS, DynamoDB, etc.)
What we offer
What we offer
  • Weekly Direct Deposit
  • 401K Matching
  • Competitive medical, dental and vision insurance
  • Consistent communication throughout your project
  • ZeekTek Referral Program
  • Fulltime
Read More
Arrow Right

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Location
Location
India , Bangalore
Salary
Salary:
Not provided
groupon.com Logo
Groupon
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years in systems engineering
  • at least 5+ years in SRE or DevOps roles
  • expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • proficiency in programming and scripting languages like Python, Go, and Bash
  • advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • deep understanding of networking, DNS, load balancing, and security principles
  • proven track record of managing high-availability systems in demanding environments
  • exceptional analytical and problem-solving skills
Job Responsibility
Job Responsibility
  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • guide architectural decisions that drive innovation and enhance system reliability
What we offer
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • a collaborative and innovative work values alignment that values your expertise and contributions
  • professional growth and leadership development pathways tailored to your aspirations
  • a chance to leave a lasting impact by shaping the future of reliable and scalable systems
Read More
Arrow Right

Senior SRE Manager

We are in search of a Senior Site Reliability Engineering Manager to build and l...
Location
Location
New Zealand , Auckland
Salary
Salary:
Not provided
aiven.io Logo
Aiven Deutschland GmbH
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience managing or leading software development or operations team and defining, setting measurable success criteria for engineering roles with a focus on hiring, training and leading a team of engineers
  • Prior experience working as an engineer at a software company, enabling you to contribute to technical discussions at a senior level
  • Previous knowledge working with other internal teams to successfully resolve complex cases with excellent communication skills
Job Responsibility
Job Responsibility
  • Coordinate and develop global operations work together with our EMEA and NA based SRE Managers and their respective teams
  • Lead a team of ANZ-based SREs
  • Enable the team to drive important software and process initiatives for the Aiven platform plan on-call shift rotations, ensuring adequate team availability in Aiven's global follow-the-sun operation
  • Run a tight, efficient operation where decision-making is based on metrics and data and work is prioritized accordingly
What we offer
What we offer
  • Participate in Aiven's equity plan
  • Balance work and life with our hybrid work policy
  • Choose the equipment you need to set yourself up for success
  • Use your Professional Development Plan budget for learning opportunities
  • Receive holistic wellbeing support through our global Employee Assistance Program
  • Inquire about our Global Time Off Commitment (Parental and Sick Leave, as well as Personal Time)
  • Enjoy country-specific benefits for our global cast
  • Fulltime
Read More
Arrow Right