SRE Team Lead

Venzo Technologies

Location:
India, Chennai

Contract Type:
Not provided

Salary:

Not provided

Job Description:

We are looking for a hands-on SRE Team Lead to own the reliability, scalability, and operational excellence of a cloud-native fintech platform built on microservices. This role combines technical leadership, architecture ownership, and deep hands-on execution. You will lead a small SRE team while remaining actively involved in design, coding, incident response, and reliability engineering.

Job Responsibility:

  • Own platform availability, latency, scalability, and resilience across environments
  • Define and enforce SLOs, SLIs, error budgets, and operational KPIs
  • Design and review resilience patterns: circuit breakers, retries, rate limiting, graceful degradation
  • Drive chaos engineering, fault-injection, and disaster-recovery readiness
  • Actively contribute code (Java / Node) for reliability tooling, platform automation, and observability integrations
  • Review microservice architecture with engineering teams to eliminate single points of failure
  • Own AWS architecture (VPCs, IAM, EKS, RDS, ALB/NLB, autoscaling)
  • Drive Kubernetes best practices (resource tuning, HPA, pod disruption budgets)
  • Improve CI/CD pipelines for reliability, speed, and safety
  • Lead production incident response, root cause analysis (RCA), and postmortems
  • Establish blameless postmortem culture
  • Reduce MTTR through automation and better observability
  • Participate in escalation/on-call strategy (not firefighting 24×7)
  • Mentor SRE DevOps and SRE Full-Stack engineers
  • Define operational standards, runbooks, and SRE practices
  • Work closely with product, security, and engineering leaders

Requirements:

  • 8+ years of experience in SRE / Platform / DevOps engineering
  • Strong hands-on experience with AWS (EKS, EC2, RDS, IAM, CloudWatch, ALB), Kubernetes and Docker, and microservices architectures
  • Strong programming background in Java and/or Node.js
  • Deep understanding of distributed systems, production debugging, and capacity planning
  • Experience in fintech or regulated environments is a strong plus

Nice to have:

  • Experience with chaos engineering tools
  • Security & compliance exposure (PCI-DSS, SOC2, ISO)
  • Prior experience building or scaling SRE teams

Additional Information:

Job Posted:
February 20, 2026

Similar Jobs for SRE Team Lead

Lead SRE

We are looking for a Lead SRE to join our Inetum Team and be part of a work cult...
Location:
Portugal, Lisbon
Salary:
Not provided
Inetum
Expiration Date:
Until further notice
Requirements:
  • SRE IT production processes
  • Agile / DevOps Mindset Problem Solving
  • Scripting: Python, YML, Shell
  • Monitoring: Dynatrace, Nagios
  • Linux
  • Admin Network (DNS, Firewall, Switch)
  • DevOps stack: Git & Git Flow, Artifactory, Jenkins or Gitlab CI, Ansible Tower, Digital ai Release
  • Cloud: Kubernetes, Docker, Argo CD, Vault, Helm
  • End-to-end IT organization and processes (from development to run / operate)
  • Technical Architecture
Job Responsibility:
  • Train SREs and their managers on SRE practices
  • Co-construct the transformation strategy and the support plan by participating in workshops, brainstorming with the transformation team and producing training content
  • Coach and support
  • Fulltime

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...
Location:
Ireland, Dublin
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • Solid SRE process experience
  • 5+ years leading a high-performance 24x7 DevOps or SysOps team
  • Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, Backup, Networking and Infrastructure
  • Experience with Microsoft OS Windows & Server
  • Experience in ticket tracking and resolving on time
  • Hands-on experience on ticketing tools (ServiceNow)
  • Excellent verbal, written, presentation and interpersonal communication skills
  • Ability to make complex technical matters easy-to-comprehend for non-technical persons.
Job Responsibility:
  • Taking end-to-end Ownership of Application Support for Production Systems Issues resolution
  • Implementing, monitoring, and maintaining CI/CD frameworks
  • Developing new capabilities, coordinating implementation across a large number of teams including infrastructure, developer tools and information security
  • Influencing a culture of Site Reliability Engineering. Engaging in training and mentoring to help develop other engineers with an SRE mindset
  • Providing the first line of after-deployment technical support at L1 and L2 level for applications and/or associated production systems diagnostics, and network health monitoring
  • Coordinating and/or deploying hands-on fixes, patches and software updates at the application level, and as appropriate at the network level
  • Managing a team of technical support engineers who provide technical support to users
  • Escalating complex problems to the L3 level of expertise within organization, along with observations from investigative and diagnostic assessments
  • Coordinating the investigation of repeated technical issues affecting user systems and seeing them through to resolution
  • Escalating, resolving, guiding team, and tracking production incidents to closure
What we offer:
  • Competitive base salary (which is annually reviewed)
  • Hybrid working model (up to 2 days working at home per week)
  • Additional benefits to support you and your family to be well, live well and save well.
  • Fulltime

Director, Service Reliability Engineering

As Director of SRE, you will lead the team responsible for accelerating and auto...
Location:
United States, Bethesda
Salary:
125600.00 - 203700.00 USD / Year
Marriott Bonvoy
Expiration Date:
Until further notice
Requirements:
  • Undergraduate degree in computer science, software engineering, or a related field (or equivalent experience)
  • 10+ years of experience in SRE, devsecops or IT operations
  • At least 5 years’ experience in a previous leadership role within SRE, devsecops or IT Operations
  • At least five years of experience in the following technologies: Presentation Management (HTML, CSS, JS, Backbone, Node JS, Android, iOS); Application Platforms (NGINX, Java, Akana, Play Framework, Tomcat, Docker, Openshift); Application Data (PostgreSQL, Couchbase, Cassandra); Integration Services (Apache Kafka, Apache Spark, Akana); Analytics Platforms (Hadoop, dashDB, Cognos, Tableau); Security (Forgerock, OpenID, OAUTH, Ping Identity); Public Cloud (Azure, Google Cloud, AliCloud, Amazon Web Services); CI/CD (Harness)
  • Experience with test automation
  • Working knowledge and proven track record of implementing disaster indifferent architecture
  • Experience with CDN and Akamai tools
  • Linux/Unix system administration experience
  • Proficient in scripting and programming languages (like Python, Go, Bash, Shell)
  • Hands on experience with infrastructure as code (like Terraform), container orchestration (like Kubernetes), and reliability automation
Job Responsibility:
  • Define and execute Marriott’s SRE vision, aligning with business objectives and technology roadmaps
  • Build, mentor and lead a high-performing SRE team, fostering a culture of collaboration and innovation
  • Establish reliability, observability and automation goals to improve system uptime, performance and scalability
  • Partner with engineering, operations and security teams to drive best practices and continuous improvement
  • Implement reliability-focused engineering practices, including SLAs, SLOs/SLIs and error budgets
  • Design and maintain resilient, scalable and fault-tolerant architectures across cloud and hybrid environments
  • Develop strategies to proactively identify and mitigate risks to system performance and availability
  • Drive root cause analysis (RCA) and post-mortem processes to prevent recurring incidents
  • Champion automation in monitoring, deployment and incident resolution to reduce toil and enhance efficiency
  • Lead and optimize incident response processes, ensuring rapid detection, diagnosis, and resolution of system failures
What we offer:
  • Bonus program
  • comprehensive health care benefits
  • 401(k) plan with up to 5% company match
  • employee stock purchase plan at 15% discount
  • accrued paid time off (including sick leave where applicable)
  • life insurance
  • group disability insurance
  • travel discounts
  • adoption assistance
  • paid parental leave
  • Fulltime

Lead SRE

We have a 6 month contract to hire for a senior, hands-on Site Reliability Engin...
Location:
United States, St Louis
Salary:
Not provided
Zeektek
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree
  • AWS Certified DevOps Engineer – Professional
  • Dynatrace Professional
  • One SaaS tool certification (Prometheus Certified Associate (PCA), Datadog, New Relic)
  • 7+ years in SRE/Production Engineering/Platform roles
  • 2+ years leading initiatives or teams
  • Strong in Linux, networking fundamentals (HTTP, TLS, DNS, TCP), and distributed systems concepts
  • Proficiency with Go, Python, Shell Scripting, SQL, Java or JVM, JavaScript/TypeScript, YAML/HCL/JSON
  • Hands-on with IaC (Terraform) and CI/CD (GitLab CI, GitHub Actions, AWS/Azure DevOps)
  • Deep experience in AWS Cloud infrastructure
Job Responsibility:
  • Lead SRE to drive reliability, scalability, observability (monitoring & alerts) and performance across the production platforms
  • Own the SLO/SLI strategy, modernize observability and incident response, and partner with application teams to deliver resilient systems
  • Define and govern SLOs/SLIs/error budgets for critical services; enforce guardrails and drive reliability roadmaps
  • Lead performance tuning collaboration with application teams to ensure high availability and low latency
  • Define and own infrastructure tuning to ensure scalability leading to high availability
  • Lead metrics- and automation-driven reliability
  • Debug systems across layers
  • Architect and evolve CI/CD and infrastructure-as-code (IaC, Terraform)
  • Design and build serverless APIs (Lambda, API Gateway, SQS, SNS, DynamoDB, etc.)
What we offer:
  • Weekly Direct Deposit
  • 401K Matching
  • Competitive medical, dental and vision insurance
  • Consistent communication throughout your project
  • ZeekTek Referral Program
  • Fulltime

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Location:
India, Bangalore
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 10+ years in systems engineering
  • 5+ years in SRE or DevOps roles
  • expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • proficiency in programming and scripting languages like Python, Go, and Bash
  • advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • deep understanding of networking, DNS, load balancing, and security principles
  • proven track record of managing high-availability systems in demanding environments
  • exceptional analytical and problem-solving skills
Job Responsibility:
  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • guide architectural decisions that drive innovation and enhance system reliability
What we offer:
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • a collaborative and innovative work environment that values your expertise and contributions
  • professional growth and leadership development pathways tailored to your aspirations
  • a chance to leave a lasting impact by shaping the future of reliable and scalable systems

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...
Location:
Mexico, Miguel Hidalgo
Salary:
Not provided
Pepsico
Expiration Date:
Until further notice
Requirements:
  • 8+ years of work experience evolving into an SRE engineer role
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
  • Highly quantitative, with great judgment, able to connect dots across ecosystems and work efficiently across functions and teams
  • Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
  • Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, Postgres), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka and any SRE Ops toolsets
  • A firm understanding of cloud architecture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
  • Back-end technologies: Server-side languages (Java, Spring Boot, and related technologies that build the server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase)
Job Responsibility:
  • Drive new shift-left activities critical to applying Site Reliability Engineering (SRE) and quality assurance principles within the application design / project roadmap that enables resilient outcomes
  • Apply pre-emptive approach into production minimizing business impact, via SRE-driven orchestration of connecting all components of the ecosystem diagnosing anomalies prior to user & remediating through automation
  • Ensure ecosystem availability and performance in production environments, proactively preventing P1, P2, and potential P3s
  • Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing events, logging, monitoring, and observability standards across applications
  • Accountable for ensuring non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early into the product’s offerings as part of the engineering solution
  • Leads the team diagnosing any anomalies prior to any user and driving the necessary remediations across the teams involved in end-to-end ecosystem availability, performance and consumption of the cloud architected application ecosystem leveraging SRE Orchestration solutions
  • Collaborates with Engineering & support teams, including participation in escalations, and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose and improve the end-to-end ecosystem performance of the modern architected application portfolio, i.e. a technical "understanding of interactions" of a full-stack application, alongside peer SRE team members
  • Continuously optimize the L2/support operations work via SRE workflow automation
What we offer:
  • Opportunities to learn and develop every day through a wide range of programs
  • Internal digital platforms that promote self-learning
  • Development programs according to Leadership skills
  • Specialized training according to the role
  • Learning experiences with internal and external providers
  • Recognition programs for seniority, behavior, leadership, moments of life, among others
  • Financial wellness programs that will help you reach your goals in all stages of life
  • A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
  • Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Engineering Manager, Infrastructure Engineering

This is not a traditional SRE or DevOps role. Whatnot's Reliability Engineering ...
Location:
Poland, Kraków
Salary:
Not provided
Whatnot
Expiration Date:
Until further notice
Requirements:
  • 10+ years of experience in infrastructure or platform engineering
  • 5+ years managing engineering teams
  • Experience leading managers or multiple teams a plus
  • Proven track record building and operating large-scale distributed systems with strong reliability, observability, and incident response practices
  • Deep technical grounding in one or more of: SLO design, monitoring/alerting, incident tooling, traffic control mechanisms, load and chaos testing, or platform engineering
  • Experience leading teams that ship developer-facing platforms, frameworks, or internal tools
  • Strong software engineering fundamentals
  • Demonstrated ability to guide teams through complex system challenges, large-scale migrations, and longer-term reliability initiatives
  • Exceptional communication and leadership skills
  • A passion for enabling teams to build fast while building safely through well-designed tooling and proactive detection mechanisms
Job Responsibility:
  • Lead and mentor a team of highly skilled software engineers, supporting their technical growth, execution, and long-term career development
  • Set technical direction and quality standards for the team while empowering senior ICs to own design and architecture decisions
  • Develop and execute the strategic roadmap for reliability engineering at Whatnot
  • Build and operationalize best practices that empower product and platform teams to design and run reliable systems
  • Own the strategic roadmap for reliability tooling, including incident response systems, SLO measurement platforms, and developer-facing reliability libraries
  • Lead the team in designing and building traffic control systems as reusable platform components
  • Lead the design and execution of load testing at scale
  • Drive continuous improvement in incident detection and mitigation
  • Collaborate with cross-functional teams to influence product and architectural decisions that improve overall reliability and customer impact
  • Partner with Infrastructure and Engineering leadership to shape reliability strategy and investment priorities across the organization
  • Fulltime

Engineering Lead Analyst

The Engineering Lead Analyst is a senior level position responsible for leading ...
Location:
India, Pune
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • 6-10 years of relevant experience in an Engineering role
  • Experience working in Financial Services or a large complex and/or global environment
  • Project Management experience
  • Consistently demonstrates clear and concise written and verbal communication
  • Comprehensive knowledge of design metrics, analytics tools, benchmarking activities and related reporting to identify best practices
  • Demonstrated analytic/diagnostic skills
  • Ability to work in a matrix environment and partner with virtual teams
  • Ability to work independently, multi-task, and take ownership of various parts of a project or initiative
  • Ability to work under pressure and manage to tight deadlines or unexpected changes in expectations or requirements
  • Proven track record of operational process change and improvement
Job Responsibility:
  • Serve as a technology subject matter expert for internal and external stakeholders
  • Provide direction for all firm mandated controls and compliance initiatives
  • Lead projects within the group and create a technology domain roadmap
  • Ensure that all integration of functions meet business goals
  • Define necessary system enhancements to deploy new products and process enhancements
  • Recommend product customization for system integration
  • Identify problem causality, business impact and root causes
  • Exhibit knowledge of how own specialty area contributes to the business
  • Apply knowledge of competitors, products and services
  • Advise or mentor junior team members
  • Fulltime