
Senior Site Reliability Engineer (SRE)

India, Chennai · Job Posted May 30, 2026

Job Description

The Senior SRE is responsible for deployment, updates, and operational support for environments hosting Dalet’s cloud-based solutions. This role ensures operational excellence, a seamless client experience, and continuous improvement across infrastructure and delivery processes. The ideal candidate combines strong technical capabilities with the ability to lead delivery through influence and hands-on engineering expertise.

Job Responsibility

  • Act as a senior technical authority for APAC Site Reliability Engineering activities
  • Drive best practices in reliability, operations, and engineering standards
  • Promote technical excellence, collaboration, and accountability across stakeholders
  • Make infrastructure complexity transparent to both internal teams and customers, ensuring a consistently excellent client experience
  • Implement, track, and evolve service performance measures such as SLAs, SLOs, and SLIs
  • Anticipate risks related to service availability, capacity, performance regressions, and security vulnerabilities
  • Drive continuous improvement, including leading and facilitating Root Cause Analysis (RCA) activities
  • Ensure timely execution of deployments, upgrades, maintenance activities, and change requests
  • Anticipate workload, plan deliverables, and ensure qualification/validation of upcoming tasks
  • Collaborate closely with engineering to improve platform components, automation, and operational processes
  • Control and optimise infrastructure and operational expenditure across cloud and on-prem environments
  • Drive automation initiatives and enhance monitoring to improve reliability and reduce manual effort
  • Improve deployment workflows and operational tooling to support scalability and efficiency
  • Communicate proactively with internal stakeholders including Pre-Sales, Project Management, CSM, and TAM teams
  • Engage with customers as needed to ensure clarity, trust, and high-quality service.

Requirements

  • Cloud platforms: AWS, Azure
  • Containerisation & Orchestration: Kubernetes
  • Infrastructure as Code: Terraform
  • Configuration Management: Ansible
  • Packaging & Deployment: Helm
  • Databases: MariaDB, MongoDB
  • Monitoring, observability, networking, and cloud security.

What we offer

  • Great career opportunities around the world
  • Truly collaborative environment with supportive leadership
  • Cutting-edge technologies (AI, Cloud, Cybersecurity...)
  • Talented and passionate team members
  • Fun working environment

Similar Jobs for Senior Site Reliability Engineer (SRE)

8 matching positions

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, ...
Location: United States, Austin
Salary: Not provided
Company: Dutech Systems
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in SRE, DevOps, or Systems Engineering
  • Strong expertise in Linux/Unix systems and system internals
  • Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
  • Experience designing and operating distributed systems
  • Hands-on experience with cloud platforms (AWS or GCP)
  • Experience with Docker and Kubernetes
  • Strong understanding of monitoring, alerting, and logging concepts
  • Experience managing SLIs, SLOs, and error budgets
  • Experience with incident management and RCA processes
Job Responsibility
  • Design, implement, and manage highly available, distributed systems
  • Maintain and optimize cloud infrastructure (AWS/GCP)
  • Develop automation scripts using Python, Go, Java, or Bash
  • Manage containerized environments using Docker and Kubernetes
  • Define and monitor SLIs, SLOs, and error budgets
  • Implement monitoring, logging, and alerting solutions
  • Lead incident management, root cause analysis (RCA), and postmortems
  • Ensure system security and compliance within operational workflows
  • Improve system reliability through performance tuning and optimization
  • Collaborate with engineering teams to enhance deployment and release processes

Senior Site Reliability Engineer Manager

RemoteStar is looking to hire a Senior Site Reliability Engineering Manager on b...
Location: United Kingdom of Great Britain and Northern Ireland, London
Salary: Not provided
Company: RemoteStar
Expiration Date: Until further notice
Requirements
  • Proven experience in a senior or lead SRE role, with a strong track record of building and maintaining highly reliable infrastructure and services.
  • Expertise in incident management, including incident response, resolution, and post-mortem analysis.
  • Proficiency in monitoring, alerting, and observability tools such as Prometheus, Grafana, ELK stack or Datadog.
  • Experience with cloud platforms such as AWS, Azure, or GCP, including infrastructure as code tools like Terraform or CloudFormation.
  • Strong scripting and automation skills, with proficiency in languages such as Python, Bash, or Go.
  • Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams in a remote environment.
  • Demonstrated leadership capabilities, with a passion for mentoring and developing team members.
Job Responsibility
  • Take full ownership of the production estate from both a technical and process perspective.
  • Ensure consistent, smooth operation of live systems and drive resolution of all on-call support issues.
  • Design and operate a new incident tracking process to ensure root causes are found and remediated in a timely fashion by the development team.
  • Create and maintain high-end monitoring and automation tooling.
  • Drive automation initiatives to streamline operational workflows and improve efficiency.
  • Develop and maintain tools, scripts, and dashboards to monitor system health, performance, and reliability.
  • Build a first-class SRE team.
  • Through a combination of leading by example, coaching, and mentoring, mould the team you would want to have around you.
  • Provide leadership and guidance to the SRE team, fostering a culture of collaboration, innovation, and continuous improvement.
What we offer
  • Dynamic working environment in an extremely fast-growing company
  • Work in an international environment
  • Work in a pleasant environment with very little hierarchy
  • Intellectually challenging work with a major role in clients’ success and scalability
  • Flexible working hours
  • Full-time

Senior Site Reliability Engineer

Embark on a transformative journey as a Senior Site Reliability Engineer - AVP. ...
Location: United States, Whippany
Salary: 120000.00 - 175000.00 USD / Year
Company: Barclays
Expiration Date: Until further notice
Requirements
  • Considerable programming expertise in languages such as Python, Java, and others
  • Practical experience with Infrastructure as Code (IaC) tools, including Ansible, Chef, and Terraform
  • Validated experience with observability and monitoring platforms such as Observe, Elastic, InfluxDB, and Grafana
  • Solid understanding of containerization technologies and Unix/Linux environments
  • Demonstrates a Site Reliability Engineering (SRE) mindset, with good analytical skills, ownership, and a forward-thinking approach to problem-solving
Job Responsibility
  • Build and maintain infrastructure platforms and products that support applications and data systems
  • Ensure the reliability, availability, and scalability of the systems, platforms, and technology
  • Development, delivery, and maintenance of high-quality infrastructure solutions
  • Monitoring of IT infrastructure and system performance to measure, identify, address, and resolve any potential issues, vulnerabilities, or outages
  • Development and implementation of automated tasks and processes to improve efficiency and reduce manual intervention
  • Implementation of a secure configuration and measures to protect infrastructure against cyber-attacks, vulnerabilities, and other security threats
  • Cross-functional collaboration with product managers, architects, and other engineers to define IT Infrastructure requirements
  • Stay informed of industry technology trends and innovations
What we offer
  • medical, dental and vision coverage
  • 401(k)
  • life insurance
  • other paid leave for qualifying circumstances
  • Full-time

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...
Location: United States
Salary: 113082.00 - 175725.00 USD / Year
Company: Wikimedia Foundation
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience designing and managing infrastructure security for large fleets of diverse services
  • Experience with technical response during security incidents
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
Job Responsibility
  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Work closely with product teams helping them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
  • Ability and willingness to travel 1-2 times a year for in-person events and team meetings
  • Most importantly, share our values and work in accordance with them
  • Full-time

Senior Site Reliability Engineer

AutoRABIT is the leader in DevSecOps for SaaS platforms such as Salesforce. Its ...
Location: India, Hyderabad
Salary: Not provided
Company: AutoRABIT
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in SRE, DevOps, or related roles
  • Solid hands-on experience with AWS services (EKS, ECS, EC2, RDS, S3, Redis, etc.)
  • Proficient in writing Terraform infrastructure scripts
  • Strong scripting skills in Python using Boto3
  • Deep understanding of monitoring/logging tools (ELK, CloudWatch, TrendMicro)
  • Experience building and managing CI/CD pipelines (CodeBuild, CodePipeline)
  • Knowledge of infrastructure security and incident response practices
  • Willing to work in rotational shifts and rotational week-offs
  • Bachelor’s degree in computers or a related field
  • AWS certifications are preferred
Job Responsibility
  • Provision and manage AWS infrastructure using Terraform
  • Write AWS Lambda functions (Python3 + Boto3) to automate operational tasks
  • Set up monitoring, logging, and alerting with ELK, TrendMicro, and AWS CloudWatch
  • Configure alerts for performance and security anomalies
  • Develop and maintain CI/CD pipelines using AWS CodeBuild and CodePipeline
  • Troubleshoot production issues and contribute to blameless postmortems
  • Contribute to system hardening and security compliance efforts
  • Adhere to established internal controls
  • Full-time

Senior Site Reliability Engineer

Senior Site Reliability Engineer (SRE). This role has been designed as ‘’Onsite’...
Salary: Not provided
Company: Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field
  • 6–10+ years of experience in DevOps, Site Reliability Engineering, or cloud infrastructure roles
  • Strong hands-on experience with cloud platforms (AWS or GCP) including services like EC2/GCE, IAM, and object storage (S3/GCS)
  • Experience with containerization and orchestration technologies, especially Docker and Kubernetes
  • Experience building and managing CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab
  • Experience with monitoring and observability tools such as Prometheus, CloudWatch, or Stackdriver
  • Strong understanding of Linux systems administration and configuration management tools like Ansible
  • Experience managing distributed systems and streaming platforms such as Kafka, Cassandra, Elasticsearch, Spark, Flink, or Storm
  • Strong automation and scripting skills using Python, Go, Rust, or Shell scripting
  • Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation
Job Responsibility
  • Ensure high availability, reliability, and performance of large-scale cloud infrastructure across AWS and GCP environments
  • Operate and support infrastructure components and distributed data platforms such as Kubernetes, Kafka, Flink, Storm, and Spark
  • Manage and maintain databases including Cassandra, Elasticsearch, Redis, Postgres, and ArangoDB
  • Monitor systems, troubleshoot issues, and resolve production incidents across microservices and distributed systems
  • Collaborate closely with software engineering teams to debug and resolve complex production problems
  • Participate in 24x7 on-call rotation supporting multi-cloud production environments
  • Monitor system metrics, application performance, and infrastructure health using observability tools
  • Own the incident management lifecycle, including detection, mitigation, Root Cause Analysis (RCA), and post-incident reviews
  • Develop and maintain runbooks, automation, and operational processes to improve reliability and efficiency
  • Perform capacity planning using system usage and performance data
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion

Senior Site Reliability Engineer

We are seeking a highly skilled and passionate Senior Site Reliability Engineer ...
Location: Spain; Portugal; United Kingdom
Salary: Not provided
Company: Parser Limited
Expiration Date: Until further notice
Requirements
  • Deep SRE Expertise: Proven experience as a Senior Site Reliability Engineer or a similar role, with a strong understanding of SRE principles (error budgets, SLOs/SLIs, toil reduction)
  • Azure Cloud Proficiency: Extensive hands-on experience designing, deploying, and operating highly available and scalable applications on Microsoft Azure
  • Azure Kubernetes Service (AKS) Expertise: Mandatory extensive hands-on experience with AKS for container orchestration, including deployment, scaling, monitoring, and troubleshooting
  • Java Ecosystem Mastery: Expert-level proficiency with Java, including experience with modern frameworks (ideally Micronaut, Spring Boot, or similar) and JVM performance tuning
  • Distributed Systems Knowledge: Solid understanding and practical experience with distributed systems, microservices architecture, and associated challenges (e.g., consistency, fault tolerance)
  • Messaging & Database Expertise: Hands-on experience with an event streaming platform (ideally Kafka) and NoSQL data storage (ideally Couchbase), including operational best practices
  • Automation First Mindset: Strong scripting skills (e.g., Python, Bash) and experience with Infrastructure as Code tools (e.g., Terraform, ARM templates) and CI/CD pipelines (e.g., Azure DevOps, Jenkins)
  • Observability Tools: Experience with monitoring, logging, and alerting tools (e.g., Azure Monitor, Prometheus, Grafana, ELK Stack, Splunk)
  • Problem-Solving Acumen: Exceptional analytical and troubleshooting skills, with a methodical approach to diagnosing and resolving complex production issues
  • Communication & Collaboration: Excellent communication skills, with the ability to articulate complex technical concepts to diverse audiences and collaborate effectively with cross-functional teams
Job Responsibility
  • Architect and Implement Reliability: Design, build, and maintain highly scalable, resilient, and performant systems on Azure, focusing on our Java, Kafka, and Couchbase stack
  • Drive Modernisation: Work hands-on as part of the team spearheading the adoption of Micronaut, standardising application templates, and transitioning to managed cloud services
  • Enhance Operational Excellence: Develop and implement strategies for improving system observability (standardised logging, metrics, tracing), alerting, and on-call practices
  • Automate Everything: Champion automation across the software development lifecycle (SDLC), from CI/CD pipelines to infrastructure provisioning, focusing on accelerating delivery and de-risking deployments
  • Incident Management & Learning: Contribute to our mature, blameless post-incident review process, identifying root causes and implementing preventative measures to reduce incident hours
  • Tooling & Standards: Develop, maintain, and drive the adoption of shared, standardised SRE tooling and best practices across engineering teams, including containerisation (e.g., Docker, Kubernetes on Azure), infrastructure as code (e.g., Terraform), and configuration management
  • Mentorship & Collaboration: Provide technical leadership and mentorship to junior engineers, fostering a culture of SRE principles and operational excellence across the wider engineering organisation
  • Strategic Input: Contribute to the overall technical strategy and roadmap for our SRE and platform initiatives, ensuring alignment with business objectives
What we offer
  • The chance to join an organization with triple-digit growth that is changing the paradigm on how software products are built
  • The opportunity to form part of an amazing, multicultural community of tech experts
  • A highly competitive compensation package
  • Medical insurance
  • English lessons
  • Full-time

Senior Site Reliability Engineer

We are looking for a Senior Site Reliability Engineer (SRE) with deep experience...
Location: United States, Chicago
Salary: 131000.00 USD / Year
Company: Realign
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in SRE, DevOps, or Cloud Engineering
  • Expertise in AWS core services (EC2, ECS/EKS, Lambda, S3, VPC, RDS, IAM, CloudFront, etc.)
  • Hands-on experience with Terraform, Ansible, or other IaC tools
  • Strong scripting/coding skills (Python, Go, Shell, etc.)
  • Experience with Kubernetes, containerization, and orchestration
  • Deep knowledge of Linux systems and networking
  • Experience with Service Meshes (e.g., Istio, App Mesh)
  • Familiarity with AWS Well-Architected Framework
  • Experience building self-healing systems and automated remediation
  • Background in security, compliance, or multi-account/multi-region AWS architectures
Job Responsibility
  • Design, implement, and maintain scalable, secure, and highly available infrastructure on AWS
  • Develop and improve CI/CD pipelines and Infrastructure as Code (IaC) using Terraform and Harness
  • Own and implement monitoring, alerting, logging, and distributed tracing with tools like Dynatrace or Datadog
  • Troubleshoot production incidents, conduct blameless postmortems, and improve incident response processes
  • Optimize systems for cost, performance, and reliability
  • Drive chaos engineering and resilience testing
  • Collaborate with development teams to embed SRE practices like SLAs, SLOs, and error budgets
  • Mentor junior SREs and promote DevOps/SRE culture across the org
  • Full-time