Senior Site Reliability Engineer (SRE) Job at Dalet (Chennai)

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, ...

Location

United States, Austin

Salary:

Not provided

Dutech Systems

Expiration Date

Until further notice

Requirements

8+ years of experience in SRE, DevOps, or Systems Engineering
Strong expertise in Linux/Unix systems and system internals
Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
Experience designing and operating distributed systems
Hands-on experience with cloud platforms (AWS or GCP)
Experience with Docker and Kubernetes
Strong understanding of monitoring, alerting, and logging concepts
Experience managing SLIs, SLOs, and error budgets
Experience with incident management and RCA processes

Job Responsibility

Design, implement, and manage highly available, distributed systems
Maintain and optimize cloud infrastructure (AWS/GCP)
Develop automation scripts using Python, Go, Java, or Bash
Manage containerized environments using Docker and Kubernetes
Define and monitor SLIs, SLOs, and error budgets
Implement monitoring, logging, and alerting solutions
Lead incident management, root cause analysis (RCA), and postmortems
Ensure system security and compliance within operational workflows
Improve system reliability through performance tuning and optimization
Collaborate with engineering teams to enhance deployment and release processes


Senior Site Reliability Engineer Manager

RemoteStar is looking to hire a Senior Site Reliability Engineering Manager on b...

Location

United Kingdom of Great Britain and Northern Ireland, London

Salary:

Not provided

Remotestar

Expiration Date

Until further notice

Requirements

Proven experience in a senior or lead SRE role, with a strong track record of building and maintaining highly reliable infrastructure and services.
Expertise in incident management, including incident response, resolution, and post-mortem analysis.
Proficiency in monitoring, alerting, and observability tools such as Prometheus, Grafana, ELK stack or Datadog.
Experience with cloud platforms such as AWS, Azure, or GCP, including infrastructure as code tools like Terraform or CloudFormation.
Strong scripting and automation skills, with proficiency in languages such as Python, Bash, or Go.
Excellent communication and collaboration skills, with the ability to work effectively with cross-functional teams in a remote environment.
Demonstrated leadership capabilities, with a passion for mentoring and developing team members.

Job Responsibility

Take full ownership of the production estate from both a technical and process perspective.
Ensure the consistently smooth operation of live systems and drive the resolution of all on-call support issues.
Design and operate a new incident tracking process to ensure root causes are found and remediated in a timely fashion by the development team.
Create and maintain high-end monitoring and automation tooling.
Drive automation initiatives to streamline operational workflows and improve efficiency.
Develop and maintain tools, scripts, and dashboards to monitor system health, performance, and reliability.
Build a first-class SRE team.
Through a combination of leading by example, coaching, and mentoring, mould the team you would want to have around you.
Provide leadership and guidance to the SRE team, fostering a culture of collaboration, innovation, and continuous improvement.

What we offer

Dynamic working environment in an extremely fast-growing company
Work in an international environment
Work in a pleasant environment with very little hierarchy
Intellectually challenging work that plays a massive role in clients’ success and scalability
Flexible working hours

Fulltime

Senior Site Reliability Engineer

Embark on a transformative journey as a Senior Site Reliability Engineer - AVP. ...

Location

United States, Whippany

Salary:

120000.00 - 175000.00 USD / Year

Barclays

Expiration Date

Until further notice

Requirements

Considerable programming expertise in languages such as Python, Java, and others
Practical experience with Infrastructure as Code (IaC) tools, including Ansible, Chef, and Terraform
Validated experience with observability and monitoring platforms such as Observe, Elastic, InfluxDB, and Grafana
Solid understanding of containerization technologies and Unix/Linux environments
Demonstrates a Site Reliability Engineering (SRE) mindset, with good analytical skills, ownership, and a forward-thinking approach to problem-solving

Job Responsibility

Build and maintain infrastructure platforms and products that support applications and data systems
Ensure the reliability, availability, and scalability of the systems, platforms, and technology
Development, delivery, and maintenance of high-quality infrastructure solutions
Monitoring of IT infrastructure and system performance to measure, identify, address, and resolve any potential issues, vulnerabilities, or outages
Development and implementation of automated tasks and processes to improve efficiency and reduce manual intervention
Implementation of a secure configuration and measures to protect infrastructure against cyber-attacks, vulnerabilities, and other security threats
Cross-functional collaboration with product managers, architects, and other engineers to define IT Infrastructure requirements
Stay informed of industry technology trends and innovations

What we offer

medical, dental and vision coverage
401(k)
life insurance
other paid leave for qualifying circumstances

Fulltime

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...

Location

United States

Salary:

113082.00 - 175725.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

6+ years of experience in an SRE/Operations/DevOps role as part of a team
Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
Experience designing and managing infrastructure security for large fleets of diverse services
Experience with technical response during security incidents
Experience with package management on Linux systems (we use Debian)
Strong Linux system-level troubleshooting skills
History of automating tasks and processes, identifying process gaps, and finding automation opportunities
Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones

Job Responsibility

Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
Work closely with product teams helping them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
Collaborating with a global, cross-functional team in an asynchronous communication environment
Mentoring peers in your areas of technical and operational strength
Ability and willingness to travel 1-2 times a year for in-person events and team meetings
Most importantly, share our values and work in accordance with them

Fulltime

Senior Site Reliability Engineer

AutoRABIT is the leader in DevSecOps for SaaS platforms such as Salesforce. Its ...

Location

India, Hyderabad

Salary:

Not provided

AutoRABIT

Expiration Date

Until further notice

Requirements

6+ years of experience in SRE, DevOps, or related roles
Solid hands-on experience with AWS services (EKS, ECS, EC2, RDS, S3, Redis, etc.)
Proficient in writing Terraform infrastructure scripts
Strong scripting skills in Python using Boto3
Deep understanding of monitoring/logging tools (ELK, CloudWatch, TrendMicro)
Experience building and managing CI/CD pipelines (CodeBuild, CodePipeline)
Knowledge of infrastructure security and incident response practices
Willing to work in rotational shifts and rotational week-offs
Bachelor’s degree in computer science or a related field
AWS certifications are preferred

Job Responsibility

Provision and manage AWS infrastructure using Terraform
Write AWS Lambda functions (Python3 + Boto3) to automate operational tasks
Set up monitoring, logging, and alerting with ELK, TrendMicro, and AWS CloudWatch
Configure alerts for performance and security anomalies
Develop and maintain CI/CD pipelines using AWS CodeBuild and CodePipeline
Troubleshoot production issues and contribute to blameless postmortems
Contribute to system hardening and security compliance efforts
Adhere to established internal controls

Fulltime

Senior Site Reliability Engineer

Senior Site Reliability Engineer (SRE). This role has been designed as ‘’Onsite’...

Location

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field
6–10+ years of experience in DevOps, Site Reliability Engineering, or cloud infrastructure roles
Strong hands-on experience with cloud platforms (AWS or GCP) including services like EC2/GCE, IAM, and object storage (S3/GCS)
Experience with containerization and orchestration technologies, especially Docker and Kubernetes
Experience building and managing CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab
Experience with monitoring and observability tools such as Prometheus, CloudWatch, or Stackdriver
Strong understanding of Linux systems administration and configuration management tools like Ansible
Experience managing distributed systems and streaming platforms such as Kafka, Cassandra, Elasticsearch, Spark, Flink, or Storm
Strong automation and scripting skills using Python, Go, Rust, or Shell scripting
Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation

Job Responsibility

Ensure high availability, reliability, and performance of large-scale cloud infrastructure across AWS and GCP environments
Operate and support infrastructure components and distributed data platforms such as Kubernetes, Kafka, Flink, Storm, and Spark
Manage and maintain databases including Cassandra, Elasticsearch, Redis, Postgres, and ArangoDB
Monitor systems, troubleshoot issues, and resolve production incidents across microservices and distributed systems
Collaborate closely with software engineering teams to debug and resolve complex production problems
Participate in 24x7 on-call rotation supporting multi-cloud production environments
Monitor system metrics, application performance, and infrastructure health using observability tools
Own the incident management lifecycle, including detection, mitigation, Root Cause Analysis (RCA), and post-incident reviews
Develop and maintain runbooks, automation, and operational processes to improve reliability and efficiency
Perform capacity planning using system usage and performance data

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Senior Site Reliability Engineer

We are seeking a highly skilled and passionate Senior Site Reliability Engineer ...

Location

Spain; Portugal; United Kingdom

Salary:

Not provided

Parser Limited

Expiration Date

Until further notice

Requirements

Deep SRE Expertise: Proven experience as a Senior Site Reliability Engineer or a similar role, with a strong understanding of SRE principles (error budgets, SLOs/SLIs, toil reduction)
Azure Cloud Proficiency: Extensive hands-on experience designing, deploying, and operating highly available and scalable applications on Microsoft Azure
Azure Kubernetes Service (AKS) Expertise: Mandatory extensive hands-on experience with AKS for container orchestration, including deployment, scaling, monitoring, and troubleshooting
Java Ecosystem Mastery: Expert-level proficiency with Java, including experience with modern frameworks (ideally Micronaut, Spring Boot, or similar) and JVM performance tuning
Distributed Systems Knowledge: Solid understanding and practical experience with distributed systems, microservices architecture, and associated challenges (e.g., consistency, fault tolerance)
Messaging & Database Expertise: Hands-on experience with an event streaming platform (ideally Kafka) and NoSQL data storage (ideally Couchbase), including operational best practices
Automation First Mindset: Strong scripting skills (e.g., Python, Bash) and experience with Infrastructure as Code tools (e.g., Terraform, ARM templates) and CI/CD pipelines (e.g., Azure DevOps, Jenkins)
Observability Tools: Experience with monitoring, logging, and alerting tools (e.g., Azure Monitor, Prometheus, Grafana, ELK Stack, Splunk)
Problem-Solving Acumen: Exceptional analytical and troubleshooting skills, with a methodical approach to diagnosing and resolving complex production issues
Communication & Collaboration: Excellent communication skills, with the ability to articulate complex technical concepts to diverse audiences and collaborate effectively with cross-functional teams

Job Responsibility

Architect and Implement Reliability: Design, build, and maintain highly scalable, resilient, and performant systems on Azure, focusing on our Java, Kafka, and Couchbase stack
Drive Modernisation: Work hands-on as part of the team spearheading the adoption of Micronaut, standardising application templates, and transitioning to managed cloud services
Enhance Operational Excellence: Develop and implement strategies for improving system observability (standardised logging, metrics, tracing), alerting, and on-call practices
Automate Everything: Champion automation across the software development lifecycle (SDLC), from CI/CD pipelines to infrastructure provisioning, focusing on accelerating delivery and de-risking deployments
Incident Management & Learning: Contribute to our mature, blameless post-incident review process, identifying root causes and implementing preventative measures to reduce incident hours
Tooling & Standards: Develop, maintain, and drive the adoption of shared, standardised SRE tooling and best practices across engineering teams, including containerisation (e.g., Docker, Kubernetes on Azure), infrastructure as code (e.g., Terraform), and configuration management
Mentorship & Collaboration: Provide technical leadership and mentorship to junior engineers, fostering a culture of SRE principles and operational excellence across the wider engineering organisation
Strategic Input: Contribute to the overall technical strategy and roadmap for our SRE and platform initiatives, ensuring alignment with business objectives

What we offer

The chance to join an organization with triple-digit growth that is changing the paradigm on how software products are built
The opportunity to form part of an amazing, multicultural community of tech experts
A highly competitive compensation package
Medical insurance
English lessons

Fulltime

Senior Site Reliability Engineer

We are looking for a Senior Site Reliability Engineer (SRE) with deep experience...

Location

United States, Chicago

Salary:

131000.00 USD / Year

Realign

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE, DevOps, or Cloud Engineering
Expertise in AWS core services (EC2, ECS/EKS, Lambda, S3, VPC, RDS, IAM, CloudFront, etc.)
Hands-on experience with Terraform, Ansible, or other IaC tools
Strong scripting/coding skills (Python, Go, Shell, etc.)
Experience with Kubernetes, containerization, and orchestration
Deep knowledge of Linux systems and networking
Experience with Service Meshes (e.g., Istio, App Mesh)
Familiarity with AWS Well-Architected Framework
Experience building self-healing systems and automated remediation
Background in security, compliance, or multi-account/multi-region AWS architectures

Job Responsibility

Design, implement, and maintain scalable, secure, and highly available infrastructure on AWS
Develop and improve CI/CD pipelines and Infrastructure as Code (IaC) using Terraform and Harness
Own and implement monitoring, alerting, logging, and distributed tracing with tools like Dynatrace/Datadog
Troubleshoot production incidents, conduct blameless postmortems, and improve incident response processes
Optimize systems for cost, performance, and reliability
Drive chaos engineering and resilience testing
Collaborate with development teams to embed SRE practices like SLAs, SLOs, and error budgets
Mentor junior SREs and promote DevOps/SRE culture across the org

Fulltime
