Site Reliability Engineer

India, Pune · Job Posted April 16, 2026

Job Description

The Site Reliability Engineer (SRE) is a strategic professional accountable for the daily operations, architectural resilience, and overall implementation of SRE principles in a complex, critical, large-scale, multi-disciplinary environment. This role requires a comprehensive understanding of multiple technology domains and how they interact to achieve business objectives. As a recognized technical authority, you will apply an in-depth understanding of the business impact of technical contributions and provide advice and counsel on strategic solutions.

We are seeking a passionate and experienced SRE to join our Production Management team. In this role, you will be instrumental in enhancing the reliability, performance, and efficiency of our applications and services. You will drive our strategy for end-to-end observability and resiliency, collaborating across the organization to ensure our services are stable, scalable, and fault-tolerant. This is a key role that will influence strategic decisions and foster a culture of technical excellence and accountability.

Job Responsibility

  • Foster a culture of transparency, innovation, and accountability that encourages continuous improvement
  • Communicate the progress and impact of SRE initiatives to stakeholders at all levels
  • Operate effectively within a highly regulated environment, ensuring compliance with all relevant requirements
  • Ensure critical business applications meet stringent operational resilience requirements, including adherence to defined impact tolerances
  • Oversee advanced recovery testing, including Production Swing Tests, Data Recovery Tests, and chaos engineering practices
  • Drive the adoption and development of automation, such as One Touch Recovery solutions, to minimize recovery time
  • Partner with development teams to leverage cloud-native services and established resiliency patterns to enhance application reliability
  • Collaborate across the organization to develop and scale observability solutions using modern tools for metrics, logging, and tracing
  • Partner with development teams to effectively instrument applications, providing deep insights into system health and performance

Requirements

  • 10+ years of experience is a must-have
  • Significant professional experience in production management, software development, or an equivalent field, with a strong focus on Site Reliability Engineering
  • Expertise in analyzing complex application, database, network, and OS issues within large-scale, customer-facing systems
  • A service-oriented attitude combined with excellent problem-solving and strategic thinking skills
  • Strong communication and diplomacy skills, with a proven ability to work effectively across multiple business and technical teams
  • Deep understanding of SRE concepts, including SLOs, SLIs, error budgets, and toil reduction
  • Demonstrable experience with Disaster Recovery planning, resiliency testing, and fault-tolerant distributed system design
  • Proficiency in deploying, managing, and troubleshooting applications on OpenShift/Kubernetes
  • Hands-on experience with modern observability tools (e.g., Prometheus, Grafana, Loki, Mimir, Tempo, AppDynamics)
  • Experience with Infrastructure as Code (IaC), configuration management, and automation tools (e.g., Ansible, Terraform)
  • Experience creating, modifying, and managing Helm charts for application deployment
  • Bachelor’s/University degree; Master’s degree preferred

Nice to have

  • Experience with major public cloud providers (e.g., Google Cloud, AWS, Azure)
  • Proven experience delivering software and infrastructure using Agile frameworks
  • Experience presenting technical strategy to senior and executive-level audiences
  • Experience writing or maintaining code in Java, Python, Go, or similar languages

Similar Jobs for Site Reliability Engineer

8 matching positions

Site Reliability Engineer

As a Site Reliability Engineer, you are passionate about experience innovation a...
Location: India, Bengaluru
Salary: Not provided
Valtech (valtech.com)
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field
  • 2+ years in DevOps, SRE, or Support Engineering roles
  • Experience with incident management in high-traffic, public-facing platforms
  • Strong scripting skills (Python, Bash, or PowerShell)
  • Familiarity with CI/CD tools: GitHub Actions, Azure DevOps, GitLab, Jenkins
  • Experience with monitoring/APM tools: Datadog, New Relic, Dynatrace, Prometheus, Grafana
  • Basic knowledge of serverless services in AWS, Azure, or GCP
  • Proficiency with Docker and containerized environments
  • Excellent English communication skills (B2+ level)
  • Experience working in international, cross-cultural teams
Job Responsibility
  • Maintain and improve observability systems (monitoring, logging, alerting)
  • Define, adjust, and maintain Service Level Objectives (SLOs)
  • Participate in incident resolution and on-call rotations (max 1 week/month)
  • Drive proactive reliability improvements across platforms
  • Collaborate with teams to analyze failure scenarios and implement mitigations
  • Create and maintain runbooks for incident response and prevention
  • Eliminate non-value-adding tasks through automation and process optimization
What we offer
  • Flexibility, with hybrid work options (country-dependent)
  • Learning and development, with access to cutting-edge tools, training and industry experts
  • Fulltime

Site Reliability Engineer

NetApp is looking for a Senior TechOps Engineer - Cassandra to join our growing ...
Location: India, Bengaluru
Salary: Not provided
NetApp (netapp.com)
Expiration Date: Until further notice
Requirements
  • Strong experience in Apache Cassandra administration and architecture, with a desire to continuously learn and develop to an expert level
  • Experience in diagnosing and recommending mitigation strategies for Cassandra-related issues, including performance degradation due to resource bottlenecks, suboptimal data modeling leading to hot partitions, excessive tombstones, and inefficiencies caused by range slices and poorly constructed queries
  • Hands-on experience with Cassandra architecture and core administrative tasks, including compactions, repairs, backup and recovery, schema disagreement resolution, and configuration management
  • Experience handling Cassandra maintenance activities, including upgrades and migrations
  • Ability to investigate and research Cassandra issues by reviewing the Apache Cassandra codebase
  • Strong knowledge and experience with Linux, with the ability to work comfortably from the command line
  • Exceptional ability to communicate clearly and professionally in written and verbal English
  • Experience working with at least one public cloud platform, preferably AWS
  • Prior IT customer service or support experience within an ITIL-based environment
  • Strong fundamental computer science and software engineering skills, particularly in operating system internals, memory management, and networking
Job Responsibility
  • Your work will ensure the security, reliability, and performance of world-class systems and databases
  • You will collaborate with the technical teams of our customers, who are globally recognized companies in the gaming, banking, and logistics industries, ranging from large multinationals to emerging start-ups
What we offer
  • Volunteer time off
  • Well-being
  • Time away
  • Fulltime

Site Reliability Engineer

As Site Reliability Engineer you will contribute to the overarching implementati...
Location: Romania, Bucuresti
Salary: Not provided
NTT DATA (nttdata.com)
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or related field
  • Minimum 5 years proven work experience as a Reliability Engineer or similar role
  • Expert knowledge and hands-on experience with applications hosted on cloud platforms such as Google Cloud Platform as well as with Docker / Kubernetes in combination with Google Kubernetes Engine (GKE), Terraform or similar technology
  • Experience in resilient software development in Python/Java and the use of modern CI/CD pipelines, e.g. GitHub, GitHub Actions, Bitbucket, Helm
  • Strong experience in setting up observability, monitoring, and self-healing solutions, for instance with New Relic, Splunk, Google Cloud Operations, Lightstep, and Ansible
  • Very good knowledge of security standards (e.g. TLS, OAuth2, KMS, Vault, Admission Controllers, Let's Encrypt) and microservice architectures, and experience with API management using Apigee or WSO2
  • Proactive attitude and collaborative team-player mindset paired with self-confidence
  • Ability to stay calm and keep an eye for detail even in stressful situations where time matters
  • Having a creative approach towards solving technical problems
  • Excellent communication skills in English
Job Responsibility
  • Define Service Level Objectives (SLOs), and enable an end-to-end view on customer satisfaction based on best practices for setting up Service Level Indicators (SLIs) to create effective strategies for maintaining and improving system performance and availability
  • Collaborate with Business Functional Analysts and Solution Architects to find improvements in the solution design to improve the resilience of technical solutions early on
  • Consult and guide the squad on the prioritization of reliability improvement and actively deliver them as part of the sprint
  • Hands-on experience in implementing reliability and resilience patterns such as auto-scaling, circuit breakers, bulkheads, rate limiters, retry mechanisms, etc.
  • Actively work on service request fulfilment and incident and problem management to identify and reduce toil and MTTR using engineering best practices
  • Align with and contribute to state-of-the-art SRE best practices, e.g. distributed tracing, OpenTelemetry, and chaos engineering, together with the SRE chapter function
  • Be a knowledge and skill multiplier for your profession by acting as a lead for the Site Reliability Engineer population
  • Increase the seniority of the overall Site Reliability Engineer chapter by establishing events and procedures, and foster a culture of high standards
  • Lead the engineers in your profession and help them improve each day
What we offer
  • Smooth integration and a supportive mentor
  • Pick your working style: choose from Remote, Hybrid or Office work opportunities
  • Our projects have different working hours to suit your needs
  • Sponsored certifications, trainings and top e-learning platforms
  • Private Health Insurance – custom-made for you
  • Individual coaching sessions or accredited Coaching School
  • Epic parties or themed events – lovingly designed for our people and their families
  • Fulltime

Site Reliability Engineer

Build the tools and systems that make M365 sovereign cloud operations faster, sm...
Location: United States, Multiple Locations
Salary: 102100.00 - 219200.00 USD / Year
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Passionate about distributed systems and working with highly scalable services
  • Enjoys new technological challenges and is motivated to solve them
  • Excited about making better software and continuously improving the development, integration, and deployment processes
  • Self-starter who thrives in a bottoms-up, fast-paced, highly technical environment
  • Effective collaborator, experienced in creating technical partnerships across teams
  • Committed to ensuring exceptional customer satisfaction through technical excellence
  • Candidates must be able to meet the Microsoft, customer, and/or government security screening requirements for this role
  • The successful candidate must have an active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI)
Job Responsibility
  • Creates and implements code for a product, service, or feature, reusing code as applicable with minimal supervision
  • Acts as a designated responsible individual (DRI), working on-call to monitor a system/product feature/service for degradation, downtime, or interruptions
  • Maintains operations of the live site service on a rotational, on-call basis, responding quickly to mitigate issues that arise while following security best practices and using the minimum permissions required to do so
  • Contributes to identifying dependencies, and incorporates them into the development of design documents for a product area with little oversight
  • Contributes to the identification of requirements for, and development of automation within production and deployment of a complex product feature, targeting zero-touch deployment when possible
  • Works with appropriate internal stakeholders to understand and determine customer/user requirements for a set of features
  • Remains current in skills by investing time and effort into being informed of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale
What we offer
  • Certain roles may be eligible for benefits and other compensation
  • Fulltime

Site Reliability Engineer

We are looking for a Site Reliability Engineer to support the stability, perform...
Location: United States, New York
Salary: Not provided
Robert Half (https://www.roberthalf.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related discipline, or equivalent practical experience in infrastructure or operations
  • Working knowledge of Linux and/or Windows server administration fundamentals
  • Understanding of core networking principles such as TCP/IP, DNS, VLANs, routing, and firewall concepts
  • Experience with at least one scripting or automation language such as Python, Bash, or PowerShell
  • Familiarity with cloud infrastructure concepts in at least one major platform, such as Azure or AWS
  • Exposure to automation and configuration tools such as Terraform or Ansible
  • Strong analytical thinking, troubleshooting ability, and a willingness to learn in a fast-moving technical environment
  • Clear written and verbal communication skills with the ability to document operational procedures effectively
Job Responsibility
  • Oversee the health of production platforms through monitoring tools, assist with incident response, and help refine alerts, dashboards, and issue tracking processes
  • Support day-to-day operations for infrastructure spanning on-premises facilities and cloud environments, including servers, storage, network components, and middleware services
  • Contribute to the administration of multi-cloud resources across platforms such as Azure and Amazon EC2, with involvement in compute, networking, storage, and identity-related tasks
  • Build and enhance automation solutions using Infrastructure as Code practices to streamline repeatable work and improve platform consistency
  • Participate in DevSecOps and GitOps processes by assisting with CI/CD workflows, configuration management, and policy adherence
  • Help strengthen cloud security by identifying configuration gaps, assisting with remediation efforts, and supporting vulnerability reduction initiatives
  • Join the on-call rotation, respond to operational events, and contribute to post-incident reviews focused on continuous improvement
  • Create and maintain runbooks, technical procedures, and system documentation to improve operational readiness and knowledge sharing
  • Assist with containerized and orchestrated environments, including platforms that use Kubernetes, to support scalable application operations
What we offer
  • medical
  • vision
  • dental
  • life and disability insurance
  • company 401(k) plan
  • Fulltime

Site Reliability Engineer

We are looking for a Site Reliability Engineer (SRE) to support reliable, high-p...
Location: United States, Novi
Salary: Not provided
Robert Half (https://www.roberthalf.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Information Technology, Computer Science, Computer Engineering, or comparable practical experience
  • At least 5 years of experience supporting production environments in a corporate, startup, or similarly fast-paced technical setting
  • Hands-on expertise with infrastructure as code, including Terraform, along with experience in cloud platforms and related services
  • Working knowledge of container technologies such as Docker and orchestration platforms like Kubernetes
  • Experience supporting live systems, participating in on-call rotations, and contributing to incident reviews and corrective actions
  • Proficiency with automation and scripting using Bash and Python to reduce manual operational effort
  • Strong communication skills with the ability to explain technical decisions and tradeoffs to cross-functional or non-technical stakeholders
  • Willingness and ability to travel to customer or plant locations as business needs require
Job Responsibility
  • Maintain dependable and secure production environments across plant-edge and cloud-based systems, with a focus on uptime, responsiveness, and operational stability
  • Design, refine, and support monitoring dashboards, alerting frameworks, and operational runbooks using tools such as Prometheus, Grafana, and modern telemetry solutions
  • Build and manage infrastructure through code using Terraform, applying version control standards, peer reviews, and controlled deployment processes
  • Create automation scripts and lightweight tools in Bash and Python to streamline routine operations, recovery procedures, backup workflows, and environment setup
  • Take part in incident response and on-call coverage, troubleshoot service disruptions, coordinate initial communication, and document follow-up actions through blameless reviews
  • Establish and measure service reliability indicators and objectives, helping stakeholders balance system dependability with release speed and operational risk
  • Support secure connectivity between factory networks and cloud resources by configuring and maintaining VPNs, routing, private networking, and access controls
  • Administer and optimize relational or time-series databases, including backup planning, replication, performance tuning, and long-term storage health
  • Contribute to CI/CD delivery practices by improving deployment pipelines, supporting controlled release strategies, and preparing rollback procedures when needed
  • Partner with controls, software, and data teams to enable reliable data flow from industrial systems and ensure safe deployment to edge infrastructure
What we offer
  • medical, vision, dental, and life and disability insurance
  • 401(k) plan

Site Reliability Engineer

Barclays is seeking a Site Reliability Engineer to join its Securitized Products...
Location: United States, Whippany
Salary: 170000.00 - 230000.00 USD / Year
Barclays (barclays.co.uk)
Expiration Date: Until further notice
Requirements
  • Programming or scripting experience (Python, Go, PowerShell, Bash, or similar) and SQL
  • Linux/Unix/Windows systems and systems engineering fundamentals
  • Knowledge of client-server architecture and scalability, including handling high traffic by distributing load across multiple backend servers
  • Performance monitoring and reducing latency in request-response cycles
  • Containers and orchestration (Docker, Kubernetes)
  • Networking (TCP/IP, DNS, HTTP, SFTP) and relational databases
  • Monitoring and observability tools (Geneos ITRS, Prometheus, Grafana, APM, Observe)
Job Responsibility
  • Development and delivery of high-quality software solutions by using industry aligned programming languages, frameworks, and tools
  • Cross-functional collaboration with product managers, designers, and other engineers to define software requirements, devise solution strategies, and ensure seamless integration and alignment with business objectives
  • Collaboration with peers, participation in code reviews, and promotion of a culture of code quality and knowledge sharing
  • Staying informed of industry technology trends and innovations, and actively contributing to the organization’s technology communities to foster a culture of technical excellence and growth
  • Adherence to secure coding practices to mitigate vulnerabilities, protect sensitive data, and ensure secure software solutions
  • Implementation of effective unit testing practices to ensure proper code design, readability, and reliability
What we offer
  • medical, dental and vision coverage
  • 401(k)
  • life insurance
  • other paid leave
  • incentive award
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
  • Fulltime

Site Reliability Engineer

Take full ownership of smartclip’s internal utility and platform tooling. Focus ...
Location: Germany, Berlin
Salary: Not provided
Smartclip Europe GmbH (smartclip.tv)
Expiration Date: Until further notice
Requirements
  • Apply an Observability Mindset: Implement a clear strategy for metrics, logs, and traces. Transform 'noisy alerts' into 'actionable insights.'
  • Embrace Ownership: Live the 'you build it, you run it' philosophy. Stop the ticket ping-pong and end the excuses.
Job Responsibility
  • Take full ownership of smartclip’s internal utility and platform tooling.
  • Focus your energy on the intersection of observability, automation, and developer infrastructure.
  • Evolve existing systems, research cutting-edge open-source alternatives, and implement them.
  • Invest in deep in-house expertise.
  • Understand our systems end-to-end, maintain total flexibility, and contribute back to the open-source ecosystem.
  • Build & Evolve: Operate and advance our observability stack (including Prometheus, Grafana, and Forgejo).
  • Go Open Source First: Replace 'buy' decisions with robust 'build & maintain' strategies.
  • Engineer the Platform: Design observability as a platform capability. Define SLOs and create actionable alerting to stop incidents before they start.
  • Secure the Stack: Embed security engineering into the delivery process. Find vulnerabilities before the pen tests do.
  • Master the Infrastructure: Navigate Linux systems and distributed tooling. Balance bold exploration with production stability.
What we offer
  • 30 days of vacation + Dec 24 & 31 off
  • Smart Fridays (4-day week possible)
  • mobility (Germany ticket & JobRad)
  • sports & health offerings
  • mental health support
  • corporate benefits
  • RTL+ access
  • Fulltime