
Service Reliability Engineer

India, Bengaluru · Job Posted June 29, 2026

Job Responsibility

  • Deploy and manage highly available hybrid systems on on-premises and cloud platforms such as AWS, focusing on Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) offerings
  • Deploy and manage containerized applications using Docker, and orchestrate them with Docker Compose or Kubernetes for scalability and resilience
  • Manage installations, configurations, and upgrades
  • Troubleshoot outages and incidents
  • Implement Infrastructure as Code practices using tools like Terraform to automate cloud infrastructure provisioning and management
  • Improve operational efficiency by automating routine application tasks using Python and shell scripting
  • Design, implement, and maintain CI/CD pipelines to streamline application deployment processes, ensuring high-quality software delivery
  • Modernize existing infrastructure and applications by integrating new technologies and cloud-native solutions
  • Actively participate in scaling, performance tuning, and capacity planning of the enterprise stack, including Single Sign-On and SSL keystore management
  • Conduct application server hardening to enhance security against potential threats
  • Create and maintain comprehensive documentation for system configurations, procedures, and best practices to ensure knowledge transfer and compliance
  • Ensure robust monitoring processes are in place and maintain compliance with production security standards

Requirements

  • Bachelor's degree and eight years of relevant experience or a combination of education and relevant experience
  • Hands-on experience with Kubernetes resources (Deployments, Services, Ingress, storage volumes, ConfigMaps) with EKS/Karpenter
  • Proficiency with AWS services: Auto Scaling/EC2, AMI, Security Groups, ALB, S3, VPC
  • Experience with diverse middleware technologies (Tomcat, WebLogic, Apache, etc.) on bare metal and in Docker containers
  • Experience with Infrastructure as Code tools such as Terraform and with container orchestration utilities
  • Demonstrated cloud infrastructure experience, including building full-stack infrastructure for enterprise-ready applications
  • Experience installing, configuring, and supporting Oracle Database
  • Experience with version control systems (Git, SVN) and CI/CD tools
  • Proficiency in programming and scripting languages, especially Python and Shell
  • Strong working knowledge of Linux-based systems

Similar Jobs for Service Reliability Engineer

8 matching positions

Principal Service Reliability Engineer

We are seeking a Principal Service Reliability Engineer (SRE) to lead the reliab...
Location: United States, Redmond
Salary: 142800.00 - 304200.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • 8+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • Proven track record of defining and operationalizing SLOs, SLIs, and error budgets across multiple services or organizations
  • Experience leading reliability efforts for enterprise-scale or globally distributed systems
  • Advanced debugging and troubleshooting skills across application, platform, and infrastructure layers
  • Demonstrated ability to mentor senior engineers and influence engineering culture at scale
  • Experience driving platform-level improvements (e.g., standardized observability, shared reliability tooling, automated remediation frameworks)
  • Extensive experience operating large-scale, distributed production systems, including cloud-native platforms (Azure preferred)
  • Demonstrated ability to drive cross-team technical initiatives and influence architecture and engineering standards
  • Deep experience in observability, incident management, and production operations at scale
  • Strong understanding of Azure networking, distributed systems performance, and reliability engineering principles
Job Responsibility
  • Define and drive reliability strategy across services, including measurable targets for availability, latency, and performance aligned to business priorities
  • Establish and enforce SLO/SLI frameworks and error budgets across multiple teams, ensuring consistent adoption and accountability
  • Lead complex incident management and systemic RCA efforts, identifying cross-service failure patterns and driving durable, long-term fixes
  • Influence architecture and platform design to improve operability, scalability, fault isolation, and disaster recovery at enterprise scale
  • Drive reliability engineering standards for observability (metrics, logs, traces), capacity planning, and production readiness across the organization
  • Eliminate operational toil through automation, enabling self-healing systems and reducing manual intervention
  • Embed security, compliance, and resiliency practices into design and operational processes, ensuring alignment with enterprise requirements
  • Partner with engineering leadership to prioritize reliability investments and balance feature velocity with system stability
  • Lead and mentor engineers while shaping a strong reliability culture across teams and org boundaries
Employment type: Full-time

Field Service Reliability Engineer

Founded in 1985, ATS is a company with a presence in the United States, Mexico a...
Location: United States, Hammond
Salary: Not provided
Advanced Technology Products
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering (ABET accredited)
  • Eight or more years of reliability experience across 2 or more manufacturing sites
  • Demonstrates ability to perform full array of reliability tool sets
  • Strong technical understanding of electrical or mechanical components, tools, and designs
  • Ability to complete a failure mode effects analysis, cause and effect diagrams, root cause failure analysis, life-cycle costing, and risk analysis
  • Ability to research and apply new equipment technology / trends
  • Robust problem solving, mathematical, analytical, and decision making skills
  • Proficiency with computers, maintenance systems, and applications, including Microsoft Office
  • Excellent verbal communication, facilitation, and presentation skills
  • Strong reporting and technical writing capability
Job Responsibility
  • Extensive travel required. (Local, National, International)
  • Promotes and adheres to the ATS safety culture
  • Engages in various work environments and industries to lead reliability centered maintenance efforts
  • Mentors, coaches, and provides reliability best practices for applications in customer facilities, by customer personnel
  • Identifies top potential issues leading to lost production and preventable maintenance spending. Communicates findings with leadership
  • Provides solutions to root cause deficiencies and demonstrates economic benefits to their correction
  • Actively drives the implementation of equipment improvement projects
  • Identifies and implements current and new processes / technologies to increase equipment performance and uptime
  • Champions systems and best practice procedures towards a proactive manufacturing culture
  • Analyzes equipment performance, failure data, and corrective maintenance history to develop and deploy engineering solutions, improved maintenance strategies, preventative maintenance optimization, and other reliability techniques
Employment type: Full-time

Field Service Reliability Engineer

Founded in 1985, ATS is a company with a presence in the United States, Mexico a...
Location: United States, Hammond, Indiana
Salary: Not provided
Advanced Technology Products
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering (ABET accredited)
  • Eight or more years of reliability experience across 2 or more manufacturing sites
  • Demonstrates ability to perform full array of reliability tool sets
  • Strong technical understanding of electrical or mechanical components, tools, and designs
  • Ability to complete a failure mode effects analysis, cause and effect diagrams, root cause failure analysis, life-cycle costing, and risk analysis
  • Ability to research and apply new equipment technology / trends
  • Robust problem solving, mathematical, analytical, and decision making skills
  • Proficiency with computers, maintenance systems, and applications, including Microsoft Office
  • Excellent verbal communication, facilitation, and presentation skills
  • Strong reporting and technical writing capability
Job Responsibility
  • Extensive travel required. (Local, National, International)
  • Promotes and adheres to the ATS safety culture
  • Engages in various work environments and industries to lead reliability centered maintenance efforts
  • Mentors, coaches, and provides reliability best practices for applications in customer facilities, by customer personnel
  • Identifies top potential issues leading to lost production and preventable maintenance spending. Communicates findings with leadership
  • Provides solutions to root cause deficiencies and demonstrates economic benefits to their correction
  • Actively drives the implementation of equipment improvement projects
  • Identifies and implements current and new processes / technologies to increase equipment performance and uptime
  • Champions systems and best practice procedures towards a proactive manufacturing culture
  • Analyzes equipment performance, failure data, and corrective maintenance history to develop and deploy engineering solutions, improved maintenance strategies, preventative maintenance optimization, and other reliability techniques
Employment type: Full-time

Lead Service Reliability Engineer

As Service Reliability Engineer (SRE) in DAMO service line, you will take a mult...
Location: Singapore, Singapore
Salary: Not provided
Thoughtworks
Expiration Date: Until further notice
Requirements
  • You can program with one or more high-level languages such as Python, Golang, Shell scripting, Ruby or Java
  • You are familiar with DevOps and GitOps practices, driving the integration of observability automation into CI/CD pipelines, e.g.: GitLab, Jenkins, CircleCI or equivalent
  • You have in-depth knowledge of configuration management and Infrastructure as Code (IAC) tools such as Terraform, Ansible, ARM and CloudFormation for provisioning and managing infrastructure
  • You have an expertise in observability, logs, tracing and monitoring tools such as Grafana (Loki and Tempo), Prometheus, Graylog, Jaeger, Zipkin, ELK stack or equivalent
  • You have a strong understanding of container-based architecture and hands-on experience with orchestration tools such as Kubernetes, AWS EKS, Docker Swarm, Nomad, etc
  • You have in-depth experience in application and infrastructure performance tuning and scaling to handle heavy loads under different scenarios e.g.: Periodic traffic load and tsunami patterns
  • You have a good understanding of essential concepts such as quality gates encompassing SLI/SLO/SLA, chaos engineering, golden signals, blameless postmortem methodologies, synthetic monitoring, distributed tracing, end-user monitoring and performance testing
  • You have experience with network load balancing, security tech stacks, Transport Layer Security (TLS) and certificate management, and an understanding of standard networking protocols and configurations
  • You have strong communication and articulation skills, and are proficient in English
  • You are able to convey resolutions to audiences with varying degrees of technical/business proficiency and bring them to consensus
Job Responsibility
  • You will be responsible for understanding requirements or SRE goals in depth from both tech and business perspectives
  • You will provide solutions to improve reliability, including identifying and implementing mechanisms and architectures that enable fault tolerance and faster median time to respond and median time to detect
  • You will be responsible for enhancing the incident management process, including the development of an incident prioritization matrix, triage, communication, mitigation, post-mortem analysis and implementation of corrective actions
  • You will manage client stakeholder expectations and queries during production incidents, providing detailed technical analysis of issues and remediation plans for mitigation and prevention in future, and act as the interface for C-level executives, if or when needed
  • You will be a liaison with client engineering teams, build trust and productive relationships with senior client stakeholders and team leads to influence them in making better decisions
  • You will be responsible for identifying opportunities for enhancing system performance and reliability in alignment with business SLAs, SLOs, KPIs and objectives, and provide guidance and assistance to SRE teams in implementing the identified improvements
  • As an SRE expert, you will collaborate with Thoughtworks application development leads and solution architects, recommending changes in system design and adopting best practices for improved reliability from day one
  • You will oversee and mentor other SREs on the team, contributing to their growth and development
What we offer
  • There is no one-size-fits-all career path
  • Your career is supported by interactive tools, numerous development programs and teammates who want to help you grow
Employment type: Full-time

Senior Service Reliability Engineer

As a Service Reliability Engineer (SRE) in DAMO service line, you will take a mu...
Location: Singapore, Singapore
Salary: Not provided
Thoughtworks
Expiration Date: Until further notice
Requirements
  • You have expertise in Ansible orchestration including advanced strategies, failure logic handling, and Jinja2 templating
  • You have the ability to integrate Terraform with Ansible for seamless provisioning-to-configuration workflows
  • You have hands-on experience with Python, Go, Bash or PowerShell scripting
  • You have working knowledge of at least one public cloud (AWS/Azure/GCP)
  • You have experience with observability tools (Grafana, Datadog, NewRelic, ELK, Dynatrace, etc.) and can use data for RCA
  • You have familiarity with DevOps, SRE and GitOps concepts and practices
  • You have knowledge of container technologies and orchestration (Kubernetes, EKS, Docker Swarm, Nomad, etc.)
  • You have understanding of modern architecture (microservices, serverless, NoSQL, REST APIs) and experience debugging and building metrics/dashboards
  • You have experience designing infrastructure aligned with Cloud Well-Architected principles (reliability, security, cost, performance, operations)
  • You are able to mentor team members through workshops and knowledge enablement
Job Responsibility
  • You will conduct SRE and Disaster Recovery (DR) maturity assessments
  • You will engineer automation solutions using Ansible to replace manual workflows
  • You will own and manage the current manual Disaster Recovery process/pipeline
  • You will improve site reliability through mechanisms and architectures that enhance fault tolerance and reduce MTTR/MTTD
  • You will drive the integration of observability automation into the CI/CD pipeline
  • You will handle production incidents, lead client communication, and create root cause analysis documentation
  • You will monitor performance of production systems and improve scaling to meet SLA and SLO targets
  • You will work closely with application development teams to advise and implement reliability improvements
  • You will improve system observability across logging, metrics and alerting, reducing false alarms to eliminate unnecessary toil and improving overall process efficiency, while implementing chaos engineering practices to regularly validate system reliability
  • You have a clear understanding of client goals and business needs, setting direction for site reliability in alignment with business expectations - including high availability targets such as 99.999% with minimal/no disruption where required.
What we offer
  • Learning & Development: There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.
Employment type: Full-time

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...
Location: Ireland, Dublin
Salary: Not provided
General Motors
Expiration Date: Until further notice
Requirements
  • 8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
  • Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
  • Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
  • Proficiency in Python, Go, or Java, with strong code review and readability standards
  • Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
  • Ability to think and act under pressure
  • Strong communication skills
Job Responsibility
  • Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
  • Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
  • Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
  • Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
  • Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
  • Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
  • Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
  • Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling
Employment type: Full-time

Cloud Engineer / Site Reliability Engineer (SRE)

Location: United States, Orlando
Salary: 75.00 USD / Hour
Beacon Hill
Expiration Date: Until further notice
Requirements
  • Strong hands-on AWS experience with solid understanding of core AWS services
  • Experience supporting and troubleshooting AWS and Azure cloud environments
  • Terraform experience for Infrastructure as Code
  • Docker/containerization experience
  • Strong troubleshooting and problem-solving skills
  • Ability to translate requirements into technical execution
  • Experience performing cloud architecture and diagramming
  • Experience supporting deployments, environments, and site standups
  • Strong communication and collaboration skills
Job Responsibility
  • Support cloud infrastructure and deployments across AWS and Azure
  • Troubleshoot infrastructure and application-related cloud issues
  • Build and maintain Terraform-based infrastructure
  • Support Docker/containerized environments
  • Create architecture diagrams and technical documentation
  • Work closely with engineering and project teams to execute cloud initiatives
  • Assist with automation and operational improvement efforts
Employment type: Full-time

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...
Location: United States, Reston
Salary: Not provided
Tier4 Group
Expiration Date: Until further notice
Requirements
  • 5+ years hands-on operating and managing Kubernetes and OpenShift clusters
  • Strong experience with Microsoft Azure (compute, networking, storage, and data services)
  • Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
  • Proficiency with observability tooling (Datadog, Prometheus, Grafana)
  • Scripting/coding ability in Bash, Python, or Go
Job Responsibility
  • Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
  • Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks
  • Map current hybrid topology and critical delivery pipelines
  • Identify toil and prioritize automation (Terraform/Ansible)
  • Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
  • Drive GitOps-first workflows
  • Harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
  • Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
  • Lead incident response and postmortems
  • Institutionalize RCA, blameless learning, and continuous improvement
Employment type: Full-time