Service Reliability Engineer Job at Nichi-In Software Solutions Pvt. Ltd. (Bengaluru)

Principal Service Reliability Engineer

We are seeking a Principal Service Reliability Engineer (SRE) to lead the reliab...

Location

United States, Redmond

Salary:

142800.00 - 304200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

8+ years technical experience in software engineering, network engineering, or systems administration
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration
OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration
OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
Proven track record of defining and operationalizing SLOs, SLIs, and error budgets across multiple services or organizations
Experience leading reliability efforts for enterprise-scale or globally distributed systems
Advanced debugging and troubleshooting skills across application, platform, and infrastructure layers
Demonstrated ability to mentor senior engineers and influence engineering culture at scale
Experience driving platform-level improvements (e.g., standardized observability, shared reliability tooling, automated remediation frameworks)
Extensive experience operating large-scale, distributed production systems, including cloud-native platforms (Azure preferred)
Demonstrated ability to drive cross-team technical initiatives and influence architecture and engineering standards
Deep experience in observability, incident management, and production operations at scale
Strong understanding of Azure networking, distributed systems performance, and reliability engineering principles

Job Responsibility

Define and drive reliability strategy across services, including measurable targets for availability, latency, and performance aligned to business priorities
Establish and enforce SLO/SLI frameworks and error budgets across multiple teams, ensuring consistent adoption and accountability
Lead complex incident management and systemic RCA efforts, identifying cross-service failure patterns and driving durable, long-term fixes
Influence architecture and platform design to improve operability, scalability, fault isolation, and disaster recovery at enterprise scale
Drive reliability engineering standards for observability (metrics, logs, traces), capacity planning, and production readiness across the organization
Eliminate operational toil through automation, enabling self-healing systems and reducing manual intervention
Embed security, compliance, and resiliency practices into design and operational processes, ensuring alignment with enterprise requirements
Partner with engineering leadership to prioritize reliability investments and balance feature velocity with system stability
Lead and mentor engineers while shaping a strong reliability culture across teams and org boundaries

Fulltime

Field Service Reliability Engineer

Founded in 1985, ATS is a company with a presence in the United States, Mexico a...

Location

United States, Hammond, Indiana

Salary:

Not provided

Advanced Technology Products

Expiration Date

Until further notice

Requirements

Bachelor’s degree in engineering (ABET accredited)
Eight or more years of reliability experience across 2 or more manufacturing sites
Demonstrated ability to apply the full array of reliability tool sets
Strong technical understanding of electrical or mechanical components, tools, and designs
Ability to complete a failure mode effects analysis, cause and effect diagrams, root cause failure analysis, life-cycle costing, and risk analysis
Ability to research and apply new equipment technology / trends
Robust problem solving, mathematical, analytical, and decision making skills
Proficiency with computers, maintenance systems, and applications, including Microsoft Office
Excellent verbal communication, facilitation, and presentation skills
Strong reporting and technical writing capability

Job Responsibility

Extensive travel required (local, national, international)
Promotes and adheres to the ATS safety culture
Engages in various work environments and industries to lead reliability centered maintenance efforts
Mentors and coaches customer personnel, and provides reliability best practices for application in customer facilities
Identifies the top issues leading to lost production and preventable maintenance spending, and communicates findings to leadership
Provides solutions to root cause deficiencies and demonstrates the economic benefits of correcting them
Actively drives the implementation of equipment improvement projects
Identifies and implements current and new processes / technologies to increase equipment performance and uptime
Champions systems and best practice procedures towards a proactive manufacturing culture
Analyzes equipment performance, failure data, and corrective maintenance history to develop and deploy engineering solutions, improved maintenance strategies, preventative maintenance optimization, and other reliability techniques

Fulltime

Lead Service Reliability Engineer

As Service Reliability Engineer (SRE) in DAMO service line, you will take a mult...

Location

Singapore, Singapore

Salary:

Not provided

Thoughtworks

Expiration Date

Until further notice

Requirements

You can program with one or more high-level languages such as Python, Golang, Shell scripting, Ruby or Java
You are familiar with DevOps and GitOps practices, driving the integration of observability automation into CI/CD pipelines, e.g., GitLab, Jenkins, CircleCI or equivalent
You have in-depth knowledge of configuration management and Infrastructure as Code (IaC) tools such as Terraform, Ansible, ARM and CloudFormation for provisioning and managing infrastructure
You have expertise in observability, logging, tracing and monitoring tools such as Grafana (Loki and Tempo), Prometheus, Graylog, Jaeger, Zipkin, ELK stack or equivalent
You have a strong understanding of container-based architecture and hands-on experience with orchestration tools such as Kubernetes, AWS EKS, Docker Swarm, Nomad, etc.
You have in-depth experience in application and infrastructure performance tuning and scaling to handle heavy loads under different scenarios, e.g., periodic traffic load and tsunami patterns
You have a good understanding of essential concepts such as quality gates encompassing SLI/SLO/SLA, chaos engineering, golden signals, blameless postmortem methodologies, synthetic monitoring, distributed tracing, end-user monitoring and performance testing
You have experience with network load balancing, security tech stacks, Transport Layer Security (TLS) and certificate management, and an understanding of standard networking protocols and configurations
You have strong communication and articulation skills, and are proficient in English
You are able to convey resolutions to audiences with varying degrees of technical/business proficiency and bring them to consensus

Job Responsibility

You will be responsible for understanding requirements or SRE goals in depth from both tech and business perspectives
You will provide solutions to improve reliability, including identifying and implementing mechanisms and architectures that enable fault tolerance and reduce mean time to respond (MTTR) and mean time to detect (MTTD)
You will be responsible for enhancing the incident management process, including the development of an incident prioritization matrix, triage, communication, mitigation, post-mortem analysis and implementation of corrective actions
You will manage client stakeholder expectations and queries during production incidents, providing detailed technical analysis of issues and remediation plans for mitigation and future prevention, and act as the interface to C-level executives if and when needed
You will be a liaison with client engineering teams, build trust and productive relationships with senior client stakeholders and team leads to influence them in making better decisions
You will be responsible for identifying opportunities for enhancing system performance and reliability in alignment with business SLAs, SLOs, KPIs and objectives, and provide guidance and assistance to SRE teams in implementing the identified improvements
As an SRE expert, you will collaborate with Thoughtworks application development leads and solution architects, recommending changes in system design and adopting best practices for improved reliability from day one
You will oversee and mentor other SREs on the team, contributing to their growth and development

What we offer

There is no one-size-fits-all career path
Your career is supported by interactive tools, numerous development programs and teammates who want to help you grow

Fulltime

Senior Service Reliability Engineer

As a Service Reliability Engineer (SRE) in DAMO service line, you will take a mu...

Location

Singapore, Singapore

Salary:

Not provided

Thoughtworks

Expiration Date

Until further notice

Requirements

You have expertise in Ansible orchestration including advanced strategies, failure logic handling, and Jinja2 templating
You have the ability to integrate Terraform with Ansible for seamless provisioning-to-configuration workflows
You have hands-on experience with Python, Go, Bash or PowerShell scripting
You have working knowledge of at least one public cloud (AWS/Azure/GCP)
You have experience with observability tools (Grafana, Datadog, New Relic, ELK, Dynatrace, etc.) and can use data for RCA
You have familiarity with DevOps, SRE and GitOps concepts and practices
You have knowledge of container technologies and orchestration (Kubernetes, EKS, Docker Swarm, Nomad, etc.)
You have an understanding of modern architecture (microservices, serverless, NoSQL, REST APIs) and experience debugging and building metrics/dashboards
You have experience designing infrastructure aligned with Cloud Well-Architected principles (reliability, security, cost, performance, operations)
You are able to mentor team members through workshops and knowledge enablement

Job Responsibility

You will conduct SRE and Disaster Recovery (DR) maturity assessments
You will engineer automation solutions using Ansible to replace manual workflows
You will own and manage the current manual Disaster Recovery process/pipeline
You will improve site reliability through mechanisms and architectures that enhance fault tolerance and reduce MTTR/MTTD
You will drive the integration of observability automation into the CI/CD pipeline
You will handle production incidents, lead client communication, and create root cause analysis documentation
You will monitor performance of production systems and improve scaling to meet SLA and SLO targets
You will work closely with application development teams to advise and implement reliability improvements
You will improve system observability across logging, metrics and alerting, reducing false alarms to eliminate unnecessary toil and improving overall process efficiency, while implementing chaos engineering practices to regularly validate system reliability
You have a clear understanding of client goals and business needs, setting direction for site reliability in alignment with business expectations, including high availability targets such as 99.999% with minimal/no disruption where required

What we offer

Learning & Development: There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.

Fulltime

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...

Location

Ireland, Dublin

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
Proficiency in Python, Go, or Java, with strong code review and readability standards
Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
Ability to think and act under pressure
Strong communication skills

Job Responsibility

Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling

Fulltime

Cloud Engineer / Site Reliability Engineer (SRE)

Location

United States, Orlando

Salary:

75.00 USD / Hour

Beacon Hill

Expiration Date

Until further notice

Requirements

Strong hands-on AWS experience with solid understanding of core AWS services
Experience supporting and troubleshooting AWS and Azure cloud environments
Terraform experience for Infrastructure as Code
Docker/containerization experience
Strong troubleshooting and problem-solving skills
Ability to translate requirements into technical execution
Experience performing cloud architecture and diagramming
Experience supporting deployments, environments, and site standups
Strong communication and collaboration skills

Job Responsibility

Support cloud infrastructure and deployments across AWS and Azure
Troubleshoot infrastructure and application-related cloud issues
Build and maintain Terraform-based infrastructure
Support Docker/containerized environments
Create architecture diagrams and technical documentation
Work closely with engineering and project teams to execute cloud initiatives
Assist with automation and operational improvement efforts

Fulltime

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...

Location

United States, Reston

Salary:

Not provided

Tier4 Group

Expiration Date

Until further notice

Requirements

5+ years of hands-on experience operating and managing Kubernetes and OpenShift clusters
Strong experience with Microsoft Azure (compute, networking, storage, and data services)
Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
Proficiency with observability tooling (Datadog, Prometheus, Grafana)
Scripting/coding ability in Bash, Python, or Go

Job Responsibility

Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks
Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement

Fulltime
