Lead Service Reliability Engineer

Thoughtworks

Location:
Singapore, Singapore

Contract Type:
Not provided

Salary:

Not provided

Job Description:

As a Service Reliability Engineer (SRE) in the DAMO service line, you will take a multifaceted approach to ensure technical excellence and operational efficiency within the infrastructure domain. Specializing in reliability, resilience and system performance, you take a lead role in championing the principles of Site Reliability Engineering. By strategically integrating automation, monitoring and incident response, you facilitate the evolution from traditional operations to a more customer-focused and agile approach. Emphasizing shared responsibility and a commitment to continuous improvement, you cultivate a collaborative culture, enabling organizations to meet and exceed their reliability and business objectives.

Job Responsibility:

  • You will be responsible for understanding requirements or SRE goals in depth from both tech and business perspectives
  • You will provide solutions to improve reliability, including identifying and implementing mechanisms and architectures that enable fault tolerance and shorter median times to detect and respond
  • You will be responsible for enhancing the incident management process, including the development of an incident prioritization matrix, triage, communication, mitigation, post-mortem analysis and implementation of corrective actions
  • You will manage client stakeholder expectations and queries during production incidents, providing detailed technical analysis of issues and remediation plans for mitigation and future prevention, and act as the interface for C-level executives if and when needed
  • You will be a liaison with client engineering teams, build trust and productive relationships with senior client stakeholders and team leads to influence them in making better decisions
  • You will be responsible for identifying opportunities for enhancing system performance and reliability in alignment with business SLAs, SLOs, KPIs and objectives, and provide guidance and assistance to SRE teams in implementing the identified improvements
  • As an SRE expert, you will collaborate with Thoughtworks application development leads and solution architects, recommending changes in system design and adopting best practices for improved reliability from day one
  • You will oversee and mentor other SREs on the team, contributing to their growth and development

Requirements:

  • You can program with one or more high-level languages such as Python, Golang, Shell scripting, Ruby or Java
  • You are familiar with DevOps and GitOps practices, driving the integration of observability automation into CI/CD pipelines, e.g. GitLab, Jenkins, CircleCI or equivalent
  • You have in-depth knowledge of configuration management and Infrastructure as Code (IaC) tools such as Terraform, Ansible, ARM and CloudFormation for provisioning and managing infrastructure
  • You have expertise in observability, logs, tracing and monitoring tools such as Grafana (Loki and Tempo), Prometheus, Graylog, Jaeger, Zipkin, ELK stack or equivalent
  • You have a strong understanding of container-based architecture and hands-on experience with orchestration tools such as Kubernetes, AWS EKS, Docker Swarm, Nomad, etc.
  • You have in-depth experience in application and infrastructure performance tuning and scaling to handle heavy loads under different scenarios, e.g. periodic traffic load and tsunami patterns
  • You have a good understanding of essential concepts such as quality gates encompassing SLI/SLO/SLA, chaos engineering, golden signals, blameless postmortem methodologies, synthetic monitoring, distributed tracing, end-user monitoring and performance testing
  • You have experience with network load balancing, security tech stacks, Transport Layer Security (TLS) and certificate management, and an understanding of standard networking protocols and configurations
  • You have strong communication and articulation skills, and are proficient in English
  • You are able to convey resolutions to audiences with varying degrees of technical/business proficiency and bring them to consensus
  • You have excellent problem-solving and analytical skills, with a focus on continuous improvement
  • You have good listening and presentation skills
  • You solve challenging problems and difficult-to-debug issues with a never-give-up attitude
  • You can collaborate with cross-functional engineering teams to conduct capacity planning and scalability assessments, and design solutions for handling current and future growth
  • You have the ability to work under pressure, with composure, during production incidents
  • You understand requirements provided by the client on both technical and business aspects, and can break them down for successful implementation
  • You’re willing to be part of a rotation- and need-based, 24x7 available team
  • Candidates must be Singaporean citizens or already hold Singaporean Permanent Residency (PR) at the time of application

What we offer:

  • There is no one-size-fits-all career path
  • Your career is supported by interactive tools, numerous development programs and teammates who want to help you grow

Additional Information:

Job Posted:
January 12, 2026

Employment Type:
Full-time