Senior Service Reliability Engineer Job at Thoughtworks (Singapore)

Job Description

As a Service Reliability Engineer (SRE) in the DAMO service line, you will take a multifaceted approach to ensure technical excellence and operational efficiency within the infrastructure domain. Specializing in reliability, resilience and system performance, you take a lead role in championing the principles of Site Reliability Engineering. By strategically integrating automation, monitoring and incident response, you facilitate the evolution from traditional operations to a more customer-focused and agile approach. Emphasizing shared responsibility and a commitment to continuous improvement, you cultivate a collaborative culture, enabling organizations to meet and exceed their reliability and business objectives.

Job Responsibilities

You will conduct SRE and Disaster Recovery (DR) maturity assessments
You will engineer automation solutions using Ansible to replace manual workflows
You will own and manage the current manual Disaster Recovery process/pipeline
You will improve site reliability through mechanisms and architectures that enhance fault tolerance and reduce MTTR/MTTD
You will drive the integration of observability automation into the CI/CD pipeline
You will handle production incidents, lead client communication, and create root cause analysis documentation
You will monitor performance of production systems and improve scaling to meet SLA and SLO targets
You will work closely with application development teams to advise and implement reliability improvements
You will improve system observability across logging, metrics and alerting, reducing false alarms to eliminate unnecessary toil and improve overall process efficiency
You will implement chaos engineering practices to regularly validate system reliability
You have a clear understanding of client goals and business needs, setting direction for site reliability in alignment with business expectations, including high availability targets such as 99.999% with minimal or no disruption where required

Requirements

You have expertise in Ansible orchestration including advanced strategies, failure logic handling, and Jinja2 templating
You have the ability to integrate Terraform with Ansible for seamless provisioning-to-configuration workflows
You have hands-on experience with Python, Go, Bash or PowerShell scripting
You have working knowledge of at least one public cloud (AWS/Azure/GCP)
You have experience with observability tools (Grafana, Datadog, New Relic, ELK, Dynatrace, etc.) and can use data for root cause analysis (RCA)
You have familiarity with DevOps, SRE and GitOps concepts and practices
You have knowledge of container technologies and orchestration (Kubernetes, EKS, Docker Swarm, Nomad, etc.)
You have an understanding of modern architecture (microservices, serverless, NoSQL, REST APIs) and experience debugging and building metrics/dashboards
You have experience designing infrastructure aligned with Cloud Well-Architected principles (reliability, security, cost, performance, operations)
You are able to mentor team members through workshops and knowledge enablement
You are able to create comprehensive documentation and runbooks
You have strong communication and articulation skills in English
You have strong collaboration and negotiation skills with client and cross-functional teams
You have a resilient problem-solving mindset and don’t give up easily when debugging issues
You can remain calm and composed during high-pressure production incidents
You can recommend improvements backed by strong technical reasoning
You can understand both business and technical requirements and break them down into deliverables
You have strong ownership and willingness to take responsibility beyond strict role boundaries
You are willing to participate in rotation-based or need-based 24x7 availability support
Candidates must be Singaporean citizens or already hold Singaporean Permanent Residency (PR) at the time of application.

What we offer

Learning & Development: There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.
