Senior Service Reliability Engineer

Singapore, Singapore · Job Posted January 12, 2026

Job Description

As a Service Reliability Engineer (SRE) in the DAMO service line, you will take a multifaceted approach to ensuring technical excellence and operational efficiency within the infrastructure domain. Specializing in reliability, resilience and system performance, you will take a lead role in championing the principles of Site Reliability Engineering. By strategically integrating automation, monitoring and incident response, you will facilitate the evolution from traditional operations to a more customer-focused and agile approach. Emphasizing shared responsibility and a commitment to continuous improvement, you will cultivate a collaborative culture that enables organizations to meet and exceed their reliability and business objectives.

Job Responsibility

  • You will conduct SRE and Disaster Recovery (DR) maturity assessments
  • You will engineer automation solutions using Ansible to replace manual workflows
  • You will own and manage the current manual Disaster Recovery process/pipeline
  • You will improve site reliability through mechanisms and architectures that enhance fault tolerance and reduce MTTR/MTTD
  • You will drive the integration of observability automation into the CI/CD pipeline
  • You will handle production incidents, lead client communication, and create root cause analysis documentation
  • You will monitor performance of production systems and improve scaling to meet SLA and SLO targets
  • You will work closely with application development teams to advise and implement reliability improvements
  • You will improve system observability across logging, metrics and alerting, reducing false alarms to cut unnecessary toil and improve process efficiency, and you will implement chaos engineering practices to regularly validate system reliability
  • You have a clear understanding of client goals and business needs, setting direction for site reliability in alignment with business expectations, including high availability targets such as 99.999% with minimal or no disruption where required

Requirements

  • You have expertise in Ansible orchestration including advanced strategies, failure logic handling, and Jinja2 templating
  • You have the ability to integrate Terraform with Ansible for seamless provisioning-to-configuration workflows
  • You have hands-on experience with Python, Go, Bash or PowerShell scripting
  • You have working knowledge of at least one public cloud (AWS/Azure/GCP)
  • You have experience with observability tools (Grafana, Datadog, New Relic, ELK, Dynatrace, etc.) and can use data for RCA
  • You have familiarity with DevOps, SRE and GitOps concepts and practices
  • You have knowledge of container technologies and orchestration (Kubernetes, EKS, Docker Swarm, Nomad, etc.)
  • You have an understanding of modern architecture (microservices, serverless, NoSQL, REST APIs) and experience debugging and building metrics/dashboards
  • You have experience designing infrastructure aligned with Cloud Well-Architected principles (reliability, security, cost, performance, operations)
  • You are able to mentor team members through workshops and knowledge enablement
  • You are able to create comprehensive documentation and runbooks
  • You have strong communication and articulation skills in English
  • You have strong collaboration and negotiation skills with client and cross-functional teams
  • You have a resilient problem-solving mindset and don’t give up easily when debugging issues
  • You can remain calm and composed during high-pressure production incidents
  • You can recommend improvements backed by strong technical reasoning
  • You can understand both business and technical requirements and break them down into deliverables
  • You have strong ownership and willingness to take responsibility beyond strict role boundaries
  • You are willing to participate in rotation-based or need-based 24x7 availability support
  • Candidates must be Singaporean citizens or already hold Singaporean Permanent Residency (PR) at the time of application.

What we offer

Learning & Development: There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.

Similar Jobs for Senior Service Reliability Engineer

8 matching positions

Senior Reliability Engineer

Barbaricum is seeking an experienced Senior Site Reliability Engineer to support...
Location: United States, Washington
Salary: Not provided
Company: Barbaricum
Expiration Date: Until further notice
Requirements
  • Expert knowledge of site reliability engineering practices, system monitoring, incident management, automation, performance tuning, and operational resilience
  • Strong understanding of Windows and Linux administration, infrastructure operations, system configuration, service management, and troubleshooting practices
  • Experience with automation platforms and configuration management tools such as Ansible, Puppet, Chef, or similar technologies
  • Proficiency with scripting languages such as Python, Shell, PowerShell, or similar tools used to automate operational and infrastructure tasks
  • Knowledge of cloud services and infrastructure across AWS, Microsoft Azure, Google Cloud, or comparable cloud environments
  • Strong understanding of network troubleshooting, configuration, connectivity analysis, system dependencies, and performance bottleneck identification
  • Ability to design, interpret, and maintain dashboards, alerts, metrics, logs, and operational reporting that support service health and decision-making
  • Ability to conduct root cause analysis, post-incident reviews, and corrective action planning in complex technical environments
  • Strong problem-solving skills and the ability to work under pressure during outages, impairments, and time-sensitive operational issues
  • Excellent written and verbal communication skills, with the ability to explain technical findings, incident impacts, and reliability recommendations to technical and non-technical stakeholders
Job Responsibility
  • Monitor and maintain system reliability, availability, and performance across on-premises, cloud, and hybrid IT environments supporting MC&FP mission requirements
  • Implement proactive performance monitoring, automated alerting, incident response workflows, and resilience engineering practices to reduce downtime and improve operational visibility
  • Develop, maintain, and improve scalable automated infrastructure solutions that support reliable system operations and repeatable service delivery
  • Implement rollback strategies, recovery approaches, and chaos engineering practices to validate resilience, reduce operational risk, and improve system stability
  • Analyze usage patterns, capacity trends, and performance indicators to support dynamic scaling, resource optimization, and system improvement decisions
  • Develop and maintain real-time operational dashboards, reports, and metrics that enable rapid decision-making, leadership awareness, and system optimization
  • Respond to and resolve system outages, impairments, and service disruptions while coordinating with technical teams to minimize mission impact
  • Conduct post-incident reviews to identify root causes, document lessons learned, and implement preventative measures that reduce recurrence
  • Collaborate with software developers, cloud engineers, cybersecurity personnel, and operations teams to improve services, reliability patterns, deployment practices, and operational standards
  • Create and maintain system documentation, configuration standards, operational runbooks, monitoring procedures, and service reliability guidance
Employment type: Full-time

Senior Reliability Engineer - AV Labs

We are looking for a hardware focused Senior Reliability Engineer to focus on se...
Location: United States, Sunnyvale
Salary: 180000.00 - 200000.00 USD / Year
Company: Uber
Expiration Date: Until further notice
Requirements
  • 5+ years of relevant industry experience in software engineering, site reliability, or systems engineering
  • Experience with modern observability platforms (e.g., Prometheus, Grafana, ELK) in edge, IoT, or hardware-integrated environments
  • Coding skills in one or more of Go, Python, or C++, with experience building and operating production systems
  • Proficiency in Linux internals and shell scripting for triaging and debugging edge devices or hardware-adjacent systems
  • Ability to debug across services, containers (Docker), and networking stacks
  • Proven track record owning reliability, infrastructure, or platform systems for large-scale production workloads
  • Experience designing and operating observability systems (metrics, logging, alerting, and dashboards)
  • Experience defining and implementing SLIs and SLOs for system availability or data yield
  • Deep understanding of networking protocols (TCP/IP, gRPC, or MQTT) and data handling in bandwidth-constrained environments
  • Experience driving complex technical projects and architectural reviews across multiple teams from design through production
Job Responsibility
  • Architect Observability Systems: Design and scale an observability platform capable of ingesting and analyzing real-time health telemetry from thousands of distributed vehicle nodes
  • Build for Edge Constraints: Develop systems that remain performant despite hardware diversity, intermittent connectivity, and rapid fleet scaling
  • Define Criticality Models: Establish alerting strategies that distinguish transient anomalies from systemic issues impacting sensor uptime and data yield
  • Detect Complex Failure Modes: Design detection logic for 'silent' failures, such as sensor degradation, compute saturation, or recording pipeline stalls
  • Scale Through Automation: Design automated detection, triage, and mitigation mechanisms to eliminate manual intervention as the fleet grows
  • Partner on Mitigation: Collaborate with Operations and Engineering to build safe, automated responses to recurring hardware and software failure scenarios
  • Drive Operational Efficiency: Build technical interfaces to help Operations surface issues and Engineering diagnose and deploy mitigations rapidly (TTD/TTM)
  • Lead Technical Strategy: Drive reliability-focused design reviews and translate operational pain points into concrete technical requirements and roadmaps
  • Uncover Proactive Insights: Apply advanced data analytics to identify latent patterns in fleet telemetry, enabling the proactive detection of systemic regressions and hardware degradation before they impact operations
What we offer
  • Uber's bonus program
  • equity award & other types of comp
  • 401(k) plan
  • various benefits
Employment type: Full-time

Principal Service Reliability Engineer

We are seeking a Principal Service Reliability Engineer (SRE) to lead the reliab...
Location: United States, Redmond
Salary: 142800.00 - 304200.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • 8+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, or systems administration OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, or systems administration OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration
  • Proven track record of defining and operationalizing SLOs, SLIs, and error budgets across multiple services or organizations
  • Experience leading reliability efforts for enterprise-scale or globally distributed systems
  • Advanced debugging and troubleshooting skills across application, platform, and infrastructure layers
  • Demonstrated ability to mentor senior engineers and influence engineering culture at scale
  • Experience driving platform-level improvements (e.g., standardized observability, shared reliability tooling, automated remediation frameworks)
  • Extensive experience operating large-scale, distributed production systems, including cloud-native platforms (Azure preferred)
  • Demonstrated ability to drive cross-team technical initiatives and influence architecture and engineering standards
  • Deep experience in observability, incident management, and production operations at scale
  • Strong understanding of Azure networking, distributed systems performance, and reliability engineering principles
Job Responsibility
  • Define and drive reliability strategy across services, including measurable targets for availability, latency, and performance aligned to business priorities
  • Establish and enforce SLO/SLI frameworks and error budgets across multiple teams, ensuring consistent adoption and accountability
  • Lead complex incident management and systemic RCA efforts, identifying cross-service failure patterns and driving durable, long-term fixes
  • Influence architecture and platform design to improve operability, scalability, fault isolation, and disaster recovery at enterprise scale
  • Drive reliability engineering standards for observability (metrics, logs, traces), capacity planning, and production readiness across the organization
  • Eliminate operational toil through automation, enabling self-healing systems and reducing manual intervention
  • Embed security, compliance, and resiliency practices into design and operational processes, ensuring alignment with enterprise requirements
  • Partner with engineering leadership to prioritize reliability investments and balance feature velocity with system stability
  • Lead and mentor engineers while shaping a strong reliability culture across teams and org boundaries
Employment type: Full-time

Senior Reliability Engineer

Are you looking for a career move that will put you at the heart of a global fin...
Location: Poland, Warsaw
Salary: 241750.00 - 411650.00 PLN / Year
Company: Citi
Expiration Date: Until further notice
Requirements
  • Senior software developer with scripting experience in Python/Shell/Bash/Java/Go
  • Experience in CI/CD, in large enterprise tech-stack infra-architecture
  • Hands-on experience deploying and scaling the Model Context Protocol (MCP) in complex, enterprise environments
  • Understanding of SRE/DevOps/CI/CD
  • Strong analytical, algorithmic, and problem-solving skills
  • Excellent teamwork, proactive attitude, strong communication skills, both written and oral
Job Responsibility
  • You will work in an agile software development environment, developing quality and scalable software solutions using leading-edge technologies
  • You will work closely with developers, engineers and non-technology employees to help them be more productive with the use of the CI/CD tools
  • You will collaborate with Citi Developer Services engineers to automate manual and repetitive processes, integrate services with AI (by building and maintaining MCP - Model Context Protocol servers), enhance system resiliency, and coordinate service issue investigations by deploying best practices
  • Automate manual activities, repetitive processes, reporting, controls, etc., configure and tune them
  • Build and maintain the foundational MCP servers that allow AI models to securely interact with enterprise systems
  • Continuously improve systems resiliency, reliability, and business cost through the design and development of software solutions and streamlined processes
  • Mitigate risk by analyzing the root cause of production issues, impacts to business, and required corrective actions.
What we offer
  • Employer paid Defined Contribution Pension Plan contribution of 6% of employee’s pensionable earnings (PPE Program)
  • Employer paid Private Medical Care Package for employees and Private Medical Care Packages for certain family members available at preferential rates
  • Employer paid Life Insurance Program for employees and Life Insurance for certain family members available at preferential rates
  • Employee Assistance Program financed by Employer
  • Paid Parental Leave Program (maternity and paternity leave; statutory and 2 weeks additional paid paternity leave)
  • Sport Card for employees subsidised via Social Benefits Fund and Sport Cards for certain family members available at preferential rates
  • Additional benefits from Company’s Social Benefit Fund, in particular: Holidays Allowance, support for sport and cultural activities, team building events
  • Additional day off for volunteering
  • Cafeteria/ flex benefit – a company benefits system which enables employees to select and purchase benefits offered by a provider and available for employees on the platform
Employment type: Full-time

Senior Service Engineer - CTJ - Top Secret

Microsoft has an exciting opportunity for an experienced Senior Service Engineer...
Location: United States, Reston
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 3+ years technical experience in software engineering, network engineering, service engineering, systems engineering, or industrial controls OR equivalent experience.
  • Candidates must have an active TS and be willing and eligible to upgrade to TS/SCI (with polygraph) or have an active TS/SCI and be willing and eligible to upgrade to TS/SCI (with polygraph).
  • This position requires verification of U.S. citizenship due to citizenship-based legal restrictions.
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Responds to incidents during regular on-call rotations by identifying the level of impact, troubleshooting issues, and deploying appropriate fixes to resolve root cause(s). Alerts product teams and owners to major customer impacting issues and escalates resolution of complex and highly impactful issues affecting multiple components or features to other engineers or engineering teams as needed. Shares details related to incidents and their resolution through postmortem reports and during regular review meetings.
  • Demonstrates understanding of service and/or system design, interactions between technology layers and components, functions of physical infrastructure (networking and servers) and cloud-based infrastructure, and dependencies at scale. Contributes to improving service architecture and design of a hybrid (physical and cloud-based) service through automation, software development, and networking with minimal guidance. Adjusts configurations and defines infrastructures to improve the availability, reliability, efficiency, observability, and/or performance of supported products and services, with minimal guidance from other engineers.
  • Designs, implements, and maintains Infrastructure as Code (IaC) solutions to provision and manage hybrid cloud and on premises infrastructure in a repeatable, secure, and auditable manner, minimizing configuration drift and manual intervention.
  • Authors, reviews, and maintains IaC templates and modules to deploy networking, compute, storage, identity, security controls, and PKI-related infrastructure across isolated and regulated environments, following least privilege and defense in depth principles.
  • Stays current in knowledge and expertise as the technology landscape evolves, specifically as it relates to Cloud networking, physical networking, automation, and Linux (Debian and Red Hat based distributions). Contributes to the adoption of new solutions. Proactively seeks opportunities to learn and receive feedback.
  • Shares insights and best practices that can be applied to improve development and operations of the system, platform, or product components and features by participating in design reviews, incident drills and debriefs, and regular meetings, as well as interactions with more experienced Service Engineers and members of product engineering teams.
  • Embody our culture and values.
Employment type: Full-time

Senior Service Engineer

The Senior Service Engineer is responsible for coordinating, managing, and overs...
Location: United Kingdom, Keele
Salary: Not provided
Company: 360 Resourcing Solutions
Expiration Date: Until further notice
Requirements
  • Training in facilities management, maintenance systems, or engineering services
  • Training in health and safety, contractor management, and risk assessment
  • Significant experience in facilities, maintenance, or service engineering within a regulated industry such as medical devices or pharmaceuticals
  • Proven experience managing external contractors and service providers in an operational environment
  • Experience coordinating maintenance and servicing of equipment and building systems (e.g., HVAC, utilities, safety systems)
  • Strong knowledge of facilities engineering, building services, and equipment maintenance coordination
  • Understanding of regulatory, quality, and documentation requirements in a medical device environment
  • Ability to manage service schedules, prioritize tasks, and resolve issues efficiently
  • Strong vendor and contractor management skills, including performance monitoring
  • Competence in reviewing technical service documentation and reports
Job Responsibility
  • Oversee and manage external contractors responsible for building, facilities, and utility services, ensuring work is completed safely, on time, and to required standards
  • Coordinate servicing, maintenance, and inspection of manufacturing, laboratory, and facility equipment, including scheduling and service planning
  • Act as the primary point of contact for service providers, managing work permits, access, supervision, and compliance with site procedures
  • Ensure all service and maintenance activities comply with company quality systems, health and safety requirements, and regulatory standards
  • Review and approve service reports, maintenance records, and contractor documentation for accuracy and completeness
  • Support equipment uptime by coordinating corrective and preventive maintenance activities
  • Liaise with Engineering, Quality, Manufacturing, and EHS teams to minimize operational impact during service activities
  • Support audits and inspections by providing service, maintenance, and contractor-related documentation
  • Identify opportunities for improvement in service coordination, contractor performance, and facilities reliability
  • Assist with budgeting, service contracts, and vendor performance reviews as required
What we offer
  • Company events
  • Company pension
  • Employee discount
  • Free or subsidised travel
  • Free parking on-site
Employment type: Full-time

Senior Service Engineer

The Cloud & AI organization accelerates Microsoft’s mission and bold ambitions t...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science, Information Technology, Mechanical Engineering, Electrical Engineering, Aerospace Engineering, Data Science, Cybersecurity, or related field AND 3+ years technical experience in software engineering, network engineering, service engineering, systems engineering, or industrial controls OR equivalent experience
  • 2+ years technical experience in service engineering, systems engineering, or cloud operations
  • 1+ years experience managing and improving enterprise services, including production system support
  • Experience working with scalable resources in Azure
  • Proficiency in: Azure deployment and integration, SQL and CosmosDB, Power Platform and automation frameworks, vulnerability triage, data analysis and visualization using Kusto/KQL, Power BI, and scripting languages (e.g., PowerShell and Python)
  • Strong troubleshooting skills across multiple platforms (Windows, Linux, mobile OS) and ability to resolve complex service issues
  • Excellent written and verbal communication skills, with sound judgment and decision-making abilities for high-stakes scenarios
  • Familiarity with low-code ecosystems (Power Platform, Fabric) and AI/agentic workloads
  • Experience with incident response, problem management, and operational telemetry analysis
  • Knowledge of security and compliance frameworks (e.g., SDL, NIST AI RMF) and ability to apply them in service operations
Job Responsibility
  • Own service health and operational reliability for Security Services
  • Translate production signals into platform improvements
  • Drive innovation in next‑gen Agent security
  • Develop deep operational expertise in securing agentic workloads
  • Increase service maturity through automation and analytics
  • Partner across security, engineering, and operations teams
Employment type: Full-time

Senior Service Engineer

Are you excited about working on one of Microsoft’s most strategic and high‑visi...
Location: India, Hyderabad
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • 10+ years of experience in incident management, service engineering, program management, or related technical roles
  • Strong track record commanding high-pressure, complex, cross-team incidents across cloud or large-scale distributed systems
  • 5+ years of hands-on experience working with cloud technologies (Azure preferred)
  • Strong understanding of Azure architecture, core services, and internal operational workflows
  • Exceptional communication skills, with the ability to simplify complex technical issues for senior executives and customers
  • Experience collaborating in matrixed engineering environments with diverse stakeholders (PG, EngOPS, Field, GPMs, PMs, SREs)
  • Strong analytical skills; ability to drive insight from data and influence direction through evidence
  • Proven experience driving pilots, building prototypes, or contributing to innovation in live‑site or automation scenarios
  • Demonstrated experience in AI/ML-based solutions—automation, anomaly detection, NLP, or reliability tooling. Exposure to Power BI, Kusto (KQL), or other analytical tooling
Job Responsibility
  • Lead high‑severity Azure incidents with strong command presence and clear decision‑making under pressure
  • Drive the end‑to‑end incident lifecycle, including detection, triage, mitigation, communication, and post‑incident learning
  • Partner across Azure product groups, EngOPS, and field teams to accelerate diagnosis, reduce time‑to‑mitigation, and drive sustainable fixes
  • Represent the voice of the customer by surfacing systemic issues, platform gaps, and reliability risks to engineering teams
  • Drive operational maturity through repeatable processes, strong governance, high‑quality execution, and measurable reliability metrics
  • Identify live‑site patterns and hotspots across services and lead cross‑team action plans to address them
  • Convert customer and incident pain points into automation, AI‑assisted workflows, and process improvements
  • Lead or co‑own pilots, proofs‑of‑concept, and tech accelerators that enhance incident response velocity and quality
  • Contribute to internal playbooks, frameworks, and tooling that leverage AI/ML for improved live‑site management
Employment type: Full-time