SRE Lead Job at Randstad (Kuala Lumpur)

Lead SRE

We are looking for a Lead SRE to join our Inetum Team and be part of a work cult...

Location

Portugal , Lisbon

Salary:

Not provided

Inetum

Expiration Date

Until further notice

Requirements

SRE IT production processes
Agile / DevOps mindset
Problem solving
Scripting: Python, YML, Shell
Monitoring: Dynatrace, Nagios
Linux
Admin Network (DNS, Firewall, Switch)
DevOps stack: Git & Git Flow, Artifactory, Jenkins or Gitlab CI, Ansible Tower, Digital ai Release
Cloud: Kubernetes, Docker, Argo CD, Vault, Helm
End-to-end IT organization and processes (from development to run / operate)
Technical Architecture

Job Responsibility

Train SREs and their managers on SRE practices
Co-construct the transformation strategy and the support plan by participating in workshops, brainstorming with the transformation team and producing training content
Coach and support

Fulltime

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...

Location

Ireland , Dublin

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Solid SRE process experience
5+ years leading a high-performance, 24x7 DevOps or SysOps team
Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, Backup, Networking and Infrastructure
Experience with Microsoft OS Windows & Server
Experience in ticket tracking and resolving on time
Hands-on experience on ticketing tools (ServiceNow)
Excellent verbal, written, presentation and interpersonal communication skills
Ability to make complex technical matters easy-to-comprehend for non-technical persons.

Job Responsibility

Taking end-to-end Ownership of Application Support for Production Systems Issues resolution
Implementing, monitoring, and maintaining CI/CD frameworks
Developing new capabilities, coordinating implementation across a large number of teams including infrastructure, developer tools and information security
Influencing a culture of Site Reliability Engineering. Engaging in training and mentoring to help other engineers develop an SRE mindset
Providing the first line of after-deployment technical support at L1 and L2 level for applications and/or associated production systems diagnostics, and network health monitoring
Coordinating and/or deploying hands-on fixes, patches and software updates at the application level and, as appropriate, at the network level
Managing a team of technical support engineers who provide technical support to users
Escalating complex problems to the L3 level of expertise within organization, along with observations from investigative and diagnostic assessments
Coordinating the investigation of recurring technical issues affecting user systems and seeing them through to resolution
Escalating, resolving, guiding team, and tracking production incidents to closure

What we offer

Competitive base salary (which is annually reviewed)
Hybrid working model (up to 2 days working at home per week)
Additional benefits to support you and your family to be well, live well and save well.

Fulltime

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...

Location

India , Bangalore

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

10+ years in systems engineering
at least 5 years in SRE or DevOps roles
expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
proficiency in programming and scripting languages like Python, Go, and Bash
advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
deep understanding of networking, DNS, load balancing, and security principles
proven track record of managing high-availability systems in demanding environments
exceptional analytical and problem-solving skills

Job Responsibility

Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
mentor junior engineers, fostering a collaborative and growth-oriented team environment
guide architectural decisions that drive innovation and enhance system reliability

What we offer

The opportunity to work with cutting-edge technologies in a transformative environment
a collaborative and innovative work environment that values your expertise and contributions
professional growth and leadership development pathways tailored to your aspirations
a chance to leave a lasting impact by shaping the future of reliable and scalable systems

Engineering Lead Analyst

The Engineering Lead Analyst is a senior level position responsible for leading ...

Location

India , Pune

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

6-10 years of relevant experience in an Engineering role
Experience working in Financial Services or a large complex and/or global environment
Project Management experience
Consistently demonstrates clear and concise written and verbal communication
Comprehensive knowledge of design metrics, analytics tools, benchmarking activities and related reporting to identify best practices
Demonstrated analytic/diagnostic skills
Ability to work in a matrix environment and partner with virtual teams
Ability to work independently, multi-task, and take ownership of various parts of a project or initiative
Ability to work under pressure and manage to tight deadlines or unexpected changes in expectations or requirements
Proven track record of operational process change and improvement

Job Responsibility

Serve as a technology subject matter expert for internal and external stakeholders
Provide direction for all firm mandated controls and compliance initiatives
Lead projects within the group and create a technology domain roadmap
Ensure that all integration of functions meet business goals
Define necessary system enhancements to deploy new products and process enhancements
Recommend product customization for system integration
Identify problem causality, business impact and root causes
Exhibit knowledge of how own specialty area contributes to the business
Apply knowledge of competitors, products and services
Advise or mentor junior team members

Fulltime

Director, Service Reliability Engineering

As Director of SRE, you will lead the team responsible for accelerating and auto...

Location

United States , Bethesda

Salary:

125600.00 - 203700.00 USD / Year

Marriott Bonvoy

Expiration Date

Until further notice

Requirements

Undergraduate degree in computer science, software engineering, or a related field (or equivalent experience)
10+ years of experience in SRE, devsecops or IT operations
At least 5 years’ experience in a previous leadership role within SRE, devsecops or IT Operations
At least five years of experience in the following technologies: Presentation Management (HTML, CSS, JS, Backbone, Node JS, Android, iOS); Application Platforms (NGINX, Java, Akana, Play Framework, Tomcat, Docker, Openshift); Application Data (PostgreSQL, Couchbase, Cassandra); Integration Services (Apache Kafka, Apache Spark, Akana); Analytics Platforms (Hadoop, dashDB, Cognos, Tableau); Security (Forgerock, OpenID, OAUTH, Ping Identity); Public Cloud (Azure, Google Cloud, AliCloud, Amazon Web Services); CI/CD (Harness)
Experience with test automation
Working knowledge and proven track record of implementing disaster indifferent architecture
Experience with CDN and Akamai tools
Linux/Unix system administration experience
Proficient in scripting and programming languages (like Python, Go, Bash, Shell)
Hands on experience with infrastructure as code (like Terraform), container orchestration (like Kubernetes), and reliability automation

Job Responsibility

Define and execute Marriott’s SRE vision, aligning with business objectives and technology roadmaps
Build, mentor and lead a high-performing SRE team, fostering a culture of collaboration and innovation
Establish reliability, observability and automation goals to improve system uptime, performance and scalability
Partner with engineering, operations and security teams to drive best practices and continuous improvement
Implement reliability-focused engineering practices, including SLAs, SLOs/SLIs and error budgets
Design and maintain resilient, scalable and fault-tolerant architectures across cloud and hybrid environments
Develop strategies to proactively identify and mitigate risks to system performance and availability
Drive root cause analysis (RCA) and post-mortem processes to prevent recurring incidents
Champion automation in monitoring, deployment and incident resolution to reduce toil and enhance efficiency
Lead and optimize incident response processes, ensuring rapid detection, diagnosis, and resolution of system failures

What we offer

Bonus program
comprehensive health care benefits
401(k) plan with up to 5% company match
employee stock purchase plan at 15% discount
accrued paid time off (including sick leave where applicable)
life insurance
group disability insurance
travel discounts
adoption assistance
paid parental leave

Fulltime

SRE Observability Lead Engineer

The SRE Observability Lead Engineer is a hands-on leader responsible for shaping...

Location

United Kingdom , London

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Relevant experience in Observability, SRE, Infrastructure Engineering, or Platform Architecture, including several years in senior leadership roles
Deep expertise in observability tools and stacks such as Grafana, Prometheus, OpenTelemetry, ELK, Splunk, and similar platforms
Strong hands-on experience across hybrid infrastructure, including on-prem, cloud (AWS, GCP, Azure), and container platforms (ECS, Kubernetes)
Proven ability to design scalable telemetry and instrumentation strategies, resolve production observability gaps, and integrate them into large-scale systems
Experience leading teams and managing people across geographically distributed locations
Strong ability to influence platform, cloud, and engineering leaders to ensure observability tooling is built for reuse and scale
Deep understanding of SRE fundamentals, including SLIs, SLOs, error budgets, and telemetry-driven operations
Strong collaboration skills and experience working across federated teams, building consensus and delivering change
Ability to stay up to date with industry trends and apply them to improve internal tooling and design decisions
Excellent written and verbal communication skills

Job Responsibility

Define and own the strategic vision and multi-year roadmap for Observability across Services Technology, aligned with enterprise reliability and production goals
Translate strategy into an actionable delivery plan in partnership with Services Architecture & Engineering function, delivering incremental, high-value milestones toward a unified, scalable observability architecture
Lead and mentor SREs across Services, fostering technical growth and an SRE mindset
Build and offer a suite of central observability services across LoBs – including standardized telemetry libraries, onboarding templates, dashboard packs, and alerting standards
Drive reusability and efficiency by creating common patterns and golden paths for observability adoption across critical client flows and platforms
Partner with infrastructure, CTO and other SMBF tooling teams, to ensure observability tooling is scalable, resilient, and avoids duplication (“cottage industries”)
Work hands-on to troubleshoot telemetry and instrumentation issues across on-prem, cloud (AWS, GCP, etc.), and ECS/Kubernetes-based environments
Collaborate closely with the architecture function to support implementation of observability NFRs in the SDLC, ensuring new apps go live with sufficient coverage and insight
Support SRE Communities of Practice (CoP) and foster strong relationships with SREs, developers, and platform leads across Services and beyond to accelerate adoption & promote SRE best practices like SLO adoption, Capacity Planning
Use Jira/Agile workflows to track and report on observability maturity across Services LoBs – coverage, adoption, and contribution to improved client experience

What we offer

27 days annual leave (plus bank holidays)
A discretionary annual performance-related bonus
Private Medical Care & Life Insurance
Employee Assistance Program
Pension Plan
Paid Parental Leave
Special discounts for employees, family, and friends
Access to an array of learning and development resources

Fulltime

Orion Tech SRE Lead - Senior Vice President

The Orion Tech- SRE Lead is a hands-on leader responsible for shaping and delive...

Location

India , Chennai; Pune

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

16+ years of experience in Observability, SRE, Infrastructure Engineering, or Platform Architecture, including 5+ years in senior leadership roles
Deep expertise in observability tools and stacks such as Grafana, Prometheus, OpenTelemetry, ELK, Splunk, and similar platforms
Strong hands-on experience across hybrid infrastructure, including on-prem, cloud (AWS, Google Cloud), and container platforms (ECS, Kubernetes)
Proven ability to design scalable telemetry and instrumentation strategies, resolve production observability gaps, and integrate them into large-scale systems
Experience leading teams and managing people across geographically distributed locations
Strong ability to influence platform, cloud, and engineering leaders to ensure observability tooling is built for reuse and scale
Deep understanding of SRE fundamentals, including SLIs, SLOs, error budgets, and telemetry-driven operations
Strong collaboration skills and experience working across horizontal infrastructure teams, building consensus and delivering changes
Ability to stay up to date with market trends and apply them to improve internal tooling and design decisions
Good understanding of the AI tech stack, with the ability to create a business case and solve it using Citibank AI solutions

Job Responsibility

Define and own the roadmap for Engineering enablers for Project Orion team aligned with enterprise reliability and SRE Services organization goals
Translate Organization strategy into an actionable delivery plan in partnership with Services Products, Operations & Engineering function, delivering incremental, high-value milestones
Understand Critical Business Services functional scope and translate into End-to-End monitoring solutions
Periodically review and analyze application monitoring toil, and collaborate with stakeholders to remediate it in line with organization goals
Identify manual operations use cases performed by Level 1 functions and create a strategic plan to automate them
Drive reusability and efficiency by tracking problem statements raised by the Orion Level 1 function and providing a milestone delivery plan
Design and build strategic observability dashboards that bring golden signals such as SLOs, SLIs, latency, and business metrics into a single pane of glass
Lead and mentor SREs, fostering technical growth and an SRE mindset
Work hands-on to troubleshoot telemetry and instrumentation issues across on-prem, cloud (AWS, GCP, etc.), and ECS/Kubernetes-based environments
Use Jira/Agile workflows to track and report on strategic enablers coverage, adoption, and contribution to improved client experience

Fulltime

Engineering Manager for Observability/CI/CD and Cloud

Lead the AI-Driven Evolution of Groupon’s Global Engineering Platform. At Groupo...

Location

Dublin; Madrid; Prague; Valencia; Warsaw

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

5+ years’ experience leading infrastructure, DevOps, or SRE teams (5+ people), ideally in high-change, scale-up environments
Deep technical expertise in cloud-native platforms, observability, infrastructure as code, and CI/CD tooling
Proven success operationalizing AI tools within engineering workflows
Strategic, resilient, and pragmatic approach: ready to own results and thrive under shifting priorities
Exceptional communication: able to simplify complexity and effectively partner with C-level and global teams
Bachelor’s or Master’s in Computer Science (or similar)—or equivalent industry experience

Job Responsibility

Lead & Inspire: Build and mentor a high-performing, globally distributed team of CI/CD and Observability engineers (5-10 direct reports), coaching them in cutting-edge AI-assisted workflows and best practices
Modernize Core Infrastructure: Spearhead the migration from legacy platforms (Jenkins, ELK) to cloud-native solutions (GitHub Actions, Google Cloud Logging, GCP Prometheus/Grafana). Eliminate “straggler” pipelines and drive cost-efficient, reliable operations
AI-First Engineering: Operationalize AI tools (Claude Code, Copilot, ChatGPT, etc.) for everything from log analysis and incident summaries to automated infrastructure as code, making AI-augmented engineering a daily norm
Architect & Optimize: Oversee a hybrid tech stack (Kubernetes, Envoy, Terraform, GCP, AWS), ensuring platforms are fast, scalable, and “self-healing” via LLM integrations
Collaborate Globally: Act as a thought leader and cross-functional partner, advocating for AI-driven developer experience and collaborating with leaders in SRE, Product, and Cloud
Drive Transformation: Deliver strategic projects with tight deadlines and direct business impact, such as the Jenkins-to-GHA and ELK-to-GCP migrations, while maintaining a high standard of technical excellence and cost efficiency

What we offer

Drive real, high-visibility change at the heart of a company undergoing major transformation
Work on complex technical and operational challenges in a fast-paced, AI-first environment
Accelerate your impact—and your team’s—using industry-leading AI and automation tools
Influence engineering practices across a global platform impacting millions of users

SRE Lead

Randstad

Location:
Malaysia , Kuala Lumpur

Category:
IT - Software Development

Contract Type:
Not provided


Job Posted:
February 18, 2026

Expiration:
March 21, 2026

