Lead Site Reliability Engineer Job at Glean (Palo Alto)

Lead Site Reliability Engineer

We're building a Site Reliability Engineering center in Mexico City, and we're h...

Location

Mexico, Mexico City

Salary:

Not provided

Capital One

Expiration Date

Until further notice

Requirements

Professional English fluency
Bachelor's degree
At least 6 years of experience in SRE, production operations, or reliability engineering
Experience in DevOps Engineering (internship experience does not apply)
5+ years of experience in at least one of the following: Java, Python, Go
At least 4 years of experience with Cloud Native technologies (Amazon Web Services, Microsoft Azure, Google Cloud Platform)
3+ years of experience with container orchestration services including Docker or Kubernetes
Experience with Shell or Bash scripting
At least 3 years of Unix or Linux system administration experience

Job Responsibility

Own reliability for batch settlement systems - ensure cycle completion windows are met, data integrity is maintained, and failures are detected before they reach downstream consumers
Build and improve observability for settlement pipelines - dashboards, alerts, and anomaly detection that make system health legible and reduce reliance on tribal knowledge
Drive automation of operational toil - certificate rotation, environment provisioning, compliance artifact generation, and manual validation steps that currently require human intervention
Partner with UK-based settlement engineers - acquire domain expertise on Durbin compliance windows, cross-border DCI routing, and acquirer/issuer SLA adherence
Participate in incident management - respond to settlement failures, drive root cause analysis, and implement durable fixes that prevent recurrence
Contribute to regulatory readiness - ensure SRE practices produce audit-ready artifacts for SOX and PCI-DSS exams without manual toil

What we offer

Healthy Body, Healthy Mind
Save Money, Make Money
Time, Family and Advice

Fulltime

Lead Site Reliability Engineer

Trimble is looking for a Site Reliability Engineering Lead to join Business Syst...

Location

India, Chennai

Salary:

Not provided

Trimble Inc.

Expiration Date

Until further notice

Requirements

Bachelor's or Master's degree in Computer Engineering, Computer Science, or a related field
7+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles, including at least 2 years in a leadership or mentoring capacity
Deep AWS expertise (EC2, S3, RDS, IAM, VPC, Lambda, CloudFormation/Terraform, etc.)
Strong knowledge of Infrastructure-as-Code (IaC) using Terraform, AWS CDK, or CloudFormation
Proven experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, or similar)
Proficiency in containerization and orchestration (Docker, Kubernetes, ECS, or EKS)
Expertise in monitoring and observability tools (Datadog, New Relic, Prometheus, Grafana, ELK, CloudWatch, etc.)
Strong scripting or programming background (Python, Bash, or Go)
Sound understanding of networking, security, and identity/access management in the cloud
Experience designing high-availability and disaster recovery strategies for critical workloads

Job Responsibility

Become well-versed in the opportunities and challenges of the business and Trimble's customers
Become an expert in Business Systems services, especially the interfaces—APIs, protocols (e.g. OAuth), and user interfaces
Establish, then utilize tight working relationships with stakeholders across the company, especially Trimble's engineering community
Prototype and create proofs of concept as required
Scope and deploy new integrations
Investigate, diagnose, and solve customer integration issues
Effectively communicate technical issues with stakeholders in non-technical language
Contribute to utilities and SDKs to help integration and migration efforts

Fulltime

Lead Site Reliability engineer

Solution, Reliability and Monitoring Entity main objective is to define, provide...

Location

India, Bangalore

Salary:

Not provided

Airbus

Expiration Date

Until further notice

Requirements

Bachelor’s or Master’s degree in Computer Science, Information Technology, or another related discipline, with 7+ years of experience
Solid experience designing and building secure solutions in AWS (Amazon Web Services)
Extensive experience in systems administration or a combination of software/systems experience
Some experience in scripting and automation of assets
Solid knowledge of Operating Systems & ability to perform troubleshooting required
Extensive knowledge of Cloud Technology concepts & ability to perform complex troubleshooting required
Solid knowledge of networking for enterprise environments required
Solid knowledge of Virtual Machine concepts and management of infrastructure
Demonstrated ability to identify the root cause of issues and to recommend permanent, long-term fixes
Demonstrated ability to perform complex troubleshooting in AWS environments and provide guidance to other teams

Job Responsibility

Define, implement, and manage cloud-based infrastructure
Work closely with the Software Factory’s (SWF) Solution Architects to facilitate the transition from Development to In-Support phase
Create and facilitate a hosting network with SWF
Represent the Hosting Group in the different Trains
Coordinate with Solution Architects (SAs) to support the technical architecture decisions related to Hosting
Support SWF in onboarding new components
Coordinate with SWF Systems & Architecture team for future planning
Contribute to Prioritization Reviews for the different trains
Guide products in Service Level Objective (SLO) definition and monitoring based on Hosting Operations feedback
Define, share, and broadcast Guidelines and Non-Functional Requirements (NFRs) related to hosting, deployment, and monitoring

Fulltime

Lead Site Reliability Engineer

Our client is committed to building trust and making the world more agreeable fo...

Location

Salary:

Not provided

N-iX

Expiration Date

Until further notice

Requirements

8+ years of experience in a relevant programming language
Extensive knowledge of Cosmos DB management and optimization
Strong Terraform IaC deployment experience
Proven ability to interact with stakeholders and promote best practices
Dashboarding/data visualization experience

Job Responsibility

Identify and assess Cosmos DB resource utilization and recommend optimization strategies
Engage directly with resource owners to present findings and implement rightsizing
Design, build, and maintain dashboards to visualize Cosmos DB usage and opportunities for improvement
Develop Terraform-based solutions for efficient cloud database management
Stay updated on best practices around cloud cost optimization and security

What we offer

Flexible working format - remote, office-based or flexible
A competitive salary and good compensation package
Personalized career growth
Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
Active tech communities with regular knowledge sharing
Education reimbursement
Memorable anniversary presents
Corporate events and team buildings
Other location-specific benefits

Lead Site Reliability Engineer

Our client is committed to building trust and making the world more agreeable fo...

Location

Salary:

Not provided

N-iX

Expiration Date

Until further notice

Requirements

8+ years in software development with languages such as C#, C++, or Java
Hands-on experience with Service Bus in a global enterprise setting
Proven expertise in Terraform and deployment automation
Experience with DR processes, dashboard creation, and resource rightsizing
Strong communication skills to drive engagement with service owners

Job Responsibility

Build and maintain DRI dashboards to identify resource utilization and optimization opportunities for Service Bus
Collaborate with service owners to recommend and implement right-sizing strategies
Author high-quality, scalable automation code to streamline disaster recovery processes
Develop and deploy IaC solutions using Terraform
Drive adoption of automation and robust monitoring for service health and disaster recovery
Participate in on-call rotations and refine processes for improved system reliability

What we offer

Flexible working format - remote, office-based or flexible
A competitive salary and good compensation package
Personalized career growth
Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
Active tech communities with regular knowledge sharing
Education reimbursement
Memorable anniversary presents
Corporate events and team buildings
Other location-specific benefits

Fulltime

Lead Site Reliability Engineer/ Expert

Responsible for ensuring highly reliable, scalable, and resilient production sys...

Location

Egypt, Cairo; India, Delhi

Salary:

Not provided

SITA

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field. Master’s degree preferred for senior roles
Relevant certifications such as ITIL, CCNP/CCIE, Palo Alto Security, SASE, SDWAN, Juniper Mist/Aruba, CompTIA Security+, or Certified Kubernetes Administrator (CKA)
Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies
Certifications in automation and IaC tools (Ansible, Terraform)
Certifications in observability and monitoring platforms (Dynatrace, Prometheus, Grafana, ELK)
Certifications in ServiceNow, Jira, or other operational tooling
8+ years in IT operations, service management, or infrastructure reliability, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Engineer
Strong experience with high availability systems, resilience engineering, and DR readiness
Deep expertise in RCA, incident management, PMIR, and implementing permanent fixes for recurring issues
Hands-on experience with CI/CD, automation, IaC, and self-healing/auto-remediation workflows

Job Responsibility

Design & maintain resilient systems ensuring high availability, scalability, and fault tolerance
Ensure effective Disaster Recovery (DR), failover strategies, and resilience engineering across environments
Improve platform reliability, observability, and performance across cloud and on‑premises systems
Establish and maintain SLIs, SLOs, and error budgets to measure and govern service reliability
Take ownership of production availability, capacity planning, performance tuning, and long‑term reliability initiatives
Drive automation for infrastructure provisioning, deployment, monitoring, and operational workflows
Develop and implement auto‑remediation and self‑healing solutions to reduce manual intervention
Manage CI/CD pipelines and Infrastructure as Code (IaC) frameworks for secure, repeatable deployments
Implement and manage zero‑downtime deployment strategies (blue‑green, canary, rolling)
Support containerized and cloud‑native platforms including Kubernetes, Docker, and distributed systems

What we offer

Work from home up to 2 days/week (depending on your team's needs)
Make your workday suit your life and plans
Take up to 30 days a year to work from any location in the world
Employee Assistance Program (EAP), for you and your dependents 24/7, 365 days/year
Champion Health - a personalized platform that supports a range of wellbeing needs
Access to world-class learning platforms and programs (LinkedIn Learning, Microsoft's Enterprise Skills Initiative, Airport Council International, Pluralsight, Harvard Business Publishing, Stanford)
Competitive benefits that make sense with both your local market and employment status

Fulltime

Site Reliability Engineer (Lead)

10Pearls is an award-winning end-to-end digital innovation company that helps bu...

Location

Pakistan, Islamabad

Salary:

Not provided

10Pearls

Expiration Date

Until further notice

Requirements

Bachelor's degree in computer science or related field
5–8 years in SRE or production-engineering roles running distributed systems at scale
Deep Kubernetes expertise — operators, RBAC, network policy, storage, upgrades
Hands-on with Keycloak / Vault / MinIO / Harbor / Kong or equivalent identity/secrets/storage/registry/gateway stacks
Strong Linux fundamentals and at least one systems language (Go, Rust) or shell/Python for tooling
Proven SLO/SLI authorship and error-budget-driven decision-making
Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Tempo)
Calm, clear communication during incidents; strong post-mortem writing
Hands-on with infra-as-code — Helm, Kustomize, Terraform

Job Responsibility

Substrate operation: own the Kubernetes cluster plus Keycloak (identity), Vault (secrets), MinIO (object storage), Harbor (registry), and Kong (gateway), from bootstrap to day-2 operations
SLO framework: define, publish, and defend SLOs for every tier-1 service; own error budgets and burn-rate alerting
Incident response: build the on-call rotation, paging, runbook library, and post-mortem culture; lead incident command during P1/P2 events
Release operations: co-own the blue-green / canary release model with L6 Delivery; sign off production-bound releases
Air-gap operations: ensure every operational runbook works in a fully offline environment, with no assumption of external dependencies
Lead the Platform squad: technically lead 1 Infrastructure Engineer, 1 Observability Engineer, and 2 DevOps Engineers; set standards for infra-as-code and automation

Fulltime

Technical Lead-Site Reliability Engineer

We are seeking an experienced Site Reliability Engineer to support Vodafone’s st...

Location

Egypt, Cairo

Salary:

Not provided

Vodafone

Expiration Date

Until further notice

Requirements

Experienced in Site Reliability Engineering, DevOps, or production support roles within complex, enterprise-scale environments
Skilled in Unix/Linux administration with strong shell scripting experience
Experienced with CI/CD tools such as Git, Jenkins, Nexus, SonarQube, and configuration or automation tools
Proficient in infrastructure as code using tools such as Terraform or CloudFormation
Comfortable working with public cloud platforms such as AWS or Azure
Able to develop using one or more high-level programming languages, including Python, Java, or JavaScript
Experienced in containerisation and orchestration technologies, including Docker and Kubernetes
Familiar with monitoring and observability tools such as Prometheus, Grafana, CloudWatch, or Centreon
Knowledgeable in microservices architecture, APIs, and web services (REST, SOAP, JSON, XML)
Experienced with relational and NoSQL data stores such as PostgreSQL, MariaDB, Redis, MongoDB, or similar technologies

Job Responsibility

Drive reliability, availability, and performance across IoT platforms through proactive monitoring, automation, and operational improvements
Design, deploy, review, and troubleshoot technical integrations with multiple platforms, services, and connected devices
Implement and enhance CI/CD practices to enable high levels of operational automation and zero-touch operations
Partner with development teams to improve services through rigorous testing, release management, and operational readiness
Act as a technical subject matter expert, supporting and coaching team members to build capability across relevant technologies
Lead and support incident and problem management activities, ensuring timely resolution, root cause analysis, and preventive actions in line with agreed SLAs
Contribute to system design reviews, including HLDs and LLDs, translating architectural decisions into operational requirements
Balance feature delivery speed with platform reliability through clearly defined service level objectives
Design, implement, and continuously enhance monitoring, alerting, and observability solutions to maintain a holistic view of system health
Manage production environments through proactive capacity planning, performance optimisation, and release deployments

What we offer

The opportunity to work on large-scale, business-critical IoT platforms with global reach
Exposure to modern cloud-native architectures, DevOps practices, and automation at enterprise scale
Collaboration with international teams across Vodafone Group and strategic partners
A role that blends hands-on engineering with system design, reliability strategy, and continuous improvement
A supportive environment that values learning, knowledge sharing, and professional growth
