Lead Site Reliability Engineer

Mexico, Mexico City · Job Posted June 29, 2026

Job Description

We're building a Site Reliability Engineering center in Mexico City, and we're hiring a Manager-level Backend Engineer to own the reliability and operational maturity of our settlement platforms. These are batch-critical systems that process every credit and debit transaction across the network. This is a foundational role. You'll be one of the first engineers in CDMX responsible for ensuring settlement cycles complete accurately, on time, and in compliance with SOX and PCI-DSS requirements. You'll work across hybrid infrastructure (on-prem data centers and AWS), partner closely with UK-based engineers, and build the automation and observability that allow Mexico City to operate settlement.

Job Responsibility

  • Own reliability for batch settlement systems - ensure cycle completion windows are met, data integrity is maintained, and failures are detected before they reach downstream consumers
  • Build and improve observability for settlement pipelines - dashboards, alerts, and anomaly detection that make system health legible and reduce reliance on tribal knowledge
  • Drive automation of operational toil - certificate rotation, environment provisioning, compliance artifact generation, and manual validation steps that currently require human intervention
  • Partner with UK-based settlement engineers - acquire domain expertise on Durbin compliance windows, cross-border DCI routing, and acquirer/issuer SLA adherence
  • Participate in incident management - respond to settlement failures, drive root cause analysis, and implement durable fixes that prevent recurrence
  • Contribute to regulatory readiness - ensure SRE practices produce audit-ready artifacts for SOX and PCI-DSS exams without manual toil

Requirements

  • Professional English fluency
  • Bachelor's degree
  • At least 6 years of experience in SRE, production operations, or reliability engineering
  • Experience in DevOps Engineering (internship experience does not apply)
  • 5+ years of experience in at least one of the following: Java, Python, Go
  • At least 4 years of experience with Cloud Native technologies (Amazon Web Services, Microsoft Azure, Google Cloud Platform)
  • 3+ years of experience with container orchestration services including Docker or Kubernetes
  • Experience with Shell or Bash scripting
  • At least 3 years of Unix or Linux system administration experience

Nice to have

  • Experience developing automation solutions using agentic AI tools (Claude Code, Copilot CLI)
  • Troubleshooting and debugging skills across distributed systems
  • Familiarity with payments, financial services, or other regulated high-availability domains
  • Knowledge of, or experience with, networking concepts (TCP, DNS, TLS)

What we offer

  • Healthy Body, Healthy Mind
  • Save Money, Make Money
  • Time, Family and Advice

Similar Jobs for Lead Site Reliability Engineer

8 matching positions

Lead Site Reliability Engineer

Trimble is looking for a Site Reliability Engineering Lead to join Business Syst...
Trimble Inc.
Location: India, Chennai
Salary: Not provided
Expiration Date: Until further notice

Requirements

  • Bachelor's or Master's degree in Computer Engineering, Computer Science, or a related field
  • 7+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles with at least 2+ years in a leadership or mentoring capacity
  • Deep AWS expertise (EC2, S3, RDS, IAM, VPC, Lambda, CloudFormation/Terraform, etc.)
  • Strong knowledge of Infrastructure-as-Code (IaC) using Terraform, AWS CDK, or CloudFormation
  • Proven experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, or similar)
  • Proficiency in containerization and orchestration (Docker, Kubernetes, ECS, or EKS)
  • Expertise in monitoring and observability tools (Datadog, New Relic, Prometheus, Grafana, ELK, CloudWatch, etc.)
  • Strong scripting or programming background (Python, Bash, or Go)
  • Sound understanding of networking, security, and identity/access management in the cloud
  • Experience designing high-availability and disaster recovery strategies for critical workloads

Job Responsibility

  • Become well-versed in the opportunities and challenges of the business and Trimble's customers
  • Become an expert in Business Systems services, especially the interfaces—APIs, protocols (e.g. OAuth), and user interfaces
  • Establish, then utilize tight working relationships with stakeholders across the company, especially Trimble's engineering community
  • Prototype and create proofs of concept as required
  • Scope and deploy new integrations
  • Investigate, diagnose, and solve customer integration issues
  • Effectively communicate technical issues with stakeholders in non-technical language
  • Contribute to utilities and SDKs to help integration and migration efforts

Employment type: Full-time

Lead Site Reliability engineer

Solution, Reliability and Monitoring Entity main objective is to define, provide...
Airbus
Location: India, Bangalore
Salary: Not provided
Expiration Date: Until further notice

Requirements

  • Bachelor’s or Master’s degree in Computer Science, information technology or other related discipline with 7+ years of experience
  • Solid experience designing and building secure solutions in AWS (Amazon Web Services)
  • Extensive experience in systems administration or a combination of software/systems experience
  • Some experience in scripting and automation of assets
  • Solid knowledge of Operating Systems & ability to perform troubleshooting required
  • Extensive knowledge of Cloud Technology concepts & ability to perform complex troubleshooting required
  • Solid knowledge of networking for enterprise environments required
  • Solid knowledge of Virtual Machine concepts and management of infrastructure
  • Demonstrated ability to identify the root cause of issues and to recommend permanent, long-term fixes
  • Demonstrated ability to perform complex troubleshooting in an AWS environment and to provide guidance to other teams

Job Responsibility

  • Define, implement, and manage cloud-based infrastructure
  • Work closely with the Software Factory’s (SWF) Solution Architects to facilitate the transition from Development to In-Support phase
  • Create and lead a hosting network with SWF
  • Represent the Hosting Group in the different Trains
  • Coordinate with Solution Architects (SAs) to support the technical architecture decisions related to Hosting
  • Support SWF in onboarding new components
  • Coordinate with SWF Systems & Architecture team for future planning
  • Contribute to Prioritization Reviews for the different trains
  • Guide products in Service Level Objectives (SLO) definitions & monitoring based on Hosting Operations feedback
  • Define, share and broadcast Guidelines and Non-Functional Requirements (NFR) related to: hosting, deployment and monitoring

Employment type: Full-time

Lead Site Reliability Engineer

Glean is seeking a Site Reliability Engineering Lead to foster a culture of engi...
Glean
Location: United States, Palo Alto
Salary: 200,000 - 260,000 USD / Year
Expiration Date: Until further notice

Requirements

  • Bachelor’s degree in Computer Science, a related field, or equivalent practical experience
  • 8+ years of experience in a senior-level role within Site Reliability Engineering or similar role, particularly in managing cloud-based services and infrastructure
  • 5+ years of experience with software development in one or more programming languages
  • 3+ years of experience managing people or teams, leading projects, and designing, analyzing, and troubleshooting distributed systems running in Cloud
  • Strong knowledge of cloud platforms such as Google Cloud Platform, AWS, or Azure
  • Practical experience with containerization technologies, including Docker and Kubernetes
  • Familiarity with infrastructure as code tools like Terraform is essential
  • Solid understanding of networking, security principles, and best SRE and security practices
  • Proficiency in using monitoring and alerting tools to detect and respond to potential issues effectively

Job Responsibility

  • Foster a culture of engineering excellence, drive technical strategy, and develop a high-performing, collaborative team
  • Ensure services meet stringent Service Level Objectives (SLOs)
  • Build resilient, automated production environments in the cloud
  • Lead a team and be responsible for products globally
  • Provide technical leadership to key projects
  • Manage the complex challenges of scale and fast growth
  • Keep Glean applications up and running
  • Drive technical excellence and foster a culture of reliability across engineering teams
  • Set best practices for incident management, performance optimization, and automation
  • Influence best practices, drive cross-team collaborations, and contribute to the execution of key objectives

What we offer

  • Comprehensive benefits package
  • Medical, Vision, and Dental coverage
  • Generous time-off policy
  • Opportunity to contribute to 401k plan
  • Home office improvement stipend
  • Annual education and wellness stipends
  • Vibrant company culture through regular events
  • Healthy lunches daily

Employment type: Full-time

Lead Site Reliability Engineer

Our client is committed to building trust and making the world more agreeable fo...
N-iX
Salary: Not provided
Expiration Date: Until further notice

Requirements

  • 8+ years of experience in a relevant programming language
  • Extensive knowledge of Cosmos DB management and optimization
  • Strong Terraform IaC deployment experience
  • Proven ability to interact with stakeholders and promote best practices
  • Dashboarding/data visualization experience

Job Responsibility

  • Identify and assess Cosmos DB resource utilization and recommend optimization strategies
  • Engage directly with resource owners to present findings and implement rightsizing
  • Design, build, and maintain dashboards to visualize Cosmos DB usage and opportunities for improvement
  • Develop Terraform-based solutions for efficient cloud database management
  • Stay updated on best practices around cloud cost optimization and security

What we offer

  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits

Lead Site Reliability Engineer

Our client is committed to building trust and making the world more agreeable fo...
N-iX
Salary: Not provided
Expiration Date: Until further notice

Requirements

  • 8+ years in software development with languages such as C#, C++, or Java
  • Hands-on experience with Service Bus in a global enterprise setting
  • Proven expertise in Terraform and deployment automation
  • Experience with DR processes, dashboard creation, and resource rightsizing
  • Strong communication skills to drive engagement with service owners

Job Responsibility

  • Build and maintain DRI dashboards to identify resource utilization and optimization opportunities for Service Bus
  • Collaborate with service owners to recommend and implement right-sizing strategies
  • Author high-quality, scalable automation code to streamline disaster recovery processes
  • Develop and deploy IaC solutions using Terraform
  • Drive adoption of automation and robust monitoring for service health and disaster recovery
  • Participate in on-call rotations and refine processes for improved system reliability

What we offer

  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits

Employment type: Full-time

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Groupon
Location: India, Bangalore
Salary: Not provided
Expiration Date: Until further notice

Requirements

  • 10+ years in systems engineering
  • At least 5 years in SRE or DevOps roles
  • Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • Proficiency in programming and scripting languages like Python, Go, and Bash
  • Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • Deep understanding of networking, DNS, load balancing, and security principles
  • Proven track record of managing high-availability systems in demanding environments
  • Exceptional analytical and problem-solving skills

Job Responsibility

  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • Create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • Build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • Lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • Proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • Mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • Guide architectural decisions that drive innovation and enhance system reliability

What we offer

  • The opportunity to work with cutting-edge technologies in a transformative environment
  • A collaborative and innovative work environment that values your expertise and contributions
  • Professional growth and leadership development pathways tailored to your aspirations
  • A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Lead Site Reliability Engineer/ Expert

Responsible for ensuring highly reliable, scalable, and resilient production sys...
SITA
Location: Egypt, Cairo; India, Delhi
Salary: Not provided
Expiration Date: Until further notice

Requirements

  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field. Master’s degree preferred for senior roles
  • Relevant certifications such as ITIL, CCNP/CCIE, Palo Alto Security, SASE, SDWAN, Juniper Mist/Aruba, CompTIA Security+, or Certified Kubernetes Administrator (CKA)
  • Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies
  • Certifications in automation and IaC tools (Ansible, Terraform)
  • Certifications in observability and monitoring platforms (Dynatrace, Prometheus, Grafana, ELK)
  • Certifications in ServiceNow, Jira, or other operational tooling
  • 8+ years in IT operations, service management, or infrastructure reliability, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Engineer
  • Strong experience with high availability systems, resilience engineering, and DR readiness
  • Deep expertise in RCA, incident management, PMIR, and implementing permanent fixes for recurring issues
  • Hands-on experience with CI/CD, automation, IaC, and self-healing/auto-remediation workflows

Job Responsibility

  • Design & maintain resilient systems ensuring high availability, scalability, and fault tolerance
  • Ensure effective Disaster Recovery (DR), failover strategies, and resilience engineering across environments
  • Improve platform reliability, observability, and performance across cloud and on‑premises systems
  • Establish and maintain SLIs, SLOs, and error budgets to measure and govern service reliability
  • Take ownership of production availability, capacity planning, performance tuning, and long‑term reliability initiatives
  • Drive automation for infrastructure provisioning, deployment, monitoring, and operational workflows
  • Develop and implement auto‑remediation and self‑healing solutions to reduce manual intervention
  • Manage CI/CD pipelines and Infrastructure as Code (IaC) frameworks for secure, repeatable deployments
  • Implement and manage zero‑downtime deployment strategies (blue‑green, canary, rolling)
  • Support containerized and cloud‑native platforms including Kubernetes, Docker, and distributed systems

What we offer

  • Work from home up to 2 days/week (depending on your team's needs)
  • Make your workday suit your life and plans
  • Take up to 30 days a year to work from any location in the world
  • Employee Assistance Program (EAP), for you and your dependents 24/7, 365 days/year
  • Champion Health - a personalized platform that supports a range of wellbeing needs
  • Access to world-class learning platforms and programs (LinkedIn Learning, Microsoft's Enterprise Skills Initiative, Airport Council International, Pluralsight, Harvard Business Publishing, Stanford)
  • Competitive benefits that make sense with both your local market and employment status

Employment type: Full-time

Site Reliability Engineer (Lead)

10Pearls is an award-winning end-to-end digital innovation company that helps bu...
10Pearls
Location: Pakistan, Islamabad
Salary: Not provided
Expiration Date: Until further notice

Requirements

  • Bachelor's degree in computer science or related field
  • 5–8 years in SRE or production-engineering roles running distributed systems at scale
  • Deep Kubernetes expertise — operators, RBAC, network policy, storage, upgrades
  • Hands-on with Keycloak / Vault / MinIO / Harbor / Kong or equivalent identity/secrets/storage/registry/gateway stacks
  • Strong Linux fundamentals and at least one systems language (Go, Rust) or shell/Python for tooling
  • Proven SLO/SLI authorship and error-budget-driven decision-making
  • Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Tempo)
  • Calm, clear communication during incidents; strong post-mortem writing
  • Hands-on with infra-as-code — Helm, Kustomize, Terraform

Job Responsibility

  • Substrate operation: own the Kubernetes cluster plus Keycloak (identity), Vault (secrets), MinIO (object storage), Harbor (registry), and Kong (gateway), from bootstrap to day-2 operations
  • SLO framework: define, publish, and defend SLOs for every tier-1 service; own error budgets and burn-rate alerting
  • Incident response: build the on-call rotation, paging, runbook library, and post-mortem culture; lead incident command during P1/P2 events
  • Release operations: co-own the blue-green / canary release model with L6 Delivery; sign off production-bound releases
  • Air-gap operations: ensure every operational runbook works in a fully offline environment, with no assumption of external dependencies
  • Lead the Platform squad: technically lead 1 Infrastructure Engineer, 1 Observability Engineer, 2 DevOps Engineers; set standards for infra-as-code and automation

Employment type: Full-time