Site Reliability Engineering (SRE) / Lead Engineer

Mexico, Guadalajara · Job Posted May 04, 2026

Job Description

We are currently seeking a Site Reliability Engineering (SRE) / Lead Engineer to join our team in Guadalajara, Jalisco (MX-JAL), Mexico (MX). The ideal candidate will have deep expertise in Application Performance Monitoring (APM), Infrastructure as Code (IaC), automation, and distributed tracing using OpenTelemetry. As an SRE lead, you will guide the design, implementation, and continuous improvement of observability solutions, ensuring system reliability, performance, and scalability while fostering best practices in SRE and DevOps.

Job Responsibility

  • Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
  • Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
  • Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
  • Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
  • Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
  • Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
  • Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
  • Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
  • Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence

Requirements

  • 8-10+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
  • Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation
  • Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
  • Strong proficiency in Infrastructure as Code (IaC) using Terraform
  • Solid understanding of cloud platforms including AWS, GCP, or Azure
  • Experience with automation/configuration management tools like Ansible, Chef, or Puppet
  • Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
  • Experience managing Kubernetes and containerized environments (Docker, Helm)
  • Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
  • Excellent leadership, communication, and collaboration skills

Similar Jobs for Site Reliability Engineering (SRE) / Lead Engineer

8 matching positions

Site Reliability Engineering (SRE) Team Lead

We are looking for a highly skilled and experienced Site Reliability Engineering...
Location: United States, Irving
Salary: Not provided
Company: OneMain Financial
Expiration Date: Until further notice
Requirements
  • BA/BS in Computer Science, Engineering, related field, or equivalent experience
  • 7+ years of experience in site reliability engineering, systems engineering, or related roles, with at least 2 years in a leadership position
  • Proven experience leading and scaling high-performing engineering teams
  • Deep expertise in cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes, Docker)
  • Strong skills in infrastructure as code tools (Terraform, Ansible, CloudFormation) and CI/CD pipelines
  • Proficiency with monitoring and alerting systems (Prometheus, Grafana, ELK, Datadog)
  • Solid programming and scripting skills (Python, Go, Bash, or similar)
  • Strong understanding of distributed systems, networking, security, and databases
  • Excellent leadership, communication, and collaboration skills
  • Experience managing incident response and on-call rotations
Job Responsibility
  • Lead, mentor, and grow a team of site reliability engineers, promoting a culture of reliability, automation, and continuous improvement
  • Drive the design, implementation, and maintenance of scalable and fault-tolerant infrastructure to support high-availability services
  • Oversee incident management processes, including triage, root cause analysis, and postmortems to improve system reliability and prevent recurrence
  • Collaborate cross-functionally with software engineering, product, and operations teams to integrate reliability best practices into the software development lifecycle
  • Define and implement operational metrics, SLIs/SLOs, and dashboards to monitor system health and drive proactive improvements
  • Manage and assess the observability of critical environments, proactively addressing gaps that may arise
  • Oversee the release management processes, artifacts and tools that drive a repeatable software delivery lifecycle
  • Champion automation efforts to reduce manual intervention, improve deployment pipelines, and optimize infrastructure management
  • Lead capacity planning, disaster recovery, and performance tuning efforts
  • Ensure security and compliance standards are upheld across infrastructure and operations
What we offer
  • Health and wellbeing options including medical, prescription, dental, vision, hearing, accident, hospital indemnity, and life insurances
  • Up to 4% matching 401(k)
  • Employee Stock Purchase Plan (10% share discount)
  • Tuition reimbursement
  • Paid time off (15 days’ vacation per year, plus 2 personal days, prorated based on start date)
  • Paid sick leave as determined by state or local ordinance, prorated based on start date
  • Paid holidays (7 days per year, based on start date)
  • Paid volunteer time (3 days per year, prorated based on start date)
  • Access to Talkspace and Hinge for on-demand physical therapy via an app
  • Family back-up care
Employment type: Full-time

Site Reliability Engineering (SRE) / Observability Technical Lead

Join a dynamic team as a Site Reliability Engineer, leading observability and re...
Location: United Kingdom, London
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
  • Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
  • Hands-on experience with OpenTelemetry (OTel) for distributed tracing and observability instrumentation
  • Strong proficiency in Infrastructure as Code (IaC) using Terraform
  • Solid understanding of cloud platforms including AWS, GCP, or Azure
  • Experience with automation/configuration management tools like Ansible, Chef, or Puppet
  • Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
  • Experience managing Kubernetes and containerized environments (Docker, Helm)
  • Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
  • Excellent leadership, communication, and collaboration skills
Job Responsibility
  • Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
  • Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
  • Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
  • Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
  • Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
  • Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
  • Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
  • Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
  • Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence
What we offer
  • Tailored benefits that support your physical, emotional, and financial wellbeing
  • Continuous growth and development opportunities
  • Flexible work options
Employment type: Full-time

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location: Canada, Mississauga
Salary: 115000.00 - 128000.00 CAD / Year
Company: PointClickCare
Expiration Date: Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Integrate APIs to streamline data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1!
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more!
Employment type: Full-time

Senior Site Reliability Engineer (SRE)

The Senior SRE is responsible for deployment, updates, and operational support f...
Location: India, Chennai
Salary: Not provided
Company: Dalet
Expiration Date: Until further notice
Requirements
  • Cloud platforms: AWS, Azure
  • Containerisation & Orchestration: Kubernetes
  • Infrastructure as Code: Terraform
  • Configuration Management: Ansible
  • Packaging & Deployment: Helm
  • Databases: MariaDB, MongoDB
  • Monitoring, observability, networking, and cloud security.
Job Responsibility
  • Act as a senior technical authority for APAC Site Reliability Engineering activities
  • Drive best practices in reliability, operations, and engineering standards
  • Promote technical excellence, collaboration, and accountability across stakeholders
  • Make infrastructure complexity transparent to both internal teams and customers, ensuring a consistently excellent client experience
  • Implement, track, and evolve service performance measures such as SLAs, SLOs, and SLIs
  • Anticipate risks related to service availability, capacity, performance regressions, and security vulnerabilities
  • Drive continuous improvement, including leading and facilitating Root Cause Analysis (RCA) activities
  • Ensure timely execution of deployments, upgrades, maintenance activities, and change requests
  • Anticipate workload, plan deliverables, and ensure qualification/validation of upcoming tasks
  • Collaborate closely with engineering to improve platform components, automation, and operational processes
What we offer
  • Great career opportunities around the world
  • Truly collaborative environment with supportive leadership
  • Cutting edge technologies (AI, Cloud, Cybersecurity...)
  • Talented and passionate team members
  • Fun working environment
Employment type: Full-time

Manager of Site Reliability Engineering (SRE)

The Manager of Site Reliability Engineering leads and develops a team of SRE pra...
Location: United States, Birmingham
Salary: Not provided
Company: Genuine Parts Company
Expiration Date: Until further notice
Requirements
  • Typically requires a bachelor's degree and 7 years of experience in a technology and/or software engineering role or an equivalent combination
  • Proven experience working in large, complex enterprise environments (Fortune 500 or equivalent)
  • Strong understanding and demonstrated implementation of Site Reliability Engineering (SRE) principles at scale
  • Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform and ArgoCD
  • In-depth knowledge and practical experience with CI/CD pipelines and automation of software delivery
  • Experience championing DevOps practices and embedding reliability early in the SDLC
  • Significant hands-on experience in Site Reliability Engineering or related roles focused on cloud infrastructure reliability
  • Strong software engineering background with proficiency in infrastructure-as-code tools (e.g., Terraform, ArgoCD) and CI/CD automation
  • Deep knowledge of cloud platforms, specifically Google Cloud Platform (GCP), Kubernetes, container orchestration, and cloud-native architecture
  • Familiarity with monitoring and observability tools such as Dynatrace, Datadog, or equivalents
Job Responsibility
  • Lead, mentor, and grow a high-performing team of Site Reliability Engineers, fostering a culture of ownership, continuous improvement, and operational excellence
  • Implement and champion Site Reliability Engineering principles and DevOps best practices within the team to ensure service reliability, availability, and performance
  • Define and track key SRE metrics such as service uptime, incident response and resolution times
  • Drive automation efforts including CI/CD pipeline enhancements, infrastructure-as-code practices, and self-service infrastructure provisioning to increase deployment velocity while reducing manual toil
  • Own and continuously improve observability practices including system monitoring, logging, alerting, and diagnostics to ensure rapid issue detection and resolution
  • Participate in incident response processes including incident management, root cause analysis, post-mortems, and continuous improvement to enhance system resilience
  • Partner closely with software engineering, product management, architecture, and security teams to embed reliability and security early in the software development lifecycle (SDLC)
  • Oversee the management and scalability of cloud infrastructure environments, primarily on Google Cloud Platform (GCP), with a focus on Kubernetes, container orchestration, and hybrid cloud integrations
  • Advocate for and apply best practices in performance tuning, capacity planning, and system design for high availability
  • Develop and execute a long-term roadmap for our hybrid cloud platform, aligning with evolving business objectives and technology trends
What we offer
  • Comprehensive benefit plans and programs designed to support your health and wellness, provide income protection and build financial security for your retirement
Employment type: Full-time

Site Reliability Engineer (Lead)

10Pearls is an award-winning end-to-end digital innovation company that helps bu...
Location: Pakistan, Islamabad
Salary: Not provided
Company: 10Pearls
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in computer science or related field
  • 5–8 years in SRE or production-engineering roles running distributed systems at scale
  • Deep Kubernetes expertise — operators, RBAC, network policy, storage, upgrades
  • Hands-on with Keycloak / Vault / MinIO / Harbor / Kong or equivalent identity/secrets/storage/registry/gateway stacks
  • Strong Linux fundamentals and at least one systems language (Go, Rust) or shell/Python for tooling
  • Proven SLO/SLI authorship and error-budget-driven decision-making
  • Experience with observability stacks (Prometheus, Grafana, OpenTelemetry, Loki, Tempo)
  • Calm, clear communication during incidents; strong post-mortem writing
  • Hands-on with infra-as-code — Helm, Kustomize, Terraform
Job Responsibility
  • Substrate operation: own the Kubernetes cluster plus Keycloak (identity), Vault (secrets), MinIO (object storage), Harbor (registry), and Kong (gateway), from bootstrap to day-2 operations
  • SLO framework: define, publish, and defend SLOs for every tier-1 service; own error budgets and burn-rate alerting
  • Incident response: build the on-call rotation, paging, runbook library, and post-mortem culture; lead incident command during P1/P2 events
  • Release operations: co-own the blue-green / canary release model with L6 Delivery; sign off production-bound releases
  • Air-gap operations: ensure every operational runbook works in a fully offline environment, with no assumption of external dependencies
  • Lead the Platform squad: technically lead 1 Infrastructure Engineer, 1 Observability Engineer, 2 DevOps Engineers; set standards for infra-as-code and automation
Employment type: Full-time

Site Reliability Engineering Lead

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...
Location: Canada, Mississauga
Salary: 120800.00 - 170800.00 USD / Year
Company: Citi
Expiration Date: Until further notice
Requirements
  • 6–10 years of relevant experience in a hands‑on technical role
  • Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
  • Experience working with senior stakeholders or technology partners
  • Demonstrated experience supporting IT service improvements or platform stability initiatives
  • Strong communication and presentation skills, with the ability to convey technical concepts clearly
  • Experience supporting or contributing to technical roadmaps or operational workstreams
  • Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
  • Ability to collaborate with cross‑functional support teams and technology groups
  • Strong organizational and workload‑planning skills
  • Consistently demonstrates clear and concise written and verbal communication skills
Job Responsibility
  • Demonstrates a strong understanding of how application support contributes to the overall technology function and organizational objectives
  • Assist with vendor relationship management, including coordination with offshore managed services
  • Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
  • Partner with development teams to guide improvements in application stability and supportability
  • Contribute to frameworks for managing capacity, throughput, and latency
  • Assist in defining and implementing application onboarding guidelines and standards
  • Support team members by fostering a collaborative environment and encouraging skill development
  • Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
  • Participate in business review meetings to help align technology tools and strategies with business requirements
  • Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program
Employment type: Full-time

Senior Site Reliability Engineer (SRE) – Cloud & Distributed Systems

We are seeking an experienced Senior Site Reliability Engineer (SRE) to design, ...
Location: United States, Austin
Salary: Not provided
Company: Dutech Systems
Expiration Date: Until further notice
Requirements
  • 8+ years of experience in SRE, DevOps, or Systems Engineering
  • Strong expertise in Linux/Unix systems and system internals
  • Proficiency in at least one programming/scripting language (Python, Go, Java, Bash)
  • Experience designing and operating distributed systems
  • Hands-on experience with cloud platforms (AWS or GCP)
  • Experience with Docker and Kubernetes
  • Strong understanding of monitoring, alerting, and logging concepts
  • Experience managing SLIs, SLOs, and error budgets
  • Experience with incident management and RCA processes
Job Responsibility
  • Design, implement, and manage highly available, distributed systems
  • Maintain and optimize cloud infrastructure (AWS/GCP)
  • Develop automation scripts using Python, Go, Java, or Bash
  • Manage containerized environments using Docker and Kubernetes
  • Define and monitor SLIs, SLOs, and error budgets
  • Implement monitoring, logging, and alerting solutions
  • Lead incident management, root cause analysis (RCA), and postmortems
  • Ensure system security and compliance within operational workflows
  • Improve system reliability through performance tuning and optimization
  • Collaborate with engineering teams to enhance deployment and release processes