SRE Lead Design & Support Engineer

Mexico, Miguel Hidalgo · Job Posted January 29, 2026

Job Description

This role is a critical enabler of high resiliency during operations and of continuous improvement through design during the software development lifecycle. The Lead SRE Design & Support Engineer is an integral part of the global team, whose main purpose is to provide a delightful customer experience for users of the global consumer, commercial, supply chain and enablement functions in the PepsiCo digital products application portfolio of 260+ applications, enabling a full SRE practice built on an incident prevention / proactive resolution model. The scope of this role is focused on cloud architecture and full-stack application development, B2B pepsiconnect, Direct to Customer and other S&T roadmap applications. The role ensures that PepsiCo DPA applications deliver the service performance, reliability and availability expected by our customers and internal groups. It requires a blend of technical expertise in SRE tools and modern cloud application architecture (i.e. full stack), IT operations experience, and analytics and influence skills.

Job Responsibility

  • Drive new shift-left activities that apply Site Reliability Engineering (SRE) and quality assurance principles within the application design / project roadmap to enable resilient outcomes
  • Apply a pre-emptive approach in production to minimize business impact, using SRE-driven orchestration to connect all components of the ecosystem, diagnose anomalies before users are affected, and remediate through automation
  • Ensure ecosystem availability and performance in production environments, proactively preventing P1, P2 and potential P3 incidents
  • Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
  • Be accountable for ensuring that non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early into the product’s offerings as part of the engineering solution
  • Lead the team in diagnosing anomalies before any user is affected and in driving the necessary remediations across the teams involved in the end-to-end availability, performance and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
  • Collaborate with engineering and support teams, including participation in escalations and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. a technical "understanding of interactions" of a full-stack application, alongside peer SRE team members
  • Continuously optimize the L2/support operations work via SRE workflow automation
  • Shape the SRE orchestration platform design with inputs from Production Operations, Business usage & Product and engineering teams
  • Actively engage in and drive AIOps adoption across teams

Requirements

  • 8+ years of work experience, evolving into an SRE engineer role
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE in designing event diagnostics, performance measures and alerting solutions to meet SLAs/SLOs/SLIs
  • Highly quantitative, with great judgment and the ability to connect dots across ecosystems and work efficiently across cross-functional teams
  • Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
  • Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka and any SRE Ops toolsets
  • A firm understanding of cloud architecture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
  • Back-end technologies: Java, Spring Boot, and related technologies used to build server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase
  • Infrastructure: Azure/AWS cloud platforms and/or Client / server environments

Nice to have

Prior experience shaping transformation initiatives and developing SRE solutions would be a plus

What we offer

  • Opportunities to learn and develop every day through a wide range of programs
  • Internal digital platforms that promote self-learning
  • Development programs according to Leadership skills
  • Specialized training according to the role
  • Learning experiences with internal and external providers
  • Recognition programs for seniority, behavior, leadership, moments of life, among others
  • Financial wellness programs that will help you reach your goals in all stages of life
  • A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
  • Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Similar Jobs for SRE Lead Design & Support Engineer

8 matching positions

Site Reliability Engineering (SRE) / Lead Engineer

We are currently seeking a Site Reliability Engineering (SRE) / Lead Engineer to...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • 8-10+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
  • Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation
  • Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
  • Strong proficiency in Infrastructure as Code (IaC) using Terraform
  • Solid understanding of cloud platforms including AWS, GCP, or Azure
  • Experience with automation/configuration management tools like Ansible, Chef, or Puppet
  • Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
  • Experience managing Kubernetes and containerized environments (Docker, Helm)
  • Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
  • Excellent leadership, communication, and collaboration skills
Job Responsibility
  • Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
  • Design and implementation of monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
  • Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
  • Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
  • Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
  • Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
  • Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
  • Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
  • Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence
Employment type: Full-time

Lead Software Engineer - SRE

Wells Fargo is seeking a Lead Site Reliability Engineer (SRE) to join the WIMT P...
Location: United States, Charlotte; Saint Louis
Salary: 119000.00 - 187000.00 USD / Year
Company: Wells Fargo
Expiration Date: Until further notice
Requirements
  • 5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of experience leading observability and monitoring tooling: Splunk, AppDynamics, Splunk Observability, Grafana, OpenTelemetry
  • 5+ years in infrastructure (Windows and Linux) support
  • 5+ years proven success in toil reduction initiatives
  • 5+ years in cloud application management especially OpenShift Container Platform
Job Responsibility
  • Design and implement scalability, reliability, and observability strategies for cloud and on-premise environments
  • Define SLIs (Service Level Indicators), SLOs (Service Level Objectives), and Error Budgets to improve system reliability
  • Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
  • Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
  • Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
  • Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
  • Drive adoption of NFRs, best practices-quality and compliance across observability and performance engineering
  • Ensure high availability and performance of production systems through proactive monitoring and incident response
  • Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
  • Lead projects, teams, or serve as a peer mentor
What we offer
  • Health benefits
  • 401(k) Plan
  • Paid time off
  • Disability benefits
  • Life insurance, critical illness insurance, and accident insurance
  • Parental leave
  • Critical caregiving leave
  • Discounts and savings
  • Commuter benefits
  • Tuition reimbursement
Employment type: Full-time

Senior Support Engineer

The Technical Support team is responsible for ensuring that developers and enter...
Location: United States, San Francisco
Salary: 234000.00 - 260000.00 USD / Year
Company: OpenAI
Expiration Date: Until further notice
Requirements
  • Have a Bachelor’s degree in Computer Science or a related field
  • Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments
  • Have deep familiarity with modern monitoring, alerting, and observability practices
  • Have proven experience leading incident response for high‑severity outages or service disruptions
  • Have strong skills in scripting or software engineering (e.g., Python or similar) to automate repetitive tasks and integrate tools
  • Have solid understanding of cloud infrastructure and distributed systems fundamentals
  • Are effective at working cross‑functionally in a high‑trust environment
  • Strong communication skills to explain technical issues and resolutions to both engineering and non‑technical stakeholders
Job Responsibility
  • Be among the foremost technical and troubleshooting experts for our API platform at OpenAI
  • Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies
  • Configure and use advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time
  • In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates
  • Design and refine incident response processes and documentation across strategic customers, engineering and support teams
  • Analyze operational metrics and incident RCAs to identify areas for improvement
  • Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows
  • Provide support coverage during holidays and weekends based on business needs
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment type: Full-time

Technical Support Engineer

Our client is a globally connected technology organization offering cloud-based ...
Location: Turkey, İstanbul
Salary: Not provided
Company: SET Europa
Expiration Date: Until further notice
Requirements
  • Experience: 3-5+ years of experience in Cloud Computing (IaaS/PaaS/SaaS), DevOps, or Enterprise Architecture
  • Proven Track Record: Experience in supporting Fortune 500 or large-scale enterprise customers
  • Project Leadership: Demonstrated ability to lead complex cloud migration projects or large-scale system troubleshooting under high pressure
  • Certification (Highly Preferred): Alibaba Cloud ACP (Professional) or ACE (Expert) level. Equivalent certifications like AWS Professional/Specialty, Azure Solutions Architect, or Google Cloud Professional Architect
  • Infrastructure Mastery: Deep understanding of Linux/Windows kernel tuning and performance optimization
  • Advanced Networking: Expert knowledge in VPC, BGP, VPN, Express Connect (Direct Connect), and SD-WAN. Ability to analyze packet loss/latency using tools like Wireshark/Tcpdump at a professional level
  • Database & Big Data: Not just 'familiar,' but capable of performance tuning and migration for at least two engines (e.g., MySQL AND Redis/MongoDB)
  • Cloud-Native & Modern Tech: Proficiency in Containerization (Docker/Kubernetes) and Microservices
  • Hands-on experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation
  • Automation & Scripting: Strong ability to automate repetitive tasks using Python, Go, or Shell to improve support efficiency (SRE mindset)
Job Responsibility
  • Complex Incident Management: Beyond daily consulting, act as the final escalation point for L1 issues. Lead the troubleshooting of high-priority (P0/P1) incidents involving complex hybrid cloud architectures
  • Product & Engineering Synergy: Not just 'tracking' bugs, but providing deep-dive technical insights to R&D teams. Influence the product roadmap by identifying systemic architectural flaws and proposing optimization solutions
  • Customer Success & Risk Mitigation: Conduct proactive technical audits and architectural reviews for Key Accounts (KA). Use diagnostic tools not just to 'avoid risks' but to design high-availability (HA) and disaster recovery (DR) strategies
  • Knowledge Empowerment: Create and maintain high-quality technical documentation, troubleshooting playbooks, and internal Knowledge Base (KB) articles to improve the overall team's technical capability.
Employment type: Full-time

DevOps SRE Engineer

We are looking for a mid-senior SRE/DevOps Engineer (5–8 years) to build and sca...
Location: India, Bengaluru
Salary: Not provided
Company: Acuver Consulting
Expiration Date: Until further notice
Requirements
  • 5–8 years of experience in DevOps / SRE roles
  • Strong hands-on experience with AWS (preferred) and/or GCP
  • Expertise in: Kubernetes & Docker; Terraform (Infrastructure as Code); CI/CD tools (GitLab, Jenkins, or similar)
  • Experience with: Event-driven / asynchronous architectures (Kafka, Pub/Sub, etc.); Monitoring & logging tools (Prometheus, Grafana, ELK, etc.); Microservices and distributed systems
  • Solid understanding of: Networking, load balancing, scaling strategies; High availability and fault-tolerant systems
Job Responsibility
  • Design and implement robust CI/CD pipelines (GitLab CI, Jenkins, or similar)
  • Enable automated build, test, and deployment workflows
  • Implement blue-green / canary deployments for zero-downtime releases
  • Ensure release traceability, rollback mechanisms, and deployment governance
  • Design, provision, and manage infrastructure on AWS (primary) and/or GCP
  • Build infrastructure using Infrastructure as Code (Terraform preferred)
  • Create reusable modules for scalable, secure, and standardized environments
  • Optimize cost, performance, and scalability of cloud resources
  • Deploy and manage applications using Docker & Kubernetes
  • Manage Kubernetes workloads using Helm charts
Employment type: Full-time

Senior Network Engineer (Lead)

Piper Companies is seeking a Senior Network Engineer (Lead) for a large‑scale en...
Location: United States, Newton
Salary: 130000.00 - 155000.00 USD / Year
Company: Piper Companies
Expiration Date: Until further notice
Requirements
  • 8+ years of enterprise or service provider network engineering experience, including leadership responsibilities
  • Deep expertise with BGP, OSPF/IS‑IS, MPLS, Segment Routing, EVPN/VXLAN, and QoS
  • Hands-on experience with ACI and/or EVPN/VXLAN data center fabric technologies
  • Strong experience with SD‑WAN (Viptela) and hybrid cloud connectivity
  • Background with ISE (AAA/802.1X/SGTs), next‑gen firewalls (ASA/FTD), and remote access VPN
  • Proficiency with automation tools such as Python, Ansible, or Terraform
  • Experience with ThousandEyes or similar observability tools
  • Strong communication skills and the ability to mentor junior engineers
  • CCNP required
  • CCIE strongly preferred
Job Responsibility
  • Lead the design and optimization of enterprise L2/L3 networks across campus, data center, WAN, and cloud environments
  • Serve as the primary engineer for BGP, OSPF/IS‑IS, MPLS/Segment Routing, EVPN/VXLAN, QoS, and Multicast
  • Architect and support ACI fabrics, including tenants, VRFs, contracts, and multi‑pod/multi‑site deployments
  • Engineer and maintain SD‑WAN (Viptela) architecture, policies, and cloud on‑ramps
  • Support security platforms including ISE (802.1X/SGTs), ASA/FTD/Firepower, Umbrella, Duo, and remote access VPN
  • Enhance observability using telemetry, NetFlow/IPFIX, and ThousandEyes
  • Lead lifecycle upgrades, migrations, and major incident response efforts
  • Collaborate with architects, SRE teams, security engineers, and cross‑functional stakeholders
  • Maintain accurate network documentation including diagrams, standards, and configuration baselines
  • Participate in an after‑hours on‑call rotation to support network reliability
What we offer
  • Comprehensive benefits package including Medical, Dental, Vision, 401k, PTO, holidays, and sick leave as required by law
Employment type: Full-time

Lead Engineer – Platform Engineering

We are looking for a Lead DevOps Engineer to join the Platform Engineering team ...
Location: United States, St Petersburg, Florida
Salary: Not provided
Company: Raymond James
Expiration Date: Until further notice
Requirements
  • Deep experience with virtualization platforms (e.g., VMware vSphere/ESXi, Hyper‑V, KVM/Nutanix)
  • Hands‑on experience with configuration management tools such as Ansible
  • Implement and support enterprise load balancer solutions (e.g., F5 BIG-IP, NGINX, Azure/AWS load balancers), including configuration, automation, and traffic‑routing policies
  • Familiarity with AI‑assisted operations tools (AIOps), or how they can fit into the workflow
  • Solid understanding of CI/CD systems (GitHub Actions, Azure DevOps, Jenkins, GitLab CI)
  • Advanced scripting skills in Python, PowerShell, and/or Bash
  • Experience with provisioned workflow development in ServiceNow
  • Strong knowledge of monitoring and logging platforms (Prometheus/Grafana, Splunk, Elastic, Datadog, etc.)
  • Understanding of security best practices, IAM/RBAC, secrets management, and compliance frameworks
  • Strong networking and systems fundamentals (TCP/IP, DNS, load balancing, storage)
Job Responsibility
  • Design, build, and maintain automation for VM provisioning, configuration, and lifecycle management
  • Enhance and support CI/CD pipelines for infrastructure and platform services
  • Provide technical leadership and mentorship to engineers across the platform engineering team
  • Use AI‑assisted tooling when beneficial for anomaly detection, event correlation, and operational insights
  • Work on standardized VM images, templates, and OS baselines to ensure consistency and security
  • Improve platform reliability through monitoring, alerting, and SRE‑aligned practices
  • Develop and maintain observability tooling, dashboards, and automated remediation workflows
  • Ensure security best practices across VM platforms, including RBAC, secrets management, and patching
  • Optimize VM capacity, performance, and resource utilization across environments
  • Collaborate with development, cloud, and security teams to deliver stable, self‑service platform capabilities
Employment type: Full-time

Senior Support Engineer

The Senior Support Engineer collaborates directly with strategic enterprise acco...
Location: Ireland, Dublin
Salary: Not provided
Company: OpenAI
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science or a related field
  • 5+ years of experience in technical operations roles such as SRE/NOC
  • Strong software engineering foundation
  • Deep familiarity with modern monitoring, alerting, and observability practices
  • Proven experience leading incident response for high‑severity outages
  • Strong skills in scripting or software engineering (e.g., Python or similar)
  • Solid understanding of cloud infrastructure and distributed systems fundamentals
  • Effective at working cross‑functionally in a high‑trust environment
  • Strong communication skills
Job Responsibility
  • Be among the foremost technical and troubleshooting experts for our API platform
  • Proactively identify and implement opportunities to scale support operations
  • Configure and use advanced monitoring and alerting workflows
  • Contribute to reliability reviews and preparedness for new features
  • Design and refine incident response processes and documentation
  • Analyze operational metrics and incident RCAs to identify areas for improvement
  • Provide support coverage during holidays and weekends based on business needs
What we offer
  • Medical, dental, and vision insurance for you and your family
  • Mental health and wellness support
  • PRSA plan with 6% employer matching
  • Unlimited time off
  • Annual learning & development stipend ($1,500 USD equivalent per year)
  • Generous equity
  • Relocation assistance
Employment type: Full-time