CrawlJobs Logo

SRE Lead Design & Support Engineer

pepsico.com Logo

Pepsico

Location Icon

Location:
Mexico , Miguel Hidalgo

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

Not provided

Job Description:

This is a critical enabler achieving a high resiliency during operations and also continuously improving through design during the software development lifecycle. The Lead SRE design & support engineer is integral part of the global team with its main purpose to provide a delightful customer experience for the user of the global consumer, commercial, supply chain and enablement functions in the PepsiCo digital products application portfolio of 260+ applications, enabling a full SRE Practice incident prevention / proactive resolution model. The scope of this role is focussed on the cloud architecture application full stack devlopment, B2B pepsiconnect and Direct to Customer and other S&T roadmap applications. Ensures that PepsiCo DPA applications service performance, reliability and availability expected by our customers and internal groups. It requires a blend of technical expertise on SRE tools, modern applications cloud architecture i.e. full stack, IT operations experience, and analytics & influence skills.

Job Responsibility:

  • Drive new shift left activities critical to apply Site Reliability Engineering (SRE) and quality assurance principles within the application design / Project roadmap that enablees resilient outcomes
  • Apply pre-emptive approach into production minimizing business impact, via SRE-driven orchestration of connecting all components of the ecosystem diagnosing anomalies prior to user & remediating through automation
  • Ensure ecosystem availability and performance in production environments, Pro-actively preventing P1, P2, potential P3s
  • Engage & influence product and engineering teams during the design and development phases to embed reliability and operability into new services defining & enforce events, logging, monitoring, and observability standards across applications
  • Accountable to institute non-functional requirements (NFRs) are embedded early including SLA/SLO/SLI and error budgets into the product’s offerings as part of the engineering solution
  • Leads the team diagnosing any anomalies prior to any user and driving the necessary remediations across the teams involved in end-to-end ecosystem availability, performance and consumption of the cloud architected application ecosystem leveraging SRE Orchestration solutions
  • Collaborates with Engineering & support teams, including participation in escalations, and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose & improve the end-2-end ecosystem performance of the Modern architected application portfolio i.e. technical “understanding of interactions" of a full stack application alongside with peer SRE team member
  • Continuously optimize the L2/support operations work via SRE workflow automation
  • Shape the SRE orchestration platform design with inputs from Production Operations, Business usage & Product and engineering teams
  • Actively engage and drive AI Ops adoption across teams

Requirements:

  • 8+ years of work experience evolving to a SRE engineer
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
  • Highly quantitative, have great judgment, able to connect dots across ecosytems, and efficiently work cross-functionally across teams
  • A strong expertise of SRE (Software Reliability Engineering) and IT Service Management (ITSM) processes
  • Hands on experience in Python, SQL /No-SQl( MySQL, Mongo DB, Cassandra, Postgress), AppDynamics, ELK Stack Grafana, Splunk, Dynatrace, Kafka and any SRE Ops toolsets
  • A firm understanding of cloud archticture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
  • Back-end technologies: Server-side languages (Java, Spring Boot, and related technologies that build the server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase)
  • Infrastructure: Azure/AWS cloud platforms and/or Client / server environments

Nice to have:

Prior experience involving in shaping transformation developing SRE solutions would be a plus

What we offer:
  • Opportunities to learn and develop every day through a wide range of programs
  • Internal digital platforms that promote self-learning
  • Development programs according to Leadership skills
  • Specialized training according to the role
  • Learning experiences with internal and external providers
  • Recognition programs for seniority, behavior, leadership, moments of life, among others
  • Financial wellness programs that will help you reach your goals in all stages of life
  • A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
  • Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Additional Information:

Job Posted:
January 29, 2026

Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for SRE Lead Design & Support Engineer

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...
Location
Location
Canada; United States
Salary
Salary:
195000.00 - 285000.00 USD / Year
apollo.io Logo
Apollo.io
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of hands-on software or infrastructure engineering experience
  • 2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
  • Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
  • Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
  • Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, NewRelic) and performance tuning
  • Strong grounding in networking, security, and reliability principles
  • Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale
Job Responsibility
Job Responsibility
  • Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
  • Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
  • Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
  • Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
  • Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
  • Run effective 1:1s, career development conversations, and quarterly performance reviews
  • Support recruiting efforts to attract top engineering talent across time zones
What we offer
What we offer
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
  • Fulltime
Read More
Arrow Right
New

Lead Systems Operations Engineer - Platform Reliability Engineering, Sre, Observability And Monitoring, Platform Support

Wells Fargo is seeking a Lead Systems Operations Engineer. Platform Reliability ...
Location
Location
India , Bengaluru
Salary
Salary:
Not provided
https://www.wellsfargo.com/ Logo
Wells Fargo
Expiration Date
May 21, 2026
Flip Icon
Requirements
Requirements
  • 5+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of experience in Systems Operations, SRE, Platform Engineering, or Production Support with deep expertise in at least one platform domain: Database, Cloud, Network, Compute/Storage, Middleware, or Enterprise Application Support
  • Strong hands-on experience applying SRE practices, including SLI/SLO definition, error budgets, and reliability metrics
  • Proven experience troubleshooting and resolving large-scale, distributed production systems
  • Hands-on experience with observability and monitoring tools such as Grafana, Splunk, Prometheus, Cribl, ThousandEyes, AppDynamics, or equivalent, including dashboards, alerting, logs, and metrics
  • Strong scripting and automation skills using Python, Bash, and/or PowerShell to reduce operational toil
  • Experience building automation or reliability tooling using APIs, Git-based workflows, and modern engineering practices
  • Solid understanding of incident, problem, and change management in enterprise production environments
  • Strong communication and influencing skills across engineering teams and senior leadership
  • Experience with capacity management, performance engineering, and resiliency design (HA, fault tolerance, RTO/RPO)
Job Responsibility
Job Responsibility
  • Lead complex, broad impact initiatives including provision of high level systems consultation for the technology teams
  • Work as key participant in large scale planning of computer systems and network infrastructure for Systems Operations functional area
  • Review and analyze complex technical challenges, as well as escalated support issues related to core business solutions that require in depth evaluation of multiple factors, such as alternatives, enhancements, periodic systems reviews, or improvements to existing systems
  • Make decisions on technical changes and enhancements
  • Consult with engineering team on change design requiring solid understanding of technical process controls or standards that influence and drive new initiatives
  • Collaborate and consult with technical peers, colleagues, and mid to more experienced level managers to resolve systems support issues and achieve goals
  • Fulltime
Read More
Arrow Right

Senior Site Reliability Engineer

Site Reliability Engineering at Affirm is a small, yet crucial, team that helps ...
Location
Location
Poland
Salary
Salary:
301000.00 - 401000.00 PLN / Year
affirm.com Logo
Affirm
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 4+ years of experience designing, developing and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
  • A track record of developing highly available distributed systems using technologies like AWS, MySQL and Kubernetes
  • Meaningful experience contributing in or driving parts of the Incident Lifecycle process, enabling actionable insights that improve the quality culture, reliability, resilience, and system performance
  • 4+ years working in a Site Reliability or Production Engineering team
  • Experience defining a technical plan for the delivery of a significant feature or system component with an elegant, simple and extensible design
  • Experience in making impactful changes in a large code base, and have developed a suite of tools and practices that enable you and your team to do so safely
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team
Job Responsibility
Job Responsibility
  • Own and deliver quarterly goals for your team, lead engineers on your team through ambiguity to solve open-ended problems, and ensure that everyone is supported throughout delivery
  • Support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience & analytics by participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs
  • Proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis
  • Support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” & on-call efforts
  • Foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • Help develop talent on your team by providing feedback and guidance, and leading by example
What we offer
What we offer
  • Flexible Spending Wallets for tech, food and lifestyle
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental benefits
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Fulltime
Read More
Arrow Right

Senior Lead Systems Operations Engineer

Wells Fargo is seeking a Senior Lead Systems Operations Engineer.
Location
Location
India , Bengaluru
Salary
Salary:
Not provided
https://www.wellsfargo.com/ Logo
Wells Fargo
Expiration Date
May 24, 2026
Flip Icon
Requirements
Requirements
  • 7+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 7+ years of experience in Systems Operations, SRE, Platform Engineering, or Production Support with deep expertise in at least one platform domain: Database, Cloud, Network, Compute/Storage, Middleware, or Enterprise Application Support
  • Strong hands-on experience applying SRE practices, including SLI/SLO definition, error budgets, and reliability metrics
  • Proven experience troubleshooting and resolving large-scale, distributed production systems
  • Hands-on experience with observability and monitoring tools such as Grafana, Splunk, Prometheus, Cribl, ThousandEyes, AppDynamics, or equivalent, including dashboards, alerting, logs, and metrics
  • Strong scripting and automation skills using Python, Bash, and/or PowerShell to reduce operational toil
  • Experience building automation or reliability tooling using APIs, Git-based workflows, and modern engineering practices
  • Solid understanding of incident, problem, and change management in enterprise production environments
  • Strong communication and influencing skills across engineering teams and senior leadership
  • Experience with capacity management, performance engineering, and resiliency design (HA, fault tolerance, RTO/RPO)
Job Responsibility
Job Responsibility
  • Act as an advisor to senior leadership to develop or influence platform support solutions for highly complex business and technical needs or technology initiatives
  • Lead highly complex, broad impact initiatives including provision of high-level systems consultation for the technology teams related to large scale planning of computer systems and network infrastructure for Systems Operations functional areas
  • Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking
  • Translate advanced technology experience, in-depth knowledge of the organizations tactical and strategic business objectives, the enterprise technological environment, the organization structure, and strategic technological opportunities and requirements into technical engineering solutions
  • Provide vision, direction and expertise to senior leadership on implementing innovative and significant business solutions
  • Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
  • Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
  • Provide training and mentoring to less experienced team members on guidebook changes and lead team to meet technical deliverables, while leveraging solid understanding of technical process controls or standards
  • Act as a Platform Reliability Engineering (PRE) subject matter expert, providing deep technical leadership in one core domain (Database, Cloud, Network, Compute/Storage, Middleware, or Application Support)
  • Lead analysis and resolution of complex, systemic production reliability issues, translating recurring incidents into long-term engineering solutions
  • Fulltime
Read More
Arrow Right

Engineering Manager, Infrastructure Engineering

This is not a traditional SRE or DevOps role. Whatnot's Reliability Engineering ...
Location
Location
Poland , Kraków
Salary
Salary:
Not provided
whatnot.com Logo
Whatnot
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years of experience in infrastructure or platform engineering
  • 5+ years managing engineering teams
  • Experience leading managers or multiple teams a plus
  • Proven track record building and operating large-scale distributed systems with strong reliability, observability, and incident response practices
  • Deep technical grounding in one or more of: SLO design, monitoring/alerting, incident tooling, traffic control mechanisms, load and chaos testing, or platform engineering
  • Experience leading teams that ship developer-facing platforms, frameworks, or internal tools
  • Strong software engineering fundamentals
  • Demonstrated ability to guide teams through complex system challenges, large-scale migrations, and longer-term reliability initiatives
  • Exceptional communication and leadership skills
  • A passion for enabling teams to build fast while building safely through well-designed tooling and proactive detection mechanisms
Job Responsibility
Job Responsibility
  • Lead and mentor a team of highly skilled software engineers, supporting their technical growth, execution, and long-term career development
  • Set technical direction and quality standards for the team while empowering senior ICs to own design and architecture decisions
  • Develop and execute the strategic roadmap for reliability engineering at Whatnot
  • Build and operationalize best practices that empower product and platform teams to design and run reliable systems
  • Own the strategic roadmap for reliability tooling, including incident response systems, SLO measurement platforms, and developer-facing reliability libraries
  • Lead the team in designing and building traffic control systems as reusable platform components
  • Lead the design and execution of load testing at scale
  • Drive continuous improvement in incident detection and mitigation
  • Collaborate with cross-functional teams to influence product and architectural decisions that improve overall reliability and customer impact
  • Partner with Infrastructure and Engineering leadership to shape reliability strategy and investment priorities across the organization
  • Fulltime
Read More
Arrow Right

Staff Site Reliability Engineer

Site Reliability Engineering at Affirm is a small, yet crucial, team that helps ...
Location
Location
Poland
Salary
Salary:
358000.00 - 458000.00 PLN / Year
affirm.com Logo
Affirm
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years of experience designing, developing, advocating as a point subject of reference, and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
  • Extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark and Kubernetes
  • Track record of managing, driving and improving the Incident Livecycle process from live incident management through retrospective and post-incident analysis to provide actional insights to enhance overall system reliability, resilience, and performance
  • 7+ years experience in Site Reliability or Production Engineering teams
  • Demonstrate curiosity with empathy, and strong opinions loosely held
  • Experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan
  • Write high quality code that is easily understood and used by others
  • Thrive in ambiguity, and are comfortable moving from low level language idioms all the way to the architecture of large systems to understand how they work
  • Growth and impact trajectory demonstrates that you have mastered gathering and iterating on feedback from your engineering and cross-functional peers
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization
Job Responsibility
Job Responsibility
  • Set technical strategy vision for your team on a multi year-long time scale, and help your team tie it together with critical, business-impacting projects
  • Collaborate across teams in the product development lifecycle by collaborating with infrastructure, product management, developer experience & analytics to ensure technical sustainability, risks and trade-offs are well understood and managed
  • Act as a force-multiplier for your team through your definition and advocacy of technical solutions and operational processes
  • Take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing and alerting in place to support “keep the lights on” & on-call efforts
  • Foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • Help develop talent on your team by providing feedback and guidance, and leading by example
What we offer
What we offer
  • Flexible Spending Wallets for tech, food and lifestyle
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental leave
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Fulltime
Read More
Arrow Right

Staff Site Reliability Engineer

Affirm is reinventing credit to make it more honest and friendly, giving consume...
Location
Location
Spain
Salary
Salary:
101000.00 - 131000.00 EUR / Year
affirm.com Logo
Affirm
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years of experience designing, developing, advocating as a point subject of reference, and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
  • Extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark and Kubernetes
  • Track record of managing, driving and improving the Incident Livecycle process from live incident management through retrospective and post-incident analysis to provide actional insights to enhance overall system reliability, resilience, and performance
  • 7+ years experience in Site Reliability or Production Engineering teams
  • Experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan
  • Ability to write high quality code that is easily understood and used by others
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization
  • Equivalent practical experience or a Bachelor’s degree in a related field
  • Based in Spain for the role
Job Responsibility
Job Responsibility
  • Set technical strategy vision for your team on a multi year-long time scale, and help your team tie it together with critical, business-impacting projects
  • Collaborate across teams in the product development lifecycle by collaborating with infrastructure, product management, developer experience & analytics to ensure technical sustainability, risks and trade-offs are well understood and managed
  • Act as a force-multiplier for your team through your definition and advocacy of technical solutions and operational processes
  • Take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing and alerting in place to support “keep the lights on” & on-call efforts
  • Foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • Help develop talent on your team by providing feedback and guidance, and leading by example
  • Participate in an on-call rotation
What we offer
What we offer
  • Flexible Spending Wallets for tech, food and lifestyle
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental benefit
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Visa sponsorship
  • Fulltime
Read More
Arrow Right

Manager, Engineering Operations

Tucows Domains is looking for a Manager, Engineering Operations to lead the team...
Location
Location
Canada
Salary
Salary:
154000.00 - 168000.00 CAD / Year
tucows.com Logo
Tucows
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years of experience in DevOps, infrastructure engineering, site reliability engineering, or systems engineering
  • Experience leading or mentoring engineers in a technical environment
  • Hands-on experience with AWS services such as EC2, VPC, S3, RDS, IAM, and CloudWatch
  • Experience operating and supporting production infrastructure at scale
  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • Experience leading engineering or infrastructure teams while remaining technically engaged
  • Strong knowledge of AWS cloud infrastructure and distributed systems
  • Experience managing or operating hybrid infrastructure (cloud and on-premises)
  • Familiarity with Infrastructure as Code (Terraform, CloudFormation or similar)
  • Experience with CI/CD pipelines, automation, and modern DevOps practices
Job Responsibility
Job Responsibility
  • Lead the Engineering Operations Team
  • Build, mentor, and support a team of infrastructure and operations engineers
  • Set clear priorities, goals, and performance expectations for the team
  • Foster a culture of ownership, operational excellence, and continuous improvement
  • Own Infrastructure Reliability & Operations
  • Oversee the reliability and performance of infrastructure across AWS and hybrid environments
  • Drive improvements in observability, incident response, and operational processes
  • Lead root cause analysis and ensure learnings translate into long-term improvements
  • Guide Infrastructure & Platform Strategy
  • Partner with cross-functional groups such as Security Operations, Compliance and diverse Engineering teams to design and operate scalable, resilient infrastructure
What we offer
What we offer
  • Generous benefits
  • Reasonable accommodation for individuals with disabilities
  • Total rewards offering including fair compensation and generous benefits
  • Fulltime
Read More
Arrow Right