Sr. Manager SRE

Mexico, Mexico City · Job Posted June 29, 2026

Job Description

We're building a Site Reliability Engineering center in Mexico City, and we're hiring a Senior Manager-level SRE to serve as the technical anchor for the site - defining the reliability vision, driving cross-team execution, and pioneering automation and AI-driven approaches that transform how we operate three payment networks at scale.

This is a strategic technical leadership role. You won't manage people directly, but you'll shape how multiple teams work - setting architectural direction for observability, automation, operational excellence, alert signal reduction, and reliability platform convergence. You'll be the most senior IC engineer in Mexico City, partnering with the Director (people leader) to translate organizational goals into technical roadmaps and ensuring the engineering quality bar stays high as the site scales.

You'll operate across the full landscape: batch settlement systems processing every domestic and international credit/debit transaction, real-time observability platforms that must detect failures before customers do, and AI-powered automation that eliminates the toil standing between us and a proactive reliability culture.

Job Responsibility

  • Define and maintain a 12-18 month technical vision and roadmap for GPN SRE in Mexico City - decompose destination architecture into deliverable steps, sequence investments, and align execution across teams
  • Drive reliability transformation across settlement, observability, and automation domains - establish SLOs, error budgets, severity frameworks, and operational standards that teams build against
  • Pioneer AI and agentic automation approaches - design and build AI-driven solutions (using Claude Code, Copilot CLI, and LLM frameworks) for alert classification, runbook generation, automated remediation, and incident analysis, and set patterns that other engineers extend
  • Own the technical strategy for domain-specific knowledge ramp-up: identify which domain expertise requires deep engineering investment vs. documentation, and architect systems that reduce reliance on tribal knowledge
  • Lead cross-team technical initiatives - drive observability platform convergence, standardize on COF tooling, and eliminate arbitrary uniqueness across towers
  • Serve as the senior escalation point for complex production incidents - diagnose cascading failures across distributed systems (storage, network, application), drive resolution, and ensure durable fixes land
  • Architect automation for high-risk operational processes - certificate rotation, compliance artifact generation, settlement cycle validation - ensuring security and reliability are built in from design
  • Mentor and elevate engineers across teams - conduct design reviews, establish engineering standards, coach on debugging and system thinking, and create an environment where Principal Associates and Managers grow into domain experts
  • Introduce and advocate for engineering practices that raise the bar - AI engineering, innersourcing, reuse over rebuild, open source contribution, blameless postmortems, and chaos engineering
  • Influence beyond the CDMX site - partner with US and UK leadership on architectural decisions, represent CDMX engineering in cross-org forums, and shape GPN-wide reliability strategy

Requirements

  • Professional English fluency
  • Bachelor's degree
  • At least 8 years of experience in SRE, production operations, or reliability engineering
  • Experience in DevOps Engineering (internship experience does not apply)
  • 8+ years of experience in at least one of the following: Java, Python, Go
  • At least 6 years of experience with Cloud Native technologies (Amazon Web Services, Microsoft Azure, Google Cloud Platform)
  • 5+ years of experience with container orchestration services including Docker or Kubernetes
  • Experience with Shell or Bash scripting
  • At least 5 years of Unix or Linux system administration experience

Nice to have

  • Experience developing automation solutions using agentic AI tools (Claude Code, Copilot CLI)
  • Troubleshooting and debugging skills across distributed systems
  • Familiarity with payments, financial services, or other regulated high-availability domains
  • Knowledge of or experience with networking concepts (TCP, DNS, TLS)

Similar Jobs for Sr. Manager SRE

8 matching positions

Manager / Sr Manager, Engineering (AI Posture)

Location
United States, Santa Clara
Salary
185000.00 - 298000.00 USD / Year
Palo Alto Networks
Expiration Date
Until further notice
Requirements
  • 5-7 years of experience managing software engineering teams within a large-scale organization
  • 5+ years of hands-on software engineering experience with a strong systems-level foundation
  • Proven ability to plan, execute, and deliver complex roadmaps with high predictability, owning distributed cloud products end-to-end
  • Demonstrated experience leading teams through the delivery of complex, data-rich web applications with a focus on performance and usability
  • Demonstrated experience designing and operating large-scale cloud architectures on platforms such as GCP, AWS, or Azure
  • Strong collaboration skills with a track record of aligning cross-disciplinary teams around shared objectives
Job Responsibility
  • Build, mentor, and lead a high-performing software engineering team, fostering a culture of empowerment and driving both individual growth and collective impact
  • Partner closely with Product Management and cross-functional teams (Infrastructure, UX, SRE & QA) to define priorities and shape multi-quarter product roadmaps, ensuring alignment across all stakeholders
  • Own the end-to-end software development lifecycle, translating product strategy into executable plans and ensuring consistent, high-quality, on-time delivery
  • Provide architectural leadership for scalable, distributed systems guiding the design and implementation of high-throughput, cloud-native applications
  • Drive production readiness by enforcing best practices around deployment, observability, reliability, and runtime stability, focusing on the details to ensure operational excellence
  • Align stakeholders across business units through clear communication of technical strategy, trade-offs, priorities, risks, and execution plans
  • Engage directly with strategic customers to lead technical deep dives and architecture reviews, and to influence future product direction
  • Foster a culture of high engineering standards, accountability, and continuous improvement, with a strong emphasis on quality and security
  • Fulltime

Applications Support Sr Manager

Engineer the future of global finance. At Citi, our Tech team doesn't just suppo...
Location
Colombia, Bogotá
Salary
Not provided
Citi
Expiration Date
Until further notice
Requirements
  • 6-10 years of experience in an Apps Support role would be an added advantage, but not essential
  • Experience with people management
  • Progressive, hands-on experience in application support, Site Reliability Engineering (SRE), or technical operations, specifically for mission-critical, high-volume financial applications
  • Demonstrable direct experience with cloud-native architectures, including active configuration and management of microservices, containers (e.g., Kubernetes), and serverless technologies
  • Extensive practical experience with major Public Cloud platforms (e.g., AWS, Azure, GCP) and enterprise private cloud environments
  • Proven track record in implementing and operating comprehensive observability stacks (e.g., Prometheus, Grafana, ELK stack, Jaeger, distributed tracing)
  • Deep understanding and direct application of resiliency engineering principles (e.g., circuit breakers, bulkheads, retry mechanisms) and robust disaster recovery strategies
  • Strong technical background in instant payments or real-time financial transaction processing systems is highly desirable
  • Expertise in automation, scripting (e.g., Python, Go, Shell), and infrastructure-as-code principles (e.g., Terraform, CloudFormation)
  • Excellent communication, interpersonal, and team leadership skills, with the ability to manage and motivate a technical team while remaining deeply technical
Job Responsibility
  • Hands-On Operational Leadership: Directly manage, mentor, and develop a technical support team while actively engaging in day-to-day operational tasks, incident response, and problem resolution for the Instant Payments application
  • Direct Operational Management: Take direct ownership of ensuring the operational stability and performance of the Instant Payments application across diverse cloud environments (Citi's Enterprise Cloud and Public Cloud), including active monitoring and system checks
  • Technical Implementation & Optimization: Lead the implementation, configuration, and continuous optimization of observability (monitoring, logging, tracing tools), resiliency (designing and implementing auto-healing and retry mechanisms), and recoverability (executing disaster recovery strategies) solutions for the cloud-native Instant Payments application. This includes writing and maintaining scripts for these functions
  • Service Level Execution & Improvement: Directly contribute to improving service levels by implementing operational efficiencies, performing incident management, problem management, and enhancing knowledge sharing practices for the Instant Payments application
  • Application Onboarding & Technical Guidance: Actively participate in defining and implementing application onboarding guidelines and standards. Provide direct technical guidance to development teams on stability and supportability improvements for the Instant Payments application
  • Incident & Problem Resolution: Lead and execute troubleshooting efforts for complex technical issues, perform in-depth root cause analysis, and implement permanent fixes for the Instant Payments application
  • Cost Efficiency & Automation: Identify and implement opportunities for cost reduction and operational efficiencies through proactive analysis, performance tuning, and the development of automation scripts and tools. Ensure adherence to support process and tool standards
  • Technical Communication: Effectively communicate technical details, application status, operational risks, and support initiatives to product teams, development teams, and relevant stakeholders
  • Risk & Compliance: Directly ensure operational risk is managed effectively and compliance with applicable policies, rules, and regulations is maintained for the Instant Payments application support function
What we offer
  • opportunity to grow your career
  • give back to your community
  • make a real impact
  • mentorship
  • continuous learning
  • flexibility with potential hybrid work opportunities
  • Fulltime

Manager, SRE Risk Advisory and Oversight

Manager, SRE Risk Advisory and Oversight at Capital One. Capital One is one of t...
Location
United States, McLean; New York
Salary
197300.00 - 245600.00 USD / Year
Capital One
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree or military experience
  • At least 4 years of experience in Technology Management, Software Engineering, Site Reliability Engineering, or Cyber Risk Management
  • At least 2 years of experience with cloud implementations (AWS, GCP, or Azure)
  • At least 1 year of experience with open-source programming languages
Job Responsibility
  • Perform Deep-Dive Risk Analysis: Conduct independent, technical risk assessments of cloud infrastructure architectures, software delivery lifecycles, and observability frameworks to identify systemic resilience and stability risks
  • Support Effective Challenge: Evaluate first-line cloud engineering practices against enterprise risk appetites, ensuring robust strategies are maintained for automation, system resiliency, performance, and monitoring
  • Build Storytelling & Reporting Materials: Partner with team leadership (Sr. Managers and Directors) to translate complex, highly technical engineering data into structured risk reports, presentation decks, and executive storytelling materials
  • SRE Subject Matter Expertise: Serve as a trusted technical analyst on core SRE pillars, assessing the design and maturity of Service Level Indicators/Objectives (SLIs/SLOs), error budgets, release pipelines (CI/CD), and toil reduction efforts
  • Evaluate AI & Tech Integration: Actively evaluate the integration of cutting-edge technologies—specifically cloud-native stacks, containerization, and the application of emerging Gen AI/ML tooling within software delivery—to ensure reliable operational boundaries
  • Formulate Risk Recommendations: Collaborate across the second line of defense to design, adjust, and recommend appropriate mitigating controls and guardrails for emerging cloud tech
  • Stakeholder Partnership: Build and maintain collaborative relationships with first-line engineers, architects, and technical owners to ensure risk assessments are thoroughly understood and communicated transparently
What we offer
  • performance based incentive compensation, which may include cash bonus(es) and/or long term incentives (LTI)
  • health, financial and other benefits
  • Fulltime

Apps Development Sr Manager - Vice President

A senior-level position responsible for accomplishing results by designing, impl...
Location
Canada, Mississauga
Salary
120800.00 - 170800.00 USD / Year
Citi
Expiration Date
Until further notice
Requirements
  • 8+ years of relevant experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering
  • Hands-on working experience with container orchestration using OpenShift and Kubernetes
  • Strong, demonstrable experience with CI/CD tools, specifically Tekton and Harness
  • Extensive experience with observability and monitoring stacks, including Prometheus and Grafana
  • Proficiency in Infrastructure as Code (IaC) and configuration management tools
  • Experience with scripting and automation
  • Ability to work proactively and independently to address project requirements, and articulate issues/challenges with enough lead time to mitigate project delivery risks
  • A history of conducting code reviews and ensuring high standards for infrastructure and automation code
  • Basic knowledge of industry practices and standards in the DevOps and SRE space
  • Consistently demonstrates clear and concise written and verbal communication
Job Responsibility
  • Design, build, and maintain the CI/CD infrastructure and tools, with a focus on Tekton and Harness
  • Manage, scale, and secure OpenShift container platforms, ensuring high availability and reliability
  • Develop and manage infrastructure as code (IaC) to automate provisioning and configuration of environments
  • Implement and manage a comprehensive observability stack using tools like Prometheus, Grafana, and others to monitor system health, performance, and reliability
  • Collaborate with development teams to create a seamless developer experience and ensure applications are built with scalability, reliability, and security in mind
  • Utilize in-depth knowledge and skills across multiple infrastructure and development areas to provide technical oversight for the platform
  • Contribute to the formulation of strategies for platform engineering and DevOps functional areas
  • Provide evaluative judgment based on the analysis of factual data in complicated and unique situations, including root cause analysis and problem resolution
  • Impact the DevOps and Platform Engineering area through monitoring delivery of end results and ensuring essential procedures are followed and contribute to defining standards
  • Appropriately assess risk when technical decisions are made, demonstrating particular consideration for the firm's reputation and safeguarding Citigroup, its clients, and assets, by driving compliance with applicable laws, rules, and regulations, adhering to Policy, and applying sound ethical judgment
  • Fulltime

Sr Manager, Site Reliability (SASE)

We are looking for a visionary Senior Manager of Site Reliability Engineering to...
Location
United States, Santa Clara
Salary
182000.00 - 294425.00 USD / Year
Palo Alto Networks
Expiration Date
Until further notice
Requirements
  • 10+ years in SRE, Infrastructure or DevOps environments
  • 5+ years managing global teams of 15+ engineers across multiple time zones
  • Deep understanding of Cloud Native ecosystems (Azure/AWS/GCP), Kubernetes and CI/CD pipelines
  • Proven track record of implementing ML-driven monitoring (e.g., anomaly detection, automated root cause analysis, event correlation)
  • Exceptional ability to translate 'deep tech' into business value for C-suite stakeholders
  • Experience using AI tools like Claude, Gemini or Copilot to build solutions is mandatory
Job Responsibility
  • Directly manage and scale a high-performing, multi-geographical SRE team (US and India), fostering a culture of psychological safety, continuous learning, and 'operational pride'
  • Standardize SRE practices globally while respecting local nuances, ensuring 24/7 coverage models (Follow-the-Sun) are seamless and burnout-resistant
  • Manage the financial aspects of global headcount and cloud infrastructure spend
  • Drive the Autonomous SRE Roadmap: Transition the organization from reactive monitoring to proactive, AI-driven observability and incident remediation using machine learning to reduce Mean Time to Recovery (MTTR)
  • Act as the lead consultant for infrastructure product teams to define what 'reliability' looks like for next-gen AI services
  • Partner with the Platform Engineering team to build and internalize 'Golden Paths' that bake in SLOs, error budgets, and automated canary analysis
  • Work hand-in-hand with InfoSec and Compliance to automate guardrails (Policy-as-Code) and ensure global data sovereignty requirements are met
  • Influence R&D leadership to prioritize non-functional requirements and technical debt reduction
What we offer
  • restricted stock units
  • bonus
  • Fulltime

Sr Manager, Platform DevOps

This role leads globally distributed DevOps/SRE teams across the US and India, w...
Location
United States, Frisco; Atlanta; Bellevue
Salary
160000.00 - 288500.00 USD / Year
T-Mobile
Expiration Date
Until further notice
Requirements
  • Bachelor's Degree plus 7 years of related work experience OR a combination of education and experience deemed equivalent. Acceptable areas of study include Computer Science, Engineering, IT or equivalent experience. (Required)
  • 7-10 years of relevant Product Management experience in an agile software product development environment. (Required)
  • 2-4 years of experience in a leadership role. (Required)
  • 7-10 years of technical leadership: strong command of cloud infrastructure (AWS & Azure), CI/CD systems, GitLab administration, IaC tools (Terraform/CloudFormation/Bicep), automation, and modern DevOps/SRE methodologies. (Preferred)
  • 2-4 years of experience managing teams of 5 or more resources in direct reporting relationships in a Platform Management organization. (Preferred)
  • At least 18 years of age
  • Legally authorized to work in the United States
  • Strong understanding of Software Development Life Cycle (SDLC) and Agile methodologies
  • Experience delivering complex technology initiatives across engineering and operations
  • Expertise in vulnerability management, cloud security procedures, secure SDLC, compliance frameworks, and regulatory alignment
Job Responsibility
  • Lead and manage distributed DevOps/SRE teams (US and India) globally, ensuring effective workforce planning, shift and availability management, performance development, mentorship, and continuous skill growth aligned with organizational needs
  • Own the security and vulnerability management lifecycle, ensuring timely remediation, cloud posture hardening, secure configuration management, and alignment with enterprise security, governance, and risk controls
  • Lead implementation of observability platforms across monitoring, logging, tracing, and alerting; develop dashboards and insights to proactively identify failures, bottlenecks, and performance deviations
  • Define and implement continuous improvement practices across technical fields and organizational processes
  • Drive SRE frameworks, including SLA/SLI/SLO definitions, reliability measurement, error-budget policies, and adoption of standards that improve operational excellence
  • Provide end-to-end ownership of incident management, including response coordination, root-cause analysis (RCA), post-incident reviews, and implementation of corrective actions to strengthen system resilience
  • Oversee technical vendor relationships to incorporate feature and function requests into product releases
  • Drive and maintain the current and future technical roadmap in collaboration with design and architecture teams
  • Collaborate with product, architecture, quality, and security organizations to align technical priorities and delivery objectives
What we offer
  • competitive base salary
  • annual stock grant
  • employee stock purchase plan
  • 401(k)
  • free, year-round money coaches
  • medical, dental and vision insurance
  • flexible spending account
  • paid time off and up to 12 paid holidays
  • paid parental and family leave
  • family building benefits
  • Fulltime

Sr. Manager - AI Platform Lead

We are developing an Enterprise AI Platform to help all employees build, deploy,...
Location
India, Hyderabad
Salary
Not provided
Amgen
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience, with a total of 12-17 years of industry experience
  • 8+ years of engineering experience building/platforming cloud services or developer platforms, with 3+ years leading engineering teams or technical programs
  • Proven experience designing and operating cloud-native platforms (Kubernetes, containers, microservices, service meshes)
  • Hands-on experience with LLM serving or model-serving patterns (hosting models, request routing, batching, scaling, latency/cost tradeoffs) — or adjacent experience (large-scale inference endpoints, model CI/CD)
  • Practical knowledge of API/Gateway patterns, authentication/authorization, and secure integrations
  • Familiarity with cost attribution and FinOps concepts for compute/AI workloads and toolchains for measuring and controlling model/agent costs
  • Strong track record working with product managers and senior technical stakeholders to deliver platform capabilities and roadmaps
  • Excellent communication skills: able to explain technical tradeoffs to technical and non-technical audiences
  • Experience with observability and SRE practices (metrics, tracing, logging, incident management)
Job Responsibility
  • Provide technical leadership and clear, pragmatic product-centric architecture for our AI platforms
  • Translate product and business requirements into scalable platform capabilities (agent hosting, LLM serving, gateway/integration architecture, observability and operations)
  • Drive platform decisions around LLM serving (model endpoints, caching, batching, latency vs. cost tradeoffs), AI Gateways (routing, policy, rate-limiting, auditing), and agent hosting patterns (single/multi-tenant, sandboxing, lifecycle)
  • Own platform reliability, scalability and cost: define SLIs/SLOs, capacity planning, cost attribution and FinOps practices
  • Collaborate with Product Owners, Principal Engineers and stakeholders to define the roadmap, acceptance criteria, and delivery milestones
  • Lead, coach and grow a high-performing engineering team focused on platform services, integrations (low/no-code tooling such as n8n, and pro-code agent hosting frameworks like AgentCore or equivalents), CI/CD for agents/models, and marketplace features
  • Establish standards for security, compliance and model governance (data handling, access controls, logging and auditability), particularly for regulated environments
  • Be hands-on when needed — prototype architectures, review designs, troubleshoot production incidents, and participate in code/design reviews
What we offer
  • Competitive and comprehensive Total Rewards Plans that are aligned with local industry standards
  • Fulltime

Sr Principal Site Reliability Engineer (Sovereign Cloud)

The Prisma Access team is seeking a seasoned Principal Site Reliability Engineer...
Location
Bulgaria, Sofia
Salary
Not provided
Palo Alto Networks
Expiration Date
Until further notice
Requirements
  • 10+ years of experience in Infrastructure, SRE, or DevOps roles
  • BS or MS in Computer Science, a related field, or equivalent professional experience
  • 7+ years of experience with GCP, and expertise in their architecture, services and PKI concepts for cloud security
  • Expert troubleshooting skills to resolve cloud infrastructure and service issues, effectively identifying root cause and devising effective solutions
  • Proficiency in automation using Python and shell scripting
  • Expertise in Infrastructure as Code (IaC) with Terraform and Helm, leveraging AI tools for development
  • Solid experience with Kubernetes, container networking, and container workloads
  • Strong Linux administration skills
  • Proficiency with CI/CD pipelines, GitOps principles, and tooling like GitLab and Jenkins
  • Excellent written and verbal communication skills, with the ability to collaborate effectively to drive outcomes
Job Responsibility
  • Design, build, and operate reliable, secure Cloud infrastructure across multi-cloud environments for our sovereign customers
  • Lead cross-functional initiatives to ensure applications are production-ready, scalable, secure, and resilient
  • Develop expertise in new technologies, embracing continuous learning and the adoption of AI tools
  • Develop tools and automation frameworks, championing Infrastructure as Code (IaC) and Monitoring as Code (MaC) principles
  • Automate robust deployments and orchestrate end-to-end monitoring and alerting solutions
  • Participate in on-call rotations to support critical business and production systems
  • Lead root cause analysis of critical issues, driving improvements and preventing recurrence
  • Champion the success of SRE and DevOps initiatives, aligning technical decisions with business goals
  • Fulltime