SRE Design & Support Engineer Job at PepsiCo (Hyderabad)

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...

Location

Mexico , Miguel Hidalgo

Salary:

Not provided

Pepsico

Expiration Date

Until further notice

Requirements

8+ years of work experience evolving into an SRE engineer
3-5 years of experience in continuously improving and transforming IT operations ways of working
Bachelor’s degree in Computer Science, Information Technology or a related field
Proven experience as an SRE in designing event diagnostics, performance measures and alerting solutions to meet SLAs/SLOs/SLIs
Highly quantitative, with great judgment, able to connect dots across ecosystems and work efficiently across teams
Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka and any SRE Ops toolsets
A firm understanding of cloud architecture for distributed environments
Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
Back-end technologies: server-side languages and frameworks (Java, Spring Boot, and related technologies) that build the server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase

Job Responsibility

Drive new shift-left activities critical to applying Site Reliability Engineering (SRE) and quality assurance principles within the application design / project roadmap that enables resilient outcomes
Apply a pre-emptive approach in production to minimize business impact, using SRE-driven orchestration that connects all components of the ecosystem, diagnoses anomalies before users notice them, and remediates through automation
Ensure ecosystem availability and performance in production environments, proactively preventing P1, P2, and potential P3 incidents
Engage & influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
Accountable for ensuring non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early into the product’s offerings as part of the engineering solution
Lead the team in diagnosing anomalies before any user is affected and in driving the necessary remediations across the teams involved in end-to-end ecosystem availability, performance and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
Collaborate with engineering and support teams, including participation in escalations and blameless postmortems
Work closely with customer-facing support teams to empower them with SRE insights and tooling
Observe, diagnose & improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. a technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
Continuously optimize the L2/support operations work via SRE workflow automation

What we offer

Opportunities to learn and develop every day through a wide range of programs
Internal digital platforms that promote self-learning
Development programs tailored to leadership skills
Specialized training according to the role
Learning experiences with internal and external providers
Recognition programs for seniority, behavior, leadership, moments of life, among others
Financial wellness programs that will help you reach your goals in all stages of life
A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life


Senior Systems Operations Engineer - SRE and AIOps

Wells Fargo is seeking a Senior Systems Operations Engineer within the Enterpris...

Location

India , Hyderabad

Salary:

Not provided

Wells Fargo

Expiration Date

June 22, 2026

Requirements

4+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
Strong Java / backend service development experience
Distributed systems and API-based service design
CI/CD pipelines and Git-based workflows
3+ years of experience with scripting and infrastructure automation using Terraform
3+ years of hands-on experience with OpenShift, GCP or Azure platform enablement and application migrations, build out of complex infrastructure programmable patterns using Infrastructure as Code (IaC)
2+ years of knowledge and understanding of Cloud service offerings such as data, analytics, AI/ML on GCP or Azure
2+ years of experience with key services provided by Azure and/or GCP such as BigQuery, Vertex AI, Dataproc, Functions, AKS, Service Fabric
2+ years working in a globally distributed team to provide innovative and robust cloud centric solutions
2+ years gathering and analyzing data to diagnose the root cause of cloud workload issues, recommending and implementing solutions to resolve issues in a timely manner

Job Responsibility

Lead or participate in managing all installed systems and infrastructure within the Systems Operations functional area
Contribute to increasing system efficiencies and lowering the human intervention time on related tasks
Review and analyze moderately complex operational support systems, application software, and system management tools to ensure the highest levels of systems and infrastructure availability
Work with vendors and other technical personnel for problem resolution
Lead team to meet technical deliverables while leveraging solid understanding of technical process controls or standards
Collaborate with vendors and other technical personnel to resolve technical issues and achieve highest levels of systems and infrastructure availability

Fulltime

Senior Support Engineer

The Technical Support team is responsible for ensuring that developers and enter...

Location

United States , San Francisco

Salary:

234000.00 - 260000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Have a Bachelor’s degree in Computer Science or a related field
Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments
Have deep familiarity with modern monitoring, alerting, and observability practices
Have proven experience leading incident response for high‑severity outages or service disruptions
Have strong skills in scripting or software engineering (e.g., Python or similar) to automate repetitive tasks and integrate tools
Have solid understanding of cloud infrastructure and distributed systems fundamentals
Are effective at working cross‑functionally in a high‑trust environment
Strong communication skills to explain technical issues and resolutions to both engineering and non‑technical stakeholders

Job Responsibility

Be among the foremost technical and troubleshooting experts for our API platform at OpenAI
Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies
Configure and use advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time
In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates
Design and refine incident response processes and documentation across strategic customers, engineering and support teams
Analyze operational metrics and incident RCAs to identify areas for improvement
Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows
Provide support coverage during holidays and weekends based on business needs

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Fulltime

API Production Support Engineer - Officer

At Citi, we’re passionate about building and maintaining highly reliable APIs th...

Location

Canada , Mississauga

Salary:

79320.00 - 110680.00 USD / Year

Citi

Expiration Date

Until further notice

Requirements

Extensive experience supporting Java and J2EE based applications
Deep technical knowledge and hands-on experience supporting and troubleshooting environments including AWS, ECS, Oracle DB, and Mongo DB
A strong understanding and practical application of SRE concepts, particularly in defining and measuring SLIs, SLOs and Error Budgets
Demonstrated experience in building and utilizing comprehensive monitoring solutions such as AppDynamics, Splunk, Kibana to proactively alert on production API-related issues and ensure system health
In-depth knowledge and hands-on experience with API Gateway technologies, specifically APIGEE, and CDN solutions like Akamai
Proven ability to proactively identify and address problems, areas for improvement, and performance bottlenecks within complex API ecosystems using software-based solutions
Strong coding experience beyond simple scripts, preferably in Java or Python, for automation and internal tool development
Bachelor’s/University degree in Computer Science, Engineering, or a related field, or equivalent experience

Job Responsibility

Champion stability initiatives to enable high availability and resilience for our API applications, including enhancing monitoring, failover mechanisms, and overall system health
Demonstrate calm and analytical capabilities when faced with major incidents on critical API systems, ensuring effective incident, problem, and change management at a global enterprise level
Perform proactive monitoring and management of production API environments, taking a holistic view of system health and performance
Drive the definition, analysis, and reporting of SLIs and SLOs for all supported APIs and clients, ensuring clear performance benchmarks
Contribute to the development and implementation of tools and systems designed to enhance API operational management and the client experience
Measure and optimize API system performance, always pushing capabilities forward, anticipating customer needs, and innovating for continuous improvement
Provide hands-on expert operational support for critical, large-scale distributed API ecosystems
Actively gather and analyze performance metrics from API platforms and underlying infrastructure to assist in performance tuning, fault finding, and capacity planning
Partner closely with API development teams to improve services through rigorous operational feedback loops, testing, and release procedures
Drive the creation of sustainable API operational systems and services through automation and continuous uplifts, including developing, testing, and debugging automated tasks

Fulltime

Cloud Engineer (SRE)

We're looking for a Cloud Engineer (Software Reliability Engineer) to join the g...

Location

United Kingdom , London

Salary:

Not provided

Helpcare AI

Expiration Date

Until further notice

Requirements

Amazon Web Services (AWS)
Microsoft Azure
Go
Google Cloud
Kubernetes
PostgreSQL
3+ years of experience hosting Postgres (not just using Postgres)
4-7 years of experience working with AWS/GCP and Azure
Experience building games or tinkering with game engines at a deep level
Experience with databases and data design, especially PostgreSQL databases

Job Responsibility

Manage in-cluster Postgres instances
Build, operationalise and maintain the infrastructure on Kubernetes
Build Heroic Cloud services (in Go)
Join the on-call rota for potential callouts
Recognize patterns among user issues and suggest ways to improve our product and offerings
Partner with Documentation, Product, Sales, and Engineering teams to guide improvements
Help develop and iterate on our support processes and systems

What we offer

Competitive salary
Unlimited vacation policy
Company requires you to take at least 2 weeks off each year (and observe local holidays)
At least yearly company all-hands and getaways
Pick your own equipment

Fulltime

Lead Software Engineer - SRE

Wells Fargo is seeking a Lead Site Reliability Engineer (SRE) to join the WIMT P...

Location

United States , Charlotte; Saint Louis

Salary:

119000.00 - 187000.00 USD / Year

Wells Fargo

Expiration Date

Until further notice

Requirements

5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
5+ years of experience leading observability and monitoring tooling - Splunk, AppDynamics, Splunk Observability, Grafana, OpenTelemetry
5+ years in infrastructure (Windows and Linux) support
5+ years proven success in toil reduction initiatives
5+ years in cloud application management especially OpenShift Container Platform

Job Responsibility

Design and implement scalability, reliability, and observability strategies for cloud and on-premise environments
Define SLIs (Service Level Indicators), SLOs (Service Level Objectives), and Error Budgets to improve system reliability
Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
Drive adoption of NFRs, best practices, quality, and compliance across observability and performance engineering
Ensure high availability and performance of production systems through proactive monitoring and incident response
Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
Lead projects, teams, or serve as a peer mentor

What we offer

Health benefits
401(k) Plan
Paid time off
Disability benefits
Life insurance, critical illness insurance, and accident insurance
Parental leave
Critical caregiving leave
Discounts and savings
Commuter benefits
Tuition reimbursement

Fulltime

Technical Support Engineer

Our client is a globally connected technology organization offering cloud-based ...

Location

Turkey , İstanbul

Salary:

Not provided

SET Europa

Expiration Date

Until further notice

Requirements

Experience: 3-5+ years of experience in Cloud Computing (IaaS/PaaS/SaaS), DevOps, or Enterprise Architecture
Proven Track Record: Experience in supporting Fortune 500 or large-scale enterprise customers
Project Leadership: Demonstrated ability to lead complex cloud migration projects or large-scale system troubleshooting under high pressure
Certification (Highly Preferred): Alibaba Cloud ACP (Professional) or ACE (Expert) level. Equivalent certifications like AWS Professional/Specialty, Azure Solutions Architect, or Google Cloud Professional Architect
Infrastructure Mastery: Deep understanding of Linux/Windows kernel tuning and performance optimization
Advanced Networking: Expert knowledge in VPC, BGP, VPN, Express Connect (Direct Connect), and SD-WAN. Ability to analyze packet loss/latency using tools like Wireshark/Tcpdump at a professional level
Database & Big Data: Not just 'familiar,' but capable of performance tuning and migration for at least two engines (e.g., MySQL AND Redis/MongoDB)
Cloud-Native & Modern Tech: Proficiency in Containerization (Docker/Kubernetes) and Microservices
Hands-on experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation
Automation & Scripting: Strong ability to automate repetitive tasks using Python, Go, or Shell to improve support efficiency (SRE mindset)

Job Responsibility

Complex Incident Management: Beyond daily consulting, act as the final escalation point for L1 issues. Lead the troubleshooting of high-priority (P0/P1) incidents involving complex hybrid cloud architectures
Product & Engineering Synergy: Not just 'tracking' bugs, but providing deep-dive technical insights to R&D teams. Influence the product roadmap by identifying systemic architectural flaws and proposing optimization solutions
Customer Success & Risk Mitigation: Conduct proactive technical audits and architectural reviews for Key Accounts (KA). Use diagnostic tools not just to 'avoid risks' but to design high-availability (HA) and disaster recovery (DR) strategies
Knowledge Empowerment: Create and maintain high-quality technical documentation, troubleshooting playbooks, and internal Knowledge Base (KB) articles to improve the overall team's technical capability.

Fulltime

DevOps SRE Engineer

We are looking for a mid-senior SRE/DevOps Engineer (5–8 years) to build and sca...

Location

India , Bengaluru

Salary:

Not provided

Acuver Consulting

Expiration Date

Until further notice

Requirements

5–8 years of experience in DevOps / SRE roles
Strong hands-on experience with AWS (preferred) and/or GCP
Expertise in: Kubernetes & Docker; Terraform (Infrastructure as Code); CI/CD tools (GitLab, Jenkins, or similar)
Experience with: event-driven / asynchronous architectures (Kafka, Pub/Sub, etc.); monitoring & logging tools (Prometheus, Grafana, ELK, etc.); microservices and distributed systems
Solid understanding of: networking, load balancing, scaling strategies; high availability and fault-tolerant systems

Job Responsibility

Design and implement robust CI/CD pipelines (GitLab CI, Jenkins, or similar)
Enable automated build, test, and deployment workflows
Implement blue-green / canary deployments for zero-downtime releases
Ensure release traceability, rollback mechanisms, and deployment governance
Design, provision, and manage infrastructure on AWS (primary) and/or GCP
Build infrastructure using Infrastructure as Code (Terraform preferred)
Create reusable modules for scalable, secure, and standardized environments
Optimize cost, performance, and scalability of cloud resources
Deploy and manage applications using Docker & Kubernetes
Manage Kubernetes workloads using Helm charts

Fulltime
