
SRE Design & Support Engineer

India, Hyderabad · Job Posted January 15, 2026

Job Description

We are looking for a self-driven SRE engineer with a software engineering mindset to:

  • Drive new shift-left activities that apply Site Reliability Engineering (SRE) and quality assurance principles within the application design and project roadmap to enable resilient outcomes
  • Apply a pre-emptive approach in production that minimizes business impact, using SRE-driven orchestration to connect all components of the ecosystem, diagnose anomalies before users are affected, and remediate through automation

The SRE design & support engineer is an integral part of the global team, whose main purpose is to provide a delightful customer experience for users of the global consumer, commercial, supply chain and enablement functions in the PepsiCo digital products application portfolio of 260+ applications, enabling a full SRE practice built on an incident prevention / proactive resolution model. The scope of this role is focused on full stack development of cloud-architecture applications, B2B pepsiconnect, Direct to Customer and other S&T roadmap applications. The engineer ensures that PepsiCo DPA applications deliver the service performance, reliability and availability expected by our customers and internal groups. The role requires a blend of technical expertise in SRE tools, modern cloud application architecture (i.e. full stack), IT operations experience, and analytics and influencing skills.

Job Responsibility

  • Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
  • Ensure non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early into the product’s offerings as part of the engineering solution
  • Act as a proactive SRE support engineer: prevent P1, P2 and potential P3 incidents, diagnose anomalies before they reach any user, and drive the necessary remediations across the teams involved in the end-to-end availability, performance and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
  • Collaborate with engineering and support teams, including participation in escalations and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose and improve the end-to-end ecosystem performance of the modern-architecture application portfolio, i.e. a technical “understanding of interactions” within a full stack application, alongside peer SRE team members
  • Continuously optimize the L2/support operations work via SRE workflow automation
  • Shape the SRE orchestration platform design with inputs from Production Operations, Business usage & Product and engineering teams
  • Actively engage and drive AI Ops adoption across teams

Requirements

  • 8-11 years of work experience progressing into an SRE engineer role
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
  • The ideal engineer will be highly quantitative, have great judgment, be able to connect dots across ecosystems, and work efficiently and cross-functionally across teams to ensure SRE orchestration solutions meet customer/end-user expectations
  • The candidate will take a pragmatic approach to resolving incidents, including the ability to systemically triangulate root causes and work effectively with external and internal teams to meet objectives
  • Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes, with a track record of improving service offerings: proactively resolving incidents, providing a seamless customer/end-user experience, and proactively identifying and mitigating areas of risk
  • Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka and any SRE Ops toolsets
  • A firm understanding of cloud architecture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
  • Back-end technologies: server-side languages and frameworks (Java, Spring Boot, and related technologies) used to build server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase
  • Infrastructure: Azure/AWS cloud platforms and/or Client / server environments

Nice to have

Prior experience in shaping transformation by developing SRE solutions would be a plus


Similar Jobs for

SRE Design & Support Engineer

8 matching positions

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...
Location: Mexico, Miguel Hidalgo
Salary: Not provided
Company: PepsiCo (pepsico.com)
Expiration Date: Until further notice
Requirements
  • 8+ years of work experience progressing into an SRE engineer role
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
  • Highly quantitative, with great judgment, able to connect dots across ecosystems and work efficiently and cross-functionally across teams
  • Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
  • Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka and any SRE Ops toolsets
  • A firm understanding of cloud architecture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
  • Back-end technologies: server-side languages and frameworks (Java, Spring Boot, and related technologies) used to build server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase
Job Responsibility
  • Drive new shift-left activities that apply Site Reliability Engineering (SRE) and quality assurance principles within the application design and project roadmap to enable resilient outcomes
  • Apply a pre-emptive approach in production that minimizes business impact, using SRE-driven orchestration to connect all components of the ecosystem, diagnose anomalies before users are affected, and remediate through automation
  • Ensure ecosystem availability and performance in production environments, proactively preventing P1, P2 and potential P3 incidents
  • Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing event, logging, monitoring, and observability standards across applications
  • Be accountable for ensuring non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early into the product’s offerings as part of the engineering solution
  • Lead the team in diagnosing anomalies before they reach any user and in driving the necessary remediations across the teams involved in the end-to-end availability, performance and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
  • Collaborates with Engineering & support teams, including participation in escalations, and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose and improve the end-to-end ecosystem performance of the modern-architecture application portfolio, i.e. a technical “understanding of interactions” within a full stack application, alongside peer SRE team members
  • Continuously optimize the L2/support operations work via SRE workflow automation
What we offer
  • Opportunities to learn and develop every day through a wide range of programs
  • Internal digital platforms that promote self-learning
  • Development programs according to Leadership skills
  • Specialized training according to the role
  • Learning experiences with internal and external providers
  • Recognition programs for seniority, behavior, leadership, moments of life, among others
  • Financial wellness programs that will help you reach your goals in all stages of life
  • A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
  • Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Senior Systems Operations Engineer - SRE and AIOps

Wells Fargo is seeking a Senior Systems Operations Engineer within the Enterpris...
Location: India, Hyderabad
Salary: Not provided
Company: Wells Fargo (https://www.wellsfargo.com/)
Expiration Date: June 22, 2026
Requirements
  • 4+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • Strong Java / backend service development experience
  • Distributed systems and API-based service design
  • CI/CD pipelines and Git-based workflows
  • 3+ years of experience with scripting and infrastructure automation using Terraform
  • 3+ years of hands-on experience with OpenShift, GCP or Azure platform enablement and application migrations, build out of complex infrastructure programmable patterns using Infrastructure as Code (IaC)
  • 2+ years of knowledge and understanding of Cloud service offerings such as data, analytics, AI/ML on GCP or Azure
  • 2+ years of experience with key services provided by Azure and/or GCP such as BigQuery, Vertex AI, DataProc, Functions, AKS, Service Fabric
  • 2+ years working in a globally distributed team to provide innovative and robust cloud centric solutions
  • 2+ years gathering and analyzing data to diagnose the root cause of cloud workload issues, recommending and implementing solutions to resolve issues in timely manner
Job Responsibility
  • Lead or participate in managing all installed systems and infrastructure within the Systems Operations functional area
  • Contribute to increasing system efficiencies and lowering the human intervention time on related tasks
  • Review and analyze moderately complex operational support systems, application software, and system management tools to ensure the highest levels of systems and infrastructure availability
  • Work with vendors and other technical personnel for problem resolution
  • Lead team to meet technical deliverables while leveraging solid understanding of technical process controls or standards
  • Collaborate with vendors and other technical personnel to resolve technical issues and achieve highest levels of systems and infrastructure availability
Employment type: Full-time

Senior Support Engineer

The Technical Support team is responsible for ensuring that developers and enter...
Location: United States, San Francisco
Salary: 234000.00 - 260000.00 USD / Year
Company: OpenAI (openai.com)
Expiration Date: Until further notice
Requirements
  • Have a Bachelor’s degree in Computer Science or a related field
  • Have 8+ years of experience in technical operations roles such as SRE/NOC, designing monitoring systems and resolving production issues in fast-paced and mission-critical environments
  • Have deep familiarity with modern monitoring, alerting, and observability practices
  • Have proven experience leading incident response for high‑severity outages or service disruptions
  • Have strong skills in scripting or software engineering (e.g., Python or similar) to automate repetitive tasks and integrate tools
  • Have solid understanding of cloud infrastructure and distributed systems fundamentals
  • Are effective at working cross‑functionally in a high‑trust environment
  • Strong communication skills to explain technical issues and resolutions to both engineering and non‑technical stakeholders
Job Responsibility
  • Be among the foremost technical and troubleshooting experts for our API platform at OpenAI
  • Proactively identify and implement opportunities to scale support operations by leveraging automation and advancements in AI technologies
  • Configure and use advanced monitoring and alerting workflows to proactively detect customer impacting issues in real time
  • In partnership with engineering, contribute to reliability reviews and preparedness for new features, launches, or strategic customer requirement updates
  • Design and refine incident response processes and documentation across strategic customers, engineering and support teams
  • Analyze operational metrics and incident RCAs to identify areas for improvement
  • Proactively recommend and implement enhancements to monitoring dashboards, alert configurations, and support workflows
  • Provide support coverage during holidays and weekends based on business needs
What we offer
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment type: Full-time

API Production Support Engineer - Officer

At Citi, we’re passionate about building and maintaining highly reliable APIs th...
Location: Canada, Mississauga
Salary: 79320.00 - 110680.00 USD / Year
Company: Citi (https://www.citi.com/)
Expiration Date: Until further notice
Requirements
  • Extensive experience supporting Java and J2EE based applications
  • Deep technical knowledge and hands-on experience supporting and troubleshooting environments including AWS, ECS, Oracle DB, and Mongo DB
  • A strong understanding and practical application of SRE concepts, particularly in defining and measuring SLIs, SLOs and Error Budgets
  • Demonstrated experience in building and utilizing comprehensive monitoring solutions such as AppDynamics, Splunk, Kibana to proactively alert on production API-related issues and ensure system health
  • In-depth knowledge and hands-on experience with API Gateway technologies, specifically APIGEE, and CDN solutions like Akamai
  • Proven ability to proactively identify and address problems, areas for improvement, and performance bottlenecks within complex API ecosystems using software-based solutions
  • Strong coding experience beyond simple scripts, preferably in Java or Python, for automation and internal tool development
  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience
Job Responsibility
  • Champion stability initiatives to enable high availability and resilience for our API applications, including enhancing monitoring, failover mechanisms, and overall system health
  • Demonstrate calm and analytical capabilities when faced with major incidents on critical API systems, ensuring effective incident, problem, and change management at a global enterprise level
  • Perform proactive monitoring and management of production API environments, taking a holistic view of system health and performance
  • Drive the definition, analysis, and reporting of SLIs and SLOs for all supported APIs and clients, ensuring clear performance benchmarks
  • Contribute to the development and implementation of tools and systems designed to enhance API operational management and the client experience
  • Measure and optimize API system performance, always pushing capabilities forward, anticipating customer needs, and innovating for continuous improvement
  • Provide hands-on expert operational support for critical, large-scale distributed API ecosystems
  • Actively gather and analyze performance metrics from API platforms and underlying infrastructure to assist in performance tuning, fault finding, and capacity planning
  • Partner closely with API development teams to improve services through rigorous operational feedback loops, testing, and release procedures
  • Drive the creation of sustainable API operational systems and services through automation and continuous uplifts, including developing, testing, and debugging automated tasks
Employment type: Full-time

Cloud Engineer (SRE)

We're looking for a Cloud Engineer (Software Reliability Engineer) to join the g...
Location: United Kingdom, London
Salary: Not provided
Company: Helpcare AI (helpcare.ai)
Expiration Date: Until further notice
Requirements
  • Amazon Web Services (AWS)
  • Microsoft Azure
  • Go
  • Google Cloud
  • Kubernetes
  • PostgreSQL
  • 3+ years of experience hosting Postgres (not just using Postgres)
  • 4-7 years of experience working with AWS/GCP and Azure
  • Experience building games or tinkering with game engines at a deep level
  • Experience with databases and data design, especially PostgreSQL databases
Job Responsibility
  • Manage in-cluster Postgres instances
  • Build, operationalise and maintain the infrastructure on Kubernetes
  • Build Heroic Cloud services (in Go)
  • Join the on-call rota for potential callouts
  • Recognize patterns among user issues and suggest ways to improve our product and offerings
  • Partner with Documentation, Product, Sales, and Engineering teams to guide improvements
  • Help develop and iterate on our support processes and systems
What we offer
  • Competitive salary
  • Unlimited vacation policy
  • Company requires you to take at least 2 weeks off each year (and observe local holidays)
  • At least yearly company all-hands and getaways
  • Pick your own equipment
Employment type: Full-time

Lead Software Engineer - SRE

Wells Fargo is seeking a Lead Site Reliability Engineer (SRE) to join the WIMT P...
Location: United States, Charlotte; Saint Louis
Salary: 119000.00 - 187000.00 USD / Year
Company: Wells Fargo (https://www.wellsfargo.com/)
Expiration Date: Until further notice
Requirements
  • 5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of experience leading observability and monitoring tooling: Splunk, AppDynamics, Splunk Observability, Grafana, OpenTelemetry
  • 5+ years in infrastructure (Windows and Linux) support
  • 5+ years proven success in toil reduction initiatives
  • 5+ years in cloud application management especially OpenShift Container Platform
Job Responsibility
  • Design and implement scalability, reliability, and observability strategies for cloud and on-premise environments
  • Define SLIs (Service Level Indicators), SLOs (Service Level Objectives), and Error Budgets to improve system reliability
  • Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
  • Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
  • Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
  • Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
  • Drive adoption of NFRs, best practices, quality and compliance across observability and performance engineering
  • Ensure high availability and performance of production systems through proactive monitoring and incident response
  • Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
  • Lead projects, teams, or serve as a peer mentor
What we offer
  • Health benefits
  • 401(k) Plan
  • Paid time off
  • Disability benefits
  • Life insurance, critical illness insurance, and accident insurance
  • Parental leave
  • Critical caregiving leave
  • Discounts and savings
  • Commuter benefits
  • Tuition reimbursement
Employment type: Full-time

Technical Support Engineer

Our client is a globally connected technology organization offering cloud-based ...
Location: Turkey, İstanbul
Salary: Not provided
Company: SET Europa (seteuropa.com)
Expiration Date: Until further notice
Requirements
  • Experience: 3-5+ years of experience in Cloud Computing (IaaS/PaaS/SaaS), DevOps, or Enterprise Architecture
  • Proven Track Record: Experience in supporting Fortune 500 or large-scale enterprise customers
  • Project Leadership: Demonstrated ability to lead complex cloud migration projects or large-scale system troubleshooting under high pressure
  • Certification (Highly Preferred): Alibaba Cloud ACP (Professional) or ACE (Expert) level. Equivalent certifications like AWS Professional/Specialty, Azure Solutions Architect, or Google Cloud Professional Architect
  • Infrastructure Mastery: Deep understanding of Linux/Windows kernel tuning and performance optimization
  • Advanced Networking: Expert knowledge in VPC, BGP, VPN, Express Connect (Direct Connect), and SD-WAN. Ability to analyze packet loss/latency using tools like Wireshark/Tcpdump at a professional level
  • Database & Big Data: Not just 'familiar,' but capable of performance tuning and migration for at least two engines (e.g., MySQL AND Redis/MongoDB)
  • Cloud-Native & Modern Tech: Proficiency in Containerization (Docker/Kubernetes) and Microservices
  • Hands-on experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation
  • Automation & Scripting: Strong ability to automate repetitive tasks using Python, Go, or Shell to improve support efficiency (SRE mindset)
Job Responsibility
  • Complex Incident Management: Beyond daily consulting, act as the final escalation point for L1 issues. Lead the troubleshooting of high-priority (P0/P1) incidents involving complex hybrid cloud architectures
  • Product & Engineering Synergy: Not just 'tracking' bugs, but providing deep-dive technical insights to R&D teams. Influence the product roadmap by identifying systemic architectural flaws and proposing optimization solutions
  • Customer Success & Risk Mitigation: Conduct proactive technical audits and architectural reviews for Key Accounts (KA). Use diagnostic tools not just to 'avoid risks' but to design high-availability (HA) and disaster recovery (DR) strategies
  • Knowledge Empowerment: Create and maintain high-quality technical documentation, troubleshooting playbooks, and internal Knowledge Base (KB) articles to improve the overall team's technical capability.
Employment type: Full-time

DevOps SRE Engineer

We are looking for a mid-senior SRE/DevOps Engineer (5–8 years) to build and sca...
Location: India, Bengaluru
Salary: Not provided
Company: Acuver Consulting (acuverconsulting.com)
Expiration Date: Until further notice
Requirements
  • 5–8 years of experience in DevOps / SRE roles
  • Strong hands-on experience with AWS (preferred) and/or GCP
  • Expertise in: Kubernetes & Docker; Terraform (Infrastructure as Code); CI/CD tools (GitLab, Jenkins, or similar)
  • Experience with: event-driven / asynchronous architectures (Kafka, Pub/Sub, etc.); monitoring & logging tools (Prometheus, Grafana, ELK, etc.); microservices and distributed systems
  • Solid understanding of: networking, load balancing, scaling strategies; high availability and fault-tolerant systems
Job Responsibility
  • Design and implement robust CI/CD pipelines (GitLab CI, Jenkins, or similar)
  • Enable automated build, test, and deployment workflows
  • Implement blue-green / canary deployments for zero-downtime releases
  • Ensure release traceability, rollback mechanisms, and deployment governance
  • Design, provision, and manage infrastructure on AWS (primary) and/or GCP
  • Build infrastructure using Infrastructure as Code (Terraform preferred)
  • Create reusable modules for scalable, secure, and standardized environments
  • Optimize cost, performance, and scalability of cloud resources
  • Deploy and manage applications using Docker & Kubernetes
  • Manage Kubernetes workloads using Helm charts
Employment type: Full-time