
SRE Production Support

United States, Livonia · Job Posted December 11, 2025

Job Description

We’re passionate about building software that solves problems. We count on our site reliability engineers (SREs) to empower users with a rich feature set, high availability, and stellar performance so they can pursue their missions. As we expand customer deployments, we’re seeking an experienced SRE to deliver insights from massive-scale data in real time. Specifically, we’re searching for someone who has fresh ideas and a unique viewpoint, and who enjoys collaborating with a cross-functional team to develop real-world solutions and positive user experiences for every interaction.

Job Responsibility

  • Monitoring and reporting on application behavior analytics; conducting smart triage by identifying, diagnosing, and coordinating resolution of performance problems before they impact end users; and participating in rapid root cause diagnosis of problems occurring within the application and infrastructure
  • Identifying the functional domain in which problems reside (server utilization, network saturation, application tuning)
  • Participating in all Major Incident Management and Root Cause Analysis calls and providing expert troubleshooting support as needed
  • Understanding troubleshooting, incidents, and problems; working to resolve issues in a timely manner and determine the fault or underlying issue; working with both customer and vendor personnel
  • Monitoring high-value business-centric transactions and managing response actions
  • Maintaining accurate documentation for the assigned workspace and procedures, and updating procedures including, but not limited to, software and hardware layers
  • Understanding and using de-escalation techniques when working with difficult customers
  • Monitoring application infrastructure and network through monitoring tools such as Splunk, AppDynamics, and Dynatrace
  • Proactively detecting, reporting, logging, and responding to all network performance and availability problems in each part of the application
  • Following incident, problem, and change management processes related to the technology infrastructure being supported; reviewing system requirements and application dependencies to determine monitoring configuration
  • Contributing to the creation of documentation

Requirements

  • Master’s degree in Computer Science or related discipline
  • Ability to program (structured and OOP) using one or more high-level languages, such as Python, Java, C/C++, Ruby, and JavaScript
  • 5 to 6 years of experience in Production Support
  • Minimum of 6 years of professional experience in SRE Production Support
  • Experience with NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, YARN)
  • Must provide 24×7 support for the production servers on a rotating basis


Similar Jobs for SRE Production Support

8 matching positions

Production Support Developer - Trading Technology

At Schwab, you’re empowered to make an impact on your career. Here, innovative t...
Location: United States, Omaha, NE; Austin, TX
Salary: 107000.00 - 135000.00 USD / Year
Charles Schwab
Expiration Date: June 28, 2026
Requirements
  • Bachelor’s degree in Computer Science or a related field, or equivalent practical experience
  • 3+ years of experience in production support, site reliability engineering (SRE), or software operations
  • Working knowledge of Java (Java 17+ preferred) and SQL for troubleshooting
  • Experience supporting applications using observability and monitoring tools such as AppDynamics, Splunk, Grafana, InfluxDB, and Control‑M
  • Oracle Database experience with SQL
  • 3+ years of experience administering Linux systems (RHEL 7/8/9 preferred)
  • Ability to use shell scripting or Python to automate repetitive operational tasks
  • Strong communication skills, particularly during incident response and post‑incident reviews
  • Availability to work nights and weekends as part of a rotating on‑call schedule
Job Responsibility
  • Safeguard the stability and resilience of Schwab’s Order Management System in a high-availability environment
  • Own complex production situations—from assessing impact and leading incident response to collaborating across application, infrastructure, database, and vendor partners to restore service
  • Focus on continuous improvement by identifying patterns, reducing recurring issues, and strengthening monitoring, runbooks, and operational practices that improve availability over time
What we offer
  • 401(k) with company match and Employee stock purchase plan
  • Paid time for vacation, volunteering, and 28-day sabbatical after every 5 years of service for eligible positions
  • Paid parental leave and family building benefits
  • Tuition reimbursement
  • Health, dental, and vision insurance
  • Fulltime

Production Support Engineer

Robert Half is actively partnering with an Austin-based client to identify a pro...
Location: United States, Austin
Salary: Not provided
Robert Half
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in Production Support, Site Reliability Engineering, or similar roles within a SaaS environment
  • Proven experience leading or mentoring technical support or SRE teams
  • Extensive understanding of SDLC and production support frameworks
  • Experience working with databases, APIs, and web-based systems
  • Advanced ability to write and analyze complex SQL queries
  • Hands-on experience with monitoring, logging, and debugging tools (e.g., log analytics, diagnostics platforms)
  • Proficient troubleshooting and investigative skills in live production environments
  • Familiarity with cloud platforms (Azure preferred) and PaaS environments
  • Experience with automation and scripting for operational efficiency
  • Excellent communication skills and a strong sense of ownership
Job Responsibility
  • Oversee the health, performance, and uptime of a cloud-based application platform
  • Act as a player/coach, actively contributing to issue resolution while guiding the team
  • Lead incident response and resolution efforts, driving improved SLAs and faster recovery times
  • Perform root cause analysis (RCA) and implement preventative solutions
  • Collaborate with engineering, QA, and operations teams to support smooth releases and deployments
  • Identify and implement improvements to system reliability, monitoring, and support processes
  • Maintain a strong customer-first mindset with accountability for production stability
  • Develop and support team members through coaching, mentoring, and regular feedback
  • Create and maintain technical documentation, including SOPs, runbooks, and incident reports
  • Stay current on best practices related to cloud platforms, observability, and production support
What we offer
  • Healthcare (medical, dental, and vision plans)
  • 401(k) and retirement plans
  • Commuter benefits
  • Employee and vendor discounts
  • Employee Assistance Program (EAP)
  • Fulltime

API Production Support Engineer - Officer

At Citi, we’re passionate about building and maintaining highly reliable APIs th...
Location: Canada, Mississauga
Salary: 79320.00 - 110680.00 USD / Year
Citi
Expiration Date: Until further notice
Requirements
  • Extensive experience supporting Java and J2EE based applications
  • Deep technical knowledge and hands-on experience supporting and troubleshooting environments including AWS, ECS, Oracle DB, and MongoDB
  • A strong understanding and practical application of SRE concepts, particularly in defining and measuring SLIs, SLOs and Error Budgets
  • Demonstrated experience in building and utilizing comprehensive monitoring solutions such as AppDynamics, Splunk, Kibana to proactively alert on production API-related issues and ensure system health
  • In-depth knowledge and hands-on experience with API Gateway technologies, specifically APIGEE, and CDN solutions like Akamai
  • Proven ability to proactively identify and address problems, areas for improvement, and performance bottlenecks within complex API ecosystems using software-based solutions
  • Strong coding experience beyond simple scripts, preferably in Java or Python, for automation and internal tool development
  • Bachelor’s or university degree in Computer Science, Engineering, or a related field, or equivalent experience
Job Responsibility
  • Champion stability initiatives to enable high availability and resilience for our API applications, including enhancing monitoring, failover mechanisms, and overall system health
  • Demonstrate calm and analytical capabilities when faced with major incidents on critical API systems, ensuring effective incident, problem, and change management at a global enterprise level
  • Perform proactive monitoring and management of production API environments, taking a holistic view of system health and performance
  • Drive the definition, analysis, and reporting of SLIs and SLOs for all supported APIs and clients, ensuring clear performance benchmarks
  • Contribute to the development and implementation of tools and systems designed to enhance API operational management and the client experience
  • Measure and optimize API system performance, always pushing capabilities forward, anticipating customer needs, and innovating for continuous improvement
  • Provide hands-on expert operational support for critical, large-scale distributed API ecosystems
  • Actively gather and analyze performance metrics from API platforms and underlying infrastructure to assist in performance tuning, fault finding, and capacity planning
  • Partner closely with API development teams to improve services through rigorous operational feedback loops, testing, and release procedures
  • Drive the creation of sustainable API operational systems and services through automation and continuous uplifts, including developing, testing, and debugging automated tasks
  • Fulltime

Equities Electronic Trading Support / SRE

Embark on a transformative journey as an Equities Electronic Trading Support/SRE...
Location: United States, New York
Salary: 120000.00 - 175000.00 USD / Year
Barclays
Expiration Date: Until further notice
Requirements
  • Working in Unix/Linux environments for production support, troubleshooting, and performance analysis
  • Writing scripts in Bash and Python to automate operational tasks and improve system reliability
  • Supporting containerized applications using Kubernetes and Docker in production environments
  • Supporting electronic trading systems that use FIX messaging in low-latency environments
Job Responsibility
  • Development and delivery of high-quality software solutions by using industry aligned programming languages, frameworks, and tools. Ensuring that code is scalable, maintainable, and optimized for performance
  • Cross-functional collaboration with product managers, designers, and other engineers to define software requirements, devise solution strategies, and ensure seamless integration and alignment with business objectives
  • Collaboration with peers, participation in code reviews, and promotion of a culture of code quality and knowledge sharing
  • Stay informed of industry technology trends and innovations and actively contribute to the organization’s technology communities to foster a culture of technical excellence and growth
  • Adherence to secure coding practices to mitigate vulnerabilities, protect sensitive data, and ensure secure software solutions
  • Implementation of effective unit testing practices to ensure proper code design, readability, and reliability
What we offer
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
  • Fulltime

Credit Risk Support Lead - SRE

Join Barclays as a Credit Risk Support Lead- SRE role, where to effectively moni...
Location: India, Pune
Salary: Not provided
Barclays
Expiration Date: Until further notice
Requirements
  • 14+ years’ experience in production support
  • High energy, hands-on and results & goal-oriented
  • Expertise in log debugging, root cause analysis and troubleshooting live issues
  • Experience with observability tools such as ESaaS, AppD/ITRS, and Netcool
  • Experience in data analysis to identify underlying themes impacting stability, performance, and customer experience
  • Ensures and promotes ITIL best practices for Incident, Problem, Change, and Release management (including managing and running triages, conducting root cause analysis, post-incident reviews, etc.)
  • Strong Credit Risk business knowledge
  • Negotiate SLAs/OLAs with customer and other support elements
  • Business (IT) Continuity Management
  • KPI reporting and monitoring
Job Responsibility
  • Provision of technical support for the service management function to resolve more complex issues for a specific client or group of clients. Develop the support model and service offering to improve the service to customers and stakeholders.
  • Execution of preventative maintenance tasks on hardware and software and utilisation of monitoring tools/metrics to identify, prevent and address potential issues and ensure optimal performance.
  • Maintenance of a knowledge base containing detailed documentation of resolved cases for future reference, self-service opportunities and knowledge sharing.
  • Analysis of system logs, error messages and user reports to identify the root causes of hardware, software and network issues, and providing a resolution to these issues by fixing or replacing faulty hardware components, reinstalling software, or applying configuration changes.
  • Automation, monitoring enhancements, capacity management, resiliency, business continuity management, front office specific support and stakeholder management.
  • Identification and remediation or raising, through appropriate process, of potential service impacting risks and issues.
  • Proactively assess support activities implementing automations where appropriate to maintain stability and drive efficiency. Actively tune monitoring tools, thresholds, and alerting to ensure issues are known when they occur.
What we offer
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
  • Fulltime

SRE Design & Support Engineer

We are looking for a self-driven, software engineering mindset SRE engineer to •...
Location: India, Hyderabad
Salary: Not provided
PepsiCo
Expiration Date: Until further notice
Requirements
  • 8-11 years of work experience evolving to an SRE engineer
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
  • The ideal engineer will be highly quantitative, have great judgment, be able to connect dots across ecosystems, and work efficiently and cross-functionally across teams to ensure SRE orchestration solutions are meeting customer/end-user expectations
  • The candidate will take a pragmatic approach to resolving incidents, including the ability to systematically triangulate root causes and work effectively with external and internal teams to meet objectives
  • Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes, with a track record of improving service offerings: proactively resolving incidents, providing a seamless customer/end-user experience, and proactively identifying and mitigating areas of risk
  • Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka, and any SRE Ops toolsets
  • A firm understanding of cloud architecture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
Job Responsibility
  • Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing events, logging, monitoring, and observability standards across applications
  • Ensure non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early in the product’s offerings as part of the engineering solution
  • Act as a proactive SRE support engineer: prevent P1, P2, and potential P3 incidents, diagnose anomalies before they reach any user, and drive the necessary remediations across the teams involved in end-to-end availability, performance, and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
  • Collaborate with Engineering and support teams, including participation in escalations and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose, and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. the technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
  • Continuously optimize the L2/support operations work via SRE workflow automation
  • Shape the SRE orchestration platform design with inputs from Production Operations, Business usage & Product and engineering teams
  • Actively engage and drive AI Ops adoption across teams

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...
Location: Mexico, Miguel Hidalgo
Salary: Not provided
PepsiCo
Expiration Date: Until further notice
Requirements
  • 8+ years of work experience evolving to an SRE engineer
  • 3-5 years of experience in continuously improving and transforming IT operations ways of working
  • Bachelor’s degree in Computer Science, Information Technology or a related field
  • Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
  • Highly quantitative, with great judgment, able to connect dots across ecosystems and work efficiently and cross-functionally across teams
  • Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
  • Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka, and any SRE Ops toolsets
  • A firm understanding of cloud architecture for distributed environments
  • Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
  • Back-end technologies: Server-side languages (Java, Spring Boot, and related technologies that build the server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase)
Job Responsibility
  • Drive new shift-left activities critical to applying Site Reliability Engineering (SRE) and quality assurance principles within the application design / project roadmap that enables resilient outcomes
  • Apply a pre-emptive approach in production to minimize business impact, using SRE-driven orchestration to connect all components of the ecosystem, diagnosing anomalies before they reach users and remediating through automation
  • Ensure ecosystem availability and performance in production environments, proactively preventing P1, P2, and potential P3 incidents
  • Engage and influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining and enforcing events, logging, monitoring, and observability standards across applications
  • Accountable for ensuring non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early in the product’s offerings as part of the engineering solution
  • Lead the team in diagnosing anomalies before they reach any user and in driving the necessary remediations across the teams involved in end-to-end availability, performance, and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
  • Collaborate with Engineering and support teams, including participation in escalations and blameless postmortems
  • Work closely with customer-facing support teams to empower them with SRE insights and tooling
  • Observe, diagnose, and improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. the technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
  • Continuously optimize the L2/support operations work via SRE workflow automation
What we offer
  • Opportunities to learn and develop every day through a wide range of programs
  • Internal digital platforms that promote self-learning
  • Development programs according to Leadership skills
  • Specialized training according to the role
  • Learning experiences with internal and external providers
  • Recognition programs for seniority, behavior, leadership, moments of life, among others
  • Financial wellness programs that will help you reach your goals in all stages of life
  • A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
  • Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...
Location: United States, Santa Clara
Salary: 151600.00 - 245300.00 USD / Year
Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
  • Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
  • Proficient in Python and/or Go
  • Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
  • Experience in Production Engineering, DevOps, or Site Reliability
  • Expertise in the public cloud (GCP or AWS), especially in GCP
  • Strong Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
  • Experience with CI/CD pipelines, GitLab, and GitHub preferred
  • Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build, and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate the deployment of robust services
  • Orchestrate end-to-end monitoring and alerting
  • Participate with SRE and Dev teams in the on-call rotation
  • Lead root cause analysis of critical business and production issues
  • Fulltime