SRE Production Support Job at Select Minds (Livonia)

Production Support Developer - Trading Technology

At Schwab, you’re empowered to make an impact on your career. Here, innovative t...

Location

United States , Omaha, NE ; Austin, TX

Salary:

107000.00 - 135000.00 USD / Year

Charles Schwab

Expiration Date

June 28, 2026

Requirements

Bachelor’s degree in Computer Science or a related field, or equivalent practical experience
3+ years of experience in production support, site reliability engineering (SRE), or software operations
Working knowledge of Java (Java 17+ preferred) and SQL for troubleshooting
Experience supporting applications using observability and monitoring tools such as AppDynamics, Splunk, Grafana, InfluxDB, and Control‑M
Oracle Database experience with SQL
3+ years of experience administering Linux systems (RHEL 7/8/9 preferred)
Ability to use shell scripting or Python to automate repetitive operational tasks
Strong communication skills, particularly during incident response and post‑incident reviews
Availability to work nights and weekends as part of a rotating on‑call schedule

Job Responsibility

Safeguard the stability and resilience of Schwab’s Order Management System in a high-availability environment
Own complex production situations—from assessing impact and leading incident response to collaborating across application, infrastructure, database, and vendor partners to restore service
Focus on continuous improvement by identifying patterns, reducing recurring issues, and strengthening monitoring, runbooks, and operational practices that improve availability over time

What we offer

401(k) with company match and Employee stock purchase plan
Paid time for vacation, volunteering, and 28-day sabbatical after every 5 years of service for eligible positions
Paid parental leave and family building benefits
Tuition reimbursement
Health, dental, and vision insurance

Fulltime


Production Support Engineer

Robert Half is actively partnering with an Austin-based client to identify a pro...

Location

United States , Austin

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

5+ years of experience in Production Support, Site Reliability Engineering, or similar roles within a SaaS environment
Proven experience leading or mentoring technical support or SRE teams
Extensive understanding of SDLC and production support frameworks
Experience working with databases, APIs, and web-based systems
Advanced ability to write and analyze complex SQL queries
Hands-on experience with monitoring, logging, and debugging tools (e.g., log analytics, diagnostics platforms)
Proficient troubleshooting and investigative skills in live production environments
Familiarity with cloud platforms (Azure preferred) and PaaS environments
Experience with automation and scripting for operational efficiency
Excellent communication skills and a strong sense of ownership

Job Responsibility

Oversee the health, performance, and uptime of a cloud-based application platform
Act as a player/coach, actively contributing to issue resolution while guiding the team
Lead incident response and resolution efforts, driving improved SLAs and faster recovery times
Perform root cause analysis (RCA) and implement preventative solutions
Collaborate with engineering, QA, and operations teams to support smooth releases and deployments
Identify and implement improvements to system reliability, monitoring, and support processes
Maintain a strong customer-first mindset with accountability for production stability
Develop and support team members through coaching, mentoring, and regular feedback
Create and maintain technical documentation, including SOPs, runbooks, and incident reports
Stay current on best practices related to cloud platforms, observability, and production support

What we offer

Healthcare (medical, dental, and vision plans)
401(k) and retirement plans
Commuter benefits
Employee and vendor discounts
Employee Assistance Program (EAP)

Fulltime

API Production Support Engineer - Officer

At Citi, we’re passionate about building and maintaining highly reliable APIs th...

Location

Canada , Mississauga

Salary:

79320.00 - 110680.00 USD / Year

Citi

Expiration Date

Until further notice

Requirements

Extensive experience supporting Java and J2EE based applications
Deep technical knowledge and hands-on experience supporting and troubleshooting environments including AWS, ECS, Oracle DB, and MongoDB
A strong understanding and practical application of SRE concepts, particularly in defining and measuring SLIs, SLOs and Error Budgets
Demonstrated experience in building and utilizing comprehensive monitoring solutions such as AppDynamics, Splunk, Kibana to proactively alert on production API-related issues and ensure system health
In-depth knowledge and hands-on experience with API Gateway technologies, specifically APIGEE, and CDN solutions like Akamai
Proven ability to proactively identify and address problems, areas for improvement, and performance bottlenecks within complex API ecosystems using software-based solutions
Strong coding experience beyond simple scripts, preferably in Java or Python, for automation and internal tool development
Bachelor’s/University degree in Computer Science, Engineering, or a related field, or equivalent experience

Job Responsibility

Champion stability initiatives to enable high availability and resilience for our API applications, including enhancing monitoring, failover mechanisms, and overall system health
Demonstrate calm and analytical capabilities when faced with major incidents on critical API systems, ensuring effective incident, problem, and change management at a global enterprise level
Perform proactive monitoring and management of production API environments, taking a holistic view of system health and performance
Drive the definition, analysis, and reporting of SLIs and SLOs for all supported APIs and clients, ensuring clear performance benchmarks
Contribute to the development and implementation of tools and systems designed to enhance API operational management and the client experience
Measure and optimize API system performance, always pushing capabilities forward, anticipating customer needs, and innovating for continuous improvement
Provide hands-on expert operational support for critical, large-scale distributed API ecosystems
Actively gather and analyze performance metrics from API platforms and underlying infrastructure to assist in performance tuning, fault finding, and capacity planning
Partner closely with API development teams to improve services through rigorous operational feedback loops, testing, and release procedures
Drive the creation of sustainable API operational systems and services through automation and continuous uplifts, including developing, testing, and debugging automated tasks

Fulltime

Equities Electronic Trading Support / SRE

Embark on a transformative journey as an Equities Electronic Trading Support/SRE...

Location

United States , New York

Salary:

120000.00 - 175000.00 USD / Year

Barclays

Expiration Date

Until further notice

Requirements

Working in Unix/Linux environments for production support, troubleshooting, and performance analysis
Writing scripts in Bash and Python to automate operational tasks and improve system reliability
Supporting containerized applications using Kubernetes and Docker in production environments
Supporting electronic trading systems that use FIX messaging in low-latency environments

Job Responsibility

Development and delivery of high-quality software solutions by using industry aligned programming languages, frameworks, and tools. Ensuring that code is scalable, maintainable, and optimized for performance
Cross-functional collaboration with product managers, designers, and other engineers to define software requirements, devise solution strategies, and ensure seamless integration and alignment with business objectives
Collaboration with peers, participation in code reviews, and promotion of a culture of code quality and knowledge sharing
Stay informed of industry technology trends and innovations and actively contribute to the organization’s technology communities to foster a culture of technical excellence and growth
Adherence to secure coding practices to mitigate vulnerabilities, protect sensitive data, and ensure secure software solutions
Implementation of effective unit testing practices to ensure proper code design, readability, and reliability

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime

Credit Risk Support Lead - SRE

Join Barclays as a Credit Risk Support Lead- SRE role, where to effectively moni...

Location

India , Pune

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

14+ years’ experience in production support
High energy, hands-on and results & goal-oriented
Expertise in log debugging, root cause analysis and troubleshooting live issues
Experience with observability tools such as ESaaS, AppD/ITRS, and Netcool
Experience in data analysis to identify underlying themes impacting stability, performance, and customer experience
Ensures and promotes ITIL best practices for Incident, Problem, Change, and Release management (including managing and running triages, conducting root cause analysis, post-incident reviews, etc.)
Strong Credit Risk business knowledge
Negotiate SLAs/OLAs with customer and other support elements
Business (IT) Continuity Management
KPI reporting and monitoring

Job Responsibility

Provision of technical support for the service management function to resolve more complex issues for a specific client or group of clients. Develop the support model and service offering to improve the service to customers and stakeholders.
Execution of preventative maintenance tasks on hardware and software and utilisation of monitoring tools/metrics to identify, prevent and address potential issues and ensure optimal performance.
Maintenance of a knowledge base containing detailed documentation of resolved cases for future reference, self-service opportunities and knowledge sharing.
Analysis of system logs, error messages and user reports to identify the root causes of hardware, software and network issues, and providing a resolution to these issues by fixing or replacing faulty hardware components, reinstalling software, or applying configuration changes.
Automation, monitoring enhancements, capacity management, resiliency, business continuity management, front office specific support and stakeholder management.
Identification and remediation or raising, through appropriate process, of potential service impacting risks and issues.
Proactively assess support activities implementing automations where appropriate to maintain stability and drive efficiency. Actively tune monitoring tools, thresholds, and alerting to ensure issues are known when they occur.

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime

SRE Design & Support Engineer

We are looking for a self-driven, software engineering mindset SRE engineer to •...

Location

India , Hyderabad

Salary:

Not provided

PepsiCo

Expiration Date

Until further notice

Requirements

8-11 years of work experience evolving into an SRE engineer
3-5 years of experience in continuously improving and transforming IT operations ways of working
Bachelor’s degree in Computer Science, Information Technology or a related field
Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
The ideal engineer will be highly quantitative, have great judgment, be able to connect dots across ecosystems, and work efficiently and cross-functionally across teams to ensure SRE orchestration solutions meet customer/end-user expectations
The candidate will take a pragmatic approach to resolving incidents, including the ability to systematically triangulate root causes and work effectively with external and internal teams to meet objectives
Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes, with a track record of improving service offerings: proactively resolving incidents, providing a seamless customer/end-user experience, and proactively identifying and mitigating areas of risk
Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka, and any SRE Ops toolsets
A firm understanding of cloud architecture for distributed environments
Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js

Job Responsibility

Engage & influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining & enforcing events, logging, monitoring, and observability standards across applications
Ensure non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early into the product’s offerings as part of the engineering solution
Act as a proactive SRE support engineer, preventing P1, P2, and potential P3 incidents, diagnosing anomalies before any user is affected, and driving the necessary remediations across the teams involved in end-to-end ecosystem availability, performance, and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
Collaborate with engineering & support teams, including participation in escalations and blameless postmortems
Work closely with customer-facing support teams to empower them with SRE insights and tooling
Observe, diagnose & improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. a technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
Continuously optimize the L2/support operations work via SRE workflow automation
Shape the SRE orchestration platform design with inputs from Production Operations, Business usage & Product and engineering teams
Actively engage and drive AI Ops adoption across teams

SRE Lead Design & Support Engineer

This is a critical enabler achieving a high resiliency during operations and als...

Location

Mexico , Miguel Hidalgo

Salary:

Not provided

PepsiCo

Expiration Date

Until further notice

Requirements

8+ years of work experience evolving into an SRE engineer
3-5 years of experience in continuously improving and transforming IT operations ways of working
Bachelor’s degree in Computer Science, Information Technology or a related field
Proven experience as an SRE in designing the events diagnostics, performance measures and alert solutions to meet the SLA/SLO/SLIs
Highly quantitative, with great judgment, able to connect dots across ecosystems and work efficiently and cross-functionally across teams
Strong expertise in SRE (Site Reliability Engineering) and IT Service Management (ITSM) processes
Hands-on experience in Python, SQL/NoSQL (MySQL, MongoDB, Cassandra, PostgreSQL), AppDynamics, ELK Stack, Grafana, Splunk, Dynatrace, Kafka, and any SRE Ops toolsets
A firm understanding of cloud architecture for distributed environments
Front-end technologies: HTML, CSS, JavaScript, and frameworks like React, Angular, or Vue.js
Back-end technologies: Server-side languages (Java, Spring Boot, and related technologies that build the server-side logic, APIs, and database interaction with MySQL, MongoDB, Cassandra, Couchbase)

Job Responsibility

Drive new shift-left activities critical to applying Site Reliability Engineering (SRE) and quality assurance principles within the application design / project roadmap that enables resilient outcomes
Apply a pre-emptive approach in production to minimize business impact, via SRE-driven orchestration that connects all components of the ecosystem, diagnosing anomalies before users are affected & remediating through automation
Ensure ecosystem availability and performance in production environments, proactively preventing P1, P2, and potential P3 incidents
Engage & influence product and engineering teams during the design and development phases to embed reliability and operability into new services, defining & enforcing events, logging, monitoring, and observability standards across applications
Accountable for ensuring non-functional requirements (NFRs), including SLA/SLO/SLI and error budgets, are embedded early into the product’s offerings as part of the engineering solution
Lead the team in diagnosing anomalies before any user is affected and driving the necessary remediations across the teams involved in end-to-end ecosystem availability, performance, and consumption of the cloud-architected application ecosystem, leveraging SRE orchestration solutions
Collaborate with engineering & support teams, including participation in escalations and blameless postmortems
Work closely with customer-facing support teams to empower them with SRE insights and tooling
Observe, diagnose & improve the end-to-end ecosystem performance of the modern-architected application portfolio, i.e. a technical “understanding of interactions” of a full-stack application, alongside peer SRE team members
Continuously optimize the L2/support operations work via SRE workflow automation

What we offer

Opportunities to learn and develop every day through a wide range of programs
Internal digital platforms that promote self-learning
Development programs according to Leadership skills
Specialized training according to the role
Learning experiences with internal and external providers
Recognition programs for seniority, behavior, leadership, moments of life, among others
Financial wellness programs that will help you reach your goals in all stages of life
A flexibility program that will allow you to balance your personal and work life, adapting your working day to your lifestyle
Wellness Line, thousands of Agreements and Discounts, Scholarship programs for your children, Aid Plans for different moments of life

Principal Site Reliability Engineer

Palo Alto Networks runs a large hybrid infrastructure and is one of the largest ...

Location

United States , Santa Clara

Salary:

151600.00 - 245300.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

BS or MS in Computer Science, a related field, or equivalent professional experience or equivalent military experience
Expertise in configuration management with a framework such as Ansible, Terraform, Helm, Kubernetes
Proficient in Python and/or Go
Expertise in managing applications in the Kubernetes cluster with autoscaling enabled
Experience in Production Engineering, DevOps, or Site Reliability
Expertise in the public cloud (GCP or AWS), especially in GCP
Strong Linux administration, internals, and network troubleshooting
Proficiency with programming languages like Python, Golang, and shell scripting to automate tasks
Experience with CI/CD pipelines, GitLab, and GitHub preferred
Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions

Job Responsibility

Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build, and operate reliable, secure Cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate the deployment of robust services
Orchestrate end-to-end monitoring and alerting
Participate with SRE and Dev teams in the on-call rotation
Lead root cause analysis of critical business and production issues

Fulltime
