Site Reliability Engineer - Production Support

United States, Southlake · Employment contract · 48.56 - 55.77 USD / Hour · Job Posted June 16, 2026

Job Description

At Schwab, you’re empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us “challenge the status quo” and transform the finance industry together. Schwab Technology Services enables the future of how clients manage their money by providing innovative and reliable technology products and services as part of our ongoing commitment to democratize access to investing and financial planning.

Job Responsibility

  • You will play a hands-on role across multiple phases of the systems development lifecycle, including planning, configuration, monitoring, administration, and ongoing maintenance within a business-critical environment
  • Continually evaluate and improve service reliability by implementing effective monitoring, identifying gaps, and driving both tactical fixes and strategic improvements
  • Partner closely with vendors to troubleshoot system-specific issues, manage response expectations, and drive timely delivery
  • Facilitate coordination across vendors, Schwab Infrastructure teams, and application engineering teams to efficiently resolve incidents and operational challenges
  • Provide SRE production operations support for Schwab.com and mobile applications
  • Work with technical and agile teams to prioritize backlog work and incidents
  • Automate deployment, infrastructure and monitoring tools and systems
  • Work through root cause analysis and develop remediation plans to improve reliability

Requirements

  • 2+ years of systems support, scripting, or developing in an enterprise environment
  • Troubleshoots and resolves technical issues using best practices to support live applications
  • Prioritizes compliance, security, and scalability of data systems
  • Executes strategies to restore services promptly and address disruptions
  • Validates requirements and constraints to support effective development and operations
  • Monitors and addresses security policy violations using best practices
  • Solid foundation in Windows and/or Linux
  • Strong verbal and written communication skills
  • Passion for learning, reliability, automation, orchestration and continual improvement

Nice to have

  • Evaluates moderately complex problems by analyzing data and spotting trends
  • Keeps others informed about workstream developments and updates
  • Applies business intelligence and industry insights to support decision making
  • Enables others to make decisions within their scope
  • Recognizes industry trends to foresee opportunities and challenges
  • Contributes ideas for new approaches or technologies within a defined scope
  • Influences peers and collaborates across teams to improve workflow efficiency
  • Verbally conveys complex concepts and engages peers effectively
  • Acts with integrity, responsibly handles information, and communicates with accuracy
  • Crafts impactful messages tailored to audience needs, persuading desired actions
  • Sets challenging goals, monitors progress, and applies problem-solving techniques
  • Evaluates market dynamics and proposes innovative, efficient solutions

What we offer

  • 401(k) with company match and employee stock purchase plan
  • Paid time for vacation, volunteering, and 28-day sabbatical after every 5 years of service for eligible positions
  • Paid parental leave and family building benefits
  • Tuition reimbursement
  • Health, dental, and vision insurance

Similar Jobs for Site Reliability Engineer - Production Support

8 matching positions

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...
Location: Ireland, Dublin
Salary: Not provided
Company: General Motors
Expiration Date: Until further notice
Requirements
  • 8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
  • Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
  • Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
  • Proficiency in Python, Go, or Java, with strong code review and readability standards
  • Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
  • Ability to think and act under pressure
  • Strong communication skills
Job Responsibility
  • Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
  • Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
  • Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
  • Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
  • Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
  • Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
  • Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
  • Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling
Employment type: Full-time

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location: United States, San Francisco
Salary: 230000.00 - 345000.00 USD / Year
Company: Lambda
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
Employment type: Full-time

Site Reliability Engineer

Location: South Africa, Johannesburg
Salary: Not provided
Company: Nintex
Expiration Date: Until further notice
Requirements
  • You provide guidance on infrastructure architecture and contribute to high-quality and successful product releases.
  • You contribute to your team and domain through successfully leading and consistently delivering on projects of ambiguous scope, high complexity, and critical business impact.
  • You contribute to relevant guilds, practice forums and other initiatives to improve Nintex’s DevOps and SRE discipline.
  • You have an in-depth understanding of distributed systems architecture, as well as monitoring and observability practices and tools.
  • You quickly resolve priority infrastructure issues and help other technical team members or Product Managers understand how to avoid them in the future.
  • You provide detailed estimates for work items you propose or are assigned.
  • You assist in decision-making around tooling, automation practices, and testing solutions.
  • You stay up-to-date with technology trends and use this knowledge to help your team and the broader Engineering practice.
  • You run Nintex infrastructure with IaC tools (such as Terraform) and GitHub Actions for automation, containerize our environments (Kubernetes) and leverage cloud technologies to meet our goals.
  • You build monitoring that alerts on symptoms rather than outages using tools like Prometheus, Grafana, Alertmanager and PagerDuty
Job Responsibility
  • You are highly skilled and sufficiently experienced in Nintex DevOps tools and processes to own a long-term program or technology such as Kubernetes, etc.
  • You write scripts, tools and utilities that support and integrate with delivery pipelines and you integrate telemetry where appropriate.
  • You are called into incidents and bring trusted knowledge in your platform domain.
  • You debug and fix infrastructure issues on production environments quickly using the relevant tools and guidelines to prevent recurrence.
  • You build, promote and support infrastructure patterns and practices within Nintex.
  • You provide coaching/mentoring to other Engineers on the team
  • You lead or contribute to post-mortems for incidents, including root cause analysis and identification of preventative and remedial actions.
  • You continuously monitor our platform performance and take immediate action to improve it
  • You review and advise on appropriate design patterns to solve automation and infrastructure problems without creating technical debt.
  • You design and build complex infrastructure components for distributed systems such as Kubernetes.
What we offer
  • Global Gratitude and Recharge Days
  • Flexible, paid time off policy
  • Employee wellness programs and counseling resources
  • Meaningful peer recognition and awards
  • Paid parental leave
  • Invention/patenting assistance
  • Community impact, paid volunteer time, and opportunities
  • Intercultural learning and celebration
  • Multiple tools through which to learn and grow, and an incredible global community

Principal Site Reliability Engineer (Sovereign Cloud)

Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 6+ years as a DevOps engineer with a passion for technology, strong motivation, and a sense of responsibility
  • Proficiency in DevOps and Platform Engineering with expertise in AWS, GCP, Terraform, ArgoCD, Kubernetes, and related tools
  • Experience in developing and maintaining CI/CD pipelines for continuous delivery in agile environments
  • Skilled in managing cloud infrastructure, particularly with AWS and GCP, and adept in infrastructure as code practices using Terraform/Terragrunt
  • Demonstrated capability in supporting high-scale SaaS applications, focusing on scalability, reliability, and performance
  • Strong communication, strategic thinking, and problem-solving skills
  • Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
  • Ready to understand and dissect new technology stacks quickly
Job Responsibility
  • Implement and optimize CI/CD pipelines and cloud infrastructure using our technology stack, ensuring efficient and reliable deployment to production
  • Participate in the deployment of monitoring and alerting systems to maintain high system performance and reliability
  • Collaborate with software development and other cross-functional teams to streamline and enhance processes, aiming for efficiency and alignment with business goals
  • Contribute to the management of the cloud infrastructure, utilizing Infrastructure as Code principles
  • Participate in on-call rotations to support critical business and production systems
Employment type: Full-time

Sr Principal Site Reliability Engineer (Sovereign Cloud)

The Prisma Access team is seeking a seasoned Principal Site Reliability Engineer...
Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 10+ years of experience in Infrastructure, SRE, or DevOps roles
  • BS or MS in Computer Science, a related field, or equivalent professional experience
  • 7+ years of experience with GCP, and expertise in their architecture, services and PKI concepts for cloud security
  • Expert troubleshooting skills to resolve cloud infrastructure and service issues, effectively identifying root cause and devising effective solutions
  • Proficiency in automation using Python and shell scripting
  • Expertise in Infrastructure as Code (IaC) with Terraform and Helm, leveraging AI tools for development
  • Solid experience with Kubernetes, container networking, and container workloads
  • Strong Linux administration skills
  • Proficiency with CI/CD pipelines, GitOps principles, and tooling like GitLab and Jenkins
  • Excellent written and verbal communication skills, with the ability to collaborate effectively to drive outcomes
Job Responsibility
  • Design, build, and operate reliable, secure Cloud infrastructure across multi-cloud environments for our sovereign customers
  • Lead cross-functional initiatives to ensure applications are production-ready, scalable, secure, and resilient
  • Develop expertise in new technologies, embracing continuous learning and the adoption of AI tools
  • Develop tools and automation frameworks, championing Infrastructure as Code (IaC) and Monitoring as Code (MaC) principles
  • Automate robust deployments and orchestrate end-to-end monitoring and alerting solutions
  • Participate in on-call rotations to support critical business and production systems
  • Lead root cause analysis of critical issues, driving improvements and preventing recurrence
  • Champion the success of SRE and DevOps initiatives, aligning technical decisions with business goals
Employment type: Full-time

Sr Principal Site Reliability Engineer (Sovereign Cloud)

Palo Alto Networks runs a large infrastructure and is one of the largest GCP cus...
Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 10+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
  • 7+ years building high availability, scalable cloud-native applications on AWS and GCP
  • BS or MS in Computer Science, a related field, or equivalent professional experience required
  • Expertise in configuration management with a framework such as Ansible, Terraform, or Helm
  • Passion for infrastructure and monitoring as code
  • Solid experience in container workloads and Kubernetes
  • Familiarity with PKI and networking concepts
  • In-depth knowledge of different security controls (app-id, user-id, security profile, url category, content, ssl decryption, firewall MFA, etc.)
  • Linux administration, internals, and network troubleshooting
  • Proficiency with programming languages like Golang or Python along with shell scripting to automate tasks
Job Responsibility
  • Contribute to the success of SRE and DevOps
  • Develop expertise in new technologies
  • Work with developers, researchers, data scientists, and security experts
  • Design, build and operate reliable, secure Cloud infrastructure
  • Ensure that applications are production-ready, scalable, and reliable
  • Develop tools and automation frameworks
  • Automate robust deployment of services
  • Orchestrate end-to-end monitoring and alerting
  • Participate in on-call rotations to support critical business and production systems
  • Lead root cause analysis of critical business and production issues
Employment type: Full-time

Principal Site Reliability Engineer (Sovereign Cloud)

As a Principal Site Reliability Engineer, you will serve as the technical author...
Location: Bulgaria, Sofia
Salary: Not provided
Company: Palo Alto Networks
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Infrastructure, SRE, or DevOps roles
  • BS or MS in Computer Science, a related field, or equivalent professional experience
  • Kubernetes Mastery: Expert-level experience (6+ years) managing production K8s workloads (preferably within GKE, but will also consider EKS)
  • Deep understanding of Networking, Storage, and RBAC
  • CI/CD & GitOps: Hands-on expertise with ArgoCD and modern pipeline runners (GitHub Actions, GitLab CI, or Jenkins)
  • Programming: Proficient in Python for systems programming and automation
  • Security Mindset: Proven experience integrating security scanning and compliance checks within a containerized environment
  • Modern Workflow: Experience (or strong desire) using AI-pair programming tools like Cursor and Claude to multiply personal and team productivity
  • Excellent written and verbal communication, able to collaborate and rally support
  • Self-disciplined, self-managed, self-motivated, strong sense of ownership, urgency, and drive
Job Responsibility
  • Infrastructure Leadership: Architect and oversee large-scale Kubernetes clusters in GKE, ensuring high availability, performance tuning, and cost optimization
  • GitOps & Orchestration: Design and refine complex CI/CD lifecycles using ArgoCD, moving toward a fully declarative infrastructure-as-code model
  • Security Engineering: Implement and manage security scanning tools (e.g., Prisma Cloud, Snyk, or GKE native security) to ensure container integrity and shift-left security compliance
  • Automation & Tooling: Develop sophisticated automation scripts and internal tools using Python to eliminate manual toil and improve system observability
  • AI-Driven Development: Lean into the future of engineering by utilizing Cursor and Claude to accelerate coding, debugging, and documentation tasks
  • Incident Management: Act as a final escalation point for complex infrastructure outages, conducting blameless post-mortems to drive systemic improvements
  • Participate in on-call rotations to support critical business and production systems
Employment type: Full-time

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...
Location: United States
Salary: 116633.00 - 181243.00 USD / Year
Company: Wikimedia Foundation
Expiration Date: Until further notice
Requirements
  • Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
  • Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
  • CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
  • Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
  • SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
  • Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
  • Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
  • Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
  • Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
  • Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements
Job Responsibility
  • Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
  • Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
  • Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
  • Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
  • Partner with engineering team members to embed reliability best practices early in the development lifecycle
  • Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
  • Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
  • Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
  • Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
  • Reduce operational toil by identifying repetitive work and implementing automation-first solutions
Employment type: Full-time