
Site Reliability Engineer

United States, Austin · 126000.00 - 140000.00 USD / Year · Job Posted June 10, 2026

Job Description

At Schwab, you’re empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us challenge the status quo and transform the finance industry together. We believe in the importance of in-office collaboration and fully intend for the selected candidate for this role to work on site in the specified location(s).

As a Sr Specialist – Site Reliability Engineer (SRE) within Client Data Technology, you will play a critical role in ensuring the availability, performance, and resiliency of highly visible cloud-based platforms and applications. In this role, you will influence how systems are designed, built, and operated, driving measurable improvements in reliability and scalability while advancing modern SRE practices across the organization. You will partner closely with engineering and platform teams to define and implement sustainable operating models, enabling consistent, repeatable, and high-performing systems at scale.

Your impact will include identifying and executing opportunities to enhance service health and telemetry, shaping and delivering forward-looking resiliency and availability roadmaps, and leading the adoption of cloud-native technologies aligned with established SRE standards. Through strong collaboration and technical leadership, you will promote a proactive, shift-left approach that embeds reliability, fault tolerance, and performance into the development lifecycle from the start. This role requires a balance of strategic thinking and hands-on problem-solving to optimize systems, reduce operational toil, and improve key metrics such as MTTD and MTTR, ultimately ensuring a seamless and reliable experience for clients.

Job Responsibility

  • Ensure availability, performance, and resiliency of highly visible cloud-based platforms and applications
  • Influence how systems are designed, built, and operated, driving measurable improvements in reliability and scalability
  • Partner closely with engineering and platform teams to define and implement sustainable operating models
  • Identify and execute opportunities to enhance service health and telemetry
  • Shape and deliver forward-looking resiliency and availability roadmaps
  • Lead adoption of cloud-native technologies aligned with established SRE standards
  • Promote a proactive shift-left approach embedding reliability, fault tolerance, and performance into the development lifecycle
  • Optimize systems, reduce operational toil, and improve key metrics such as MTTD and MTTR

Requirements

  • Bachelor's degree in Computer Engineering, Computer Science, or related field
  • 6+ years of software development and site reliability engineering experience supporting production applications in cloud environments such as Pivotal Cloud Foundry (PCF) or Google Cloud Platform (GCP)
  • 4+ years of DevOps engineering leadership experience focused on automation, tooling, and improving production operations
  • 2+ years of technical leadership experience guiding engineering teams and driving operational efficiencies
  • 2+ years of experience implementing and maturing operational best practices, including SLOs, SLIs, error budgets, monitoring, capacity planning, and incident management processes
  • Proficiency in programming and automation using tools such as Python, CloudFormation, or Terraform to build infrastructure-as-code solutions
  • Strong knowledge of database technologies (SQL, Aerospike, Postgres)
  • Experience working with messaging and streaming platforms such as RabbitMQ and Kafka

Nice to have

  • 4+ years of advanced technical leadership experience supporting highly skilled engineering teams
  • Demonstrated ability to influence development teams to design and build cloud-native systems that are scalable, maintainable, and resilient from initial deployment onward

What we offer

  • 401(k) with company match
  • Employee stock purchase plan
  • Paid time for vacation
  • Volunteering time
  • 28-day sabbatical after every 5 years of service for eligible positions
  • Paid parental leave
  • Family building benefits
  • Tuition reimbursement
  • Health insurance
  • Dental insurance
  • Vision insurance

Similar Jobs for Site Reliability Engineer

8 matching positions

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...
Location: United Kingdom, London
Salary: 150000.00 GBP / Year
Company: Hunter Bond
Expiration Date: Until further notice
Requirements
  • Genuine passion for Linux and open source
  • Excellent knowledge of Python
  • Use of CI/CD, Docker, Ansible, Chef, Puppet
  • Knowledge of large-scale storage systems (on-prem)
Job Responsibility
  • Help architect resilient, multi-petabyte storage solutions and build new data centres
  • Automate anything and everything with Python & config tools
  • Innovate whilst bringing in new ideas
What we offer
  • Flexible hours/work options
  • Working in one of the world’s most elite teams
  • Invest heavily in cutting-edge and next-gen tech
  • Technologists only report to other technologists
  • Brand new skyline Manhattan office
  • Start-up style environment
Employment type: Full-time

Site Reliability Engineer

Microsoft is a company where passionate innovators come to collaborate, envision...
Location: India, Bangalore
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 3+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Own the end-to-end readiness of Event Stream across Azure regions, including onboarding new regions, driving deployment automation, and ensuring consistent, secure, and compliant service rollout
  • Work closely with platform, infrastructure, and partner teams (e.g., Event Hubs, Kusto, Fabric platform) to deliver resilient, low-latency streaming experiences on a global scale
  • Play a key role in advancing our reliability posture, improving availability, monitoring, and incident response across regions
  • Build strong observability, telemetry, and automated recovery mechanisms to meet high availability and SLA targets
  • Region Build-out & Deployment: Onboard new regions, drive deployment automation, and ensure consistent service configuration
  • Reliability & SRE: Improve availability, resiliency, and incident response; own service health across regions
  • Observability & Operations: Enhance telemetry, monitoring, alerting, and troubleshooting capabilities
  • Cross-team Collaboration: Partner with platform and infra teams to unblock dependencies and ensure smooth rollout
  • Production Excellence: Drive root-cause analysis, repair items, and continuous improvement on service reliability
Employment type: Full-time

Site Reliability Engineer

Location: United Kingdom, Newcastle
Salary: Not provided
Company: Trimble Inc.
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Engineering or a related field
  • At least 5 years of technical experience with a proven ability to take full ownership of production infrastructure
  • Excellent collaboration skills, including experience leading cross-functional work
  • Demonstrated success in managing infrastructure in production environments
  • Expertise in capacity planning and cost optimisation for efficient operations
  • Extensive expertise managing cloud provider hosted infrastructure, specifically with Microsoft Azure or AWS
  • Proficient in high-level scripting languages like Python and Infrastructure as Code tools (Terraform), along with containerisation
  • Demonstrated success with Kubernetes or other containerization technologies
  • Familiarity with CI/CD pipelines and tools such as Azure DevOps, Jenkins, Argo CD, Helm, GitHub
  • Experience with incident management processes and with monitoring tools such as Prometheus, Grafana, New Relic, Datadog, Splunk, CloudWatch, and Sumo Logic
Job Responsibility
  • Develop and maintain scalable infrastructure as code (IaC) using Terraform to ensure reliable and scalable cloud environments
  • Implement and enhance observability solutions using tools like New Relic, Datadog, Sumo Logic, and Splunk for monitoring, logging, and alerting
  • Perform code deployments and manage CI/CD pipelines using Jenkins, GitHub, and related tooling to ensure smooth and efficient delivery processes
  • Automate routine tasks and workflows to increase operational efficiency and reduce manual intervention
  • Evaluate system designs and architectures for reliability, performance, security, and efficiency, ensuring best practices are followed
  • Lead incident response efforts and conduct deep-dive root cause analysis to implement long-term, innovative technical solutions
  • Develop and maintain comprehensive runbooks and procedures for incident response and operational tasks
  • Collaborate with cross-functional teams to review and provide feedback on technical designs, ensuring alignment with SRE principles
  • Participate in on-call rotations and handle critical incidents with confidence and expertise
  • Continuously improve documentation for systems and services, contributing to a knowledge-sharing culture within the team

Site Reliability Engineer

Shape the Future of Intelligent Operations as a Site Reliability Engineer (AI Op...
Location: India, Chennai
Salary: Not provided
Company: Trimble Inc.
Expiration Date: Until further notice
Requirements
  • 1 to 2 years of professional experience in a DevOps, MLOps, or systems engineering environment
  • Bachelor's degree in Computer Science, Engineering, Information Technology, or a closely related technical field
  • Direct experience with the Microsoft Azure cloud platform and its specialized ecosystem services (such as Azure ML and Azure DevOps)
  • Proficiency with Python or other scripting languages (Shell / Bash / PowerShell) for rapid system integration and task automation
  • Foundational understanding of containerization (Docker), basic orchestration concepts (Kubernetes fundamentals), and version control system workflows (Git)
  • Solid baseline knowledge of fundamental DevOps principles (CI/CD, system administration) and a basic understanding of the end-to-end machine learning model lifecycle
Job Responsibility
  • Assist in the deployment and maintenance of machine learning models in production under direct supervision, building skills in containerization and orchestration architectures
  • Support the development of robust continuous integration and deployment pipelines for ML workflows, including model versioning, automated testing, and release processes
  • Monitor production ML model performance, detect data drift, and track system health by implementing foundational logging, alerting, and metrics solutions
  • Contribute to infrastructure automation and configuration management for machine learning workloads, learning to treat infrastructure as software
  • Partner closely with ML engineers and data scientists to operationalize complex models, ensuring reliability, scale, and strict adherence to established operational patterns
What we offer
  • Structured environment to accelerate technical skills
  • Direct guidance from experienced engineering professionals
  • Projects that improve productivity, quality, safety, transparency and sustainability
  • Collaborative and supportive team
  • Entrepreneurial spirit empowering proactive doers
  • Flexible work arrangements
Employment type: Full-time

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Westlak...
Location: United States, Westlake
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in Site Reliability Engineering, DevOps Engineering, Platform Engineering, or related disciplines (understanding reliability engineering principles, SLIs, SLOs, error budgets, and operational excellence)
  • 5+ years’ hands-on Terraform experience
  • 5+ years’ experience supporting mission-critical enterprise applications in production environments
  • 5+ years’ experience with cloud networking, security, and infrastructure architecture
  • 5+ years of hands-on experience managing hybrid cloud environments
  • 5+ years of automation experience using Python, Ansible, Shell scripting, or similar technologies
  • 5+ years’ experience building reusable infrastructure modules and automated deployment frameworks
Job Responsibility
  • Design, implement, and support highly available load balancing solutions using F5 BIG-IP, Broadcom AVI, and cloud-native load balancing services
  • Build and maintain Infrastructure-as-Code (IaC) solutions using Terraform
  • Develop automation solutions for infrastructure provisioning, configuration management, and operational workflows
  • Support and enhance CI/CD pipelines using tools such as Jenkins, Azure DevOps, GitHub Actions, or similar platforms
  • Collaborate with application, cloud, network, and platform teams to improve reliability, performance, and scalability
  • Monitor production systems and proactively identify reliability, performance, and availability risks
  • Implement Site Reliability Engineering best practices including observability, incident management, capacity planning, and resiliency engineering
  • Troubleshoot complex issues across networking, cloud infrastructure, load balancing, and application environments
  • Support hybrid infrastructure environments spanning on-premises datacenters and public cloud platforms
  • Participate in on-call rotation and provide production support for critical business applications
Employment type: Full-time

Site Reliability Engineer

RED Global is currently supporting one of our international clients in their sea...
Location: Netherlands, Utrecht
Salary: Not provided
Company: RED Commerce - The Global SAP Solutions Provider
Expiration Date: Until further notice
Requirements
  • Strong experience as a Site Reliability Engineer
  • Experience supporting and maintaining reliable, scalable production environments
  • Strong troubleshooting and incident management capabilities
  • Experience working within complex enterprise environments
  • Strong communication and stakeholder management skills

Site Reliability Engineer

Qargo is a cloud-based (SaaS) Transport Management Platform. We are a scale-up b...
Location: Belgium, Ghent
Salary: Not provided
Company: Qargo
Expiration Date: Until further notice
Requirements
  • Experience as a Software Engineer, with an interest in infrastructure, scalability, reliability
  • Strong programming skills (preferably Python or similar backend languages)
  • Experience working with cloud platforms, container orchestrators, serverless (preferably Google Cloud)
  • Familiarity with distributed systems and scalability challenges
  • Experience with CI/CD pipelines and automation
  • Solid understanding of databases and performance tuning (SQL and/or NoSQL)
  • Familiarity with monitoring and observability tools
  • A problem-solving mindset and the ability to think in systems
  • Strong collaboration skills and a proactive approach to improving systems
Job Responsibility
  • Build and maintain systems and tooling that improve the reliability, scalability, and performance of our platform
  • Improve software delivery cycle, focusing on automation and developer experience
  • Develop internal tools and services to reduce manual operational work
  • Improve observability by implementing monitoring, logging, and alerting across systems
  • Optimize system performance, including databases such as PostgreSQL and Firestore
  • Collaborate with backend engineers and other engineering teams to design reliable and scalable system architectures
  • Troubleshoot complex production issues and implement long-term fixes
  • Continuously improve infrastructure (Infrastructure as Code, automation, etc.)
What we offer
  • A fast-growing SaaS company with a strong mission and an impact-driven team
  • A flexible work environment with flexible hours and hybrid working
  • A green office with a great atmosphere and lots of initiatives
  • A role with a lot of responsibility, ownership, and tangible impact
  • The opportunity to grow with us and shape both your career and our platform
Employment type: Full-time

Site Reliability Engineer

We are looking for a Site Reliability Engineer (SRE) to support reliable, high-p...
Location: United States, Novi
Salary: Not provided
Company: Robert Half
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Information Technology, Computer Science, Computer Engineering, or comparable practical experience
  • At least 5 years of experience supporting production environments in a corporate, startup, or similarly fast-paced technical setting
  • Hands-on expertise with infrastructure as code, including Terraform, along with experience in cloud platforms and related services
  • Working knowledge of container technologies such as Docker and orchestration platforms like Kubernetes
  • Experience supporting live systems, participating in on-call rotations, and contributing to incident reviews and corrective actions
  • Proficiency with automation and scripting using Bash and Python to reduce manual operational effort
  • Strong communication skills with the ability to explain technical decisions and tradeoffs to cross-functional or non-technical stakeholders
  • Willingness and ability to travel to customer or plant locations as business needs require
Job Responsibility
  • Maintain dependable and secure production environments across plant-edge and cloud-based systems, with a focus on uptime, responsiveness, and operational stability
  • Design, refine, and support monitoring dashboards, alerting frameworks, and operational runbooks using tools such as Prometheus, Grafana, and modern telemetry solutions
  • Build and manage infrastructure through code using Terraform, applying version control standards, peer reviews, and controlled deployment processes
  • Create automation scripts and lightweight tools in Bash and Python to streamline routine operations, recovery procedures, backup workflows, and environment setup
  • Take part in incident response and on-call coverage, troubleshoot service disruptions, coordinate initial communication, and document follow-up actions through blameless reviews
  • Establish and measure service reliability indicators and objectives, helping stakeholders balance system dependability with release speed and operational risk
  • Support secure connectivity between factory networks and cloud resources by configuring and maintaining VPNs, routing, private networking, and access controls
  • Administer and optimize relational or time-series databases, including backup planning, replication, performance tuning, and long-term storage health
  • Contribute to CI/CD delivery practices by improving deployment pipelines, supporting controlled release strategies, and preparing rollback procedures when needed
  • Partner with controls, software, and data teams to enable reliable data flow from industrial systems and ensure safe deployment to edge infrastructure
What we offer
  • Medical, vision, dental, life, and disability insurance
  • 401(k) plan