Site Reliability Engineer Job at Charles Schwab (Austin)

New

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...

Location

United Kingdom, London

Salary:

150000.00 GBP / Year

Hunter Bond

Expiration Date

Until further notice

Requirements

Genuine passion for Linux and open source
Excellent knowledge of Python
Use of CI/CD, Docker, Ansible, Chef, Puppet
Knowledge of large-scale storage systems (on-prem)

Job Responsibility

Help architect resilient, multi-petabyte storage solutions and build new data centres
Automate anything and everything with Python & config tools
Innovate whilst bringing in new ideas

What we offer

Flexible hours/work options
Working in one of the world’s most elite teams
Heavy investment in cutting-edge and next-gen tech
Technologists only report to other technologists
Brand new skyline Manhattan office
Start-up style environment

Full-time

New

Site Reliability Engineer

Microsoft is a company where passionate innovators come to collaborate, envision...

Location

India, Bangalore

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 3+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements
Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Own the end-to-end readiness of Event Stream across Azure regions, including onboarding new regions, driving deployment automation, and ensuring consistent, secure, and compliant service rollout
Work closely with platform, infrastructure, and partner teams (e.g., Event Hubs, Kusto, Fabric platform) to deliver resilient, low-latency streaming experiences on a global scale
Play a key role in advancing our reliability posture, improving availability, monitoring, and incident response across regions
Build strong observability, telemetry, and automated recovery mechanisms to meet high availability and SLA targets
Region Build-out & Deployment: Onboard new regions, drive deployment automation, and ensure consistent service configuration
Reliability & SRE: Improve availability, resiliency, and incident response; own service health across regions
Observability & Operations: Enhance telemetry, monitoring, alerting, and troubleshooting capabilities
Cross-team Collaboration: Partner with platform and infra teams to unblock dependencies and ensure smooth rollout
Production Excellence: Drive root-cause analysis, repair items, and continuous improvement on service reliability

Full-time

New

Site Reliability Engineer

Location

United Kingdom, Newcastle

Salary:

Not provided

Trimble Inc.

Expiration Date

Until further notice

Requirements

Bachelor's or Master's degree in Computer Engineering or a related field
At least 5 years of technical experience with a proven ability to take full ownership of production infrastructure
Excellent collaboration skills, with experience leading cross-functional work
Demonstrated success in managing infrastructure in production environments
Expertise in capacity planning and cost optimisation for efficient operations
Extensive expertise managing cloud provider hosted infrastructure, specifically with Microsoft Azure or AWS
Proficient in high-level scripting languages like Python and Infrastructure as Code tools (Terraform), along with containerisation
Demonstrated success with Kubernetes or other containerization technologies
Familiarity with CI/CD pipelines and tools such as Azure DevOps, Jenkins, Argo CD, Helm, GitHub
Experience with incident management processes and monitoring tools such as Prometheus, Grafana, New Relic, Datadog, Splunk, CloudWatch, and Sumo Logic

Job Responsibility

Develop and maintain scalable infrastructure as code (IaC) using Terraform to ensure reliable and scalable cloud environments
Implement and enhance observability solutions using tools like New Relic, Datadog, Sumo Logic, and Splunk for monitoring, logging, and alerting
Perform code deployments and manage CI/CD pipelines using Jenkins, GitHub, and related tooling to ensure smooth and efficient delivery processes
Automate routine tasks and workflows to increase operational efficiency and reduce manual intervention
Evaluate system designs and architectures for reliability, performance, security, and efficiency, ensuring best practices are followed
Lead incident response efforts and conduct deep-dive root cause analysis to implement long-term, innovative technical solutions
Develop and maintain comprehensive runbooks and procedures for incident response and operational tasks
Collaborate with cross-functional teams to review and provide feedback on technical designs, ensuring alignment with SRE principles
Participate in on-call rotations and handle critical incidents with confidence and expertise
Continuously improve documentation for systems and services, contributing to a knowledge-sharing culture within the team

New

Site Reliability Engineer

Shape the Future of Intelligent Operations as a Site Reliability Engineer (AI Op...

Location

India, Chennai

Salary:

Not provided

Trimble Inc.

Expiration Date

Until further notice

Requirements

1 to 2 years of professional experience in a DevOps, MLOps, or systems engineering environment
Bachelor's degree in Computer Science, Engineering, Information Technology, or a closely related technical field
Direct experience with the Microsoft Azure cloud platform and its specialized ecosystem services (such as Azure ML and Azure DevOps)
Proficiency with Python or other scripting languages (Shell / Bash / PowerShell) for rapid system integration and task automation
Foundational understanding of containerization (Docker), basic orchestration concepts (Kubernetes fundamentals), and version control system workflows (Git)
Solid baseline knowledge of fundamental DevOps principles (CI/CD, system administration) and a basic understanding of the end-to-end machine learning model lifecycle

Job Responsibility

Assist in the deployment and maintenance of machine learning models in production under direct supervision, building skills in containerization and orchestration architectures
Support the development of robust continuous integration and deployment pipelines for ML workflows, including model versioning, automated testing, and release processes
Monitor production ML model performance, detect data drift, and track system health by implementing foundational logging, alerting, and metrics solutions
Contribute to infrastructure automation and configuration management for machine learning workloads, learning to treat infrastructure as software
Partner closely with ML engineers and data scientists to operationalize complex models, ensuring reliability, scale, and strict adherence to established operational patterns

What we offer

Structured environment to accelerate technical skills
Direct guidance from experienced engineering professionals
Projects that improve productivity, quality, safety, transparency and sustainability
Collaborative and supportive team
Entrepreneurial spirit empowering proactive doers
Flexible work arrangements

Full-time

New

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Westlak...

Location

United States, Westlake

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

5+ years of experience in Site Reliability Engineering, DevOps Engineering, Platform Engineering, or related disciplines (understanding reliability engineering principles, SLIs, SLOs, error budgets, and operational excellence)
5+ years’ hands-on Terraform experience
5+ years’ experience supporting mission-critical enterprise applications in production environments
5+ years’ experience with cloud networking, security, and infrastructure architecture
5+ years of hands-on experience managing hybrid cloud environments
5+ years of automation experience using Python, Ansible, Shell scripting, or similar technologies
5+ years’ experience building reusable infrastructure modules and automated deployment frameworks

Job Responsibility

Design, implement, and support highly available load balancing solutions using F5 BIG-IP, Broadcom AVI, and cloud-native load balancing services
Build and maintain Infrastructure-as-Code (IaC) solutions using Terraform
Develop automation solutions for infrastructure provisioning, configuration management, and operational workflows
Support and enhance CI/CD pipelines using tools such as Jenkins, Azure DevOps, GitHub Actions, or similar platforms
Collaborate with application, cloud, network, and platform teams to improve reliability, performance, and scalability
Monitor production systems and proactively identify reliability, performance, and availability risks
Implement Site Reliability Engineering best practices including observability, incident management, capacity planning, and resiliency engineering
Troubleshoot complex issues across networking, cloud infrastructure, load balancing, and application environments
Support hybrid infrastructure environments spanning on-premises datacenters and public cloud platforms
Participate in on-call rotation and provide production support for critical business applications

Full-time

Site Reliability Engineer

RED Global is currently supporting one of our international clients in their sea...

Location

Netherlands, Utrecht

Salary:

Not provided

RED Commerce - The Global SAP Solutions Provider

Expiration Date

Until further notice

Requirements

Strong experience as a Site Reliability Engineer
Experience supporting and maintaining reliable, scalable production environments
Strong troubleshooting and incident management capabilities
Experience working within complex enterprise environments
Strong communication and stakeholder management skills

Site Reliability Engineer

Qargo is a cloud-based (SaaS) Transport Management Platform. We are a scale-up b...

Location

Belgium, Ghent

Salary:

Not provided

Qargo

Expiration Date

Until further notice

Requirements

Experience as a Software Engineer, with an interest in infrastructure, scalability, reliability
Strong programming skills (preferably Python or similar backend languages)
Experience working with cloud platforms, container orchestrators, serverless (preferably Google Cloud)
Familiarity with distributed systems and scalability challenges
Experience with CI/CD pipelines and automation
Solid understanding of databases and performance tuning (SQL and/or NoSQL)
Familiarity with monitoring and observability tools
A problem-solving mindset and the ability to think in systems
Strong collaboration skills and a proactive approach to improving systems

Job Responsibility

Build and maintain systems and tooling that improve the reliability, scalability, and performance of our platform
Improve the software delivery cycle, focusing on automation and developer experience
Develop internal tools and services to reduce manual operational work
Improve observability by implementing monitoring, logging, and alerting across systems
Optimize system performance, including databases such as PostgreSQL and Firestore
Collaborate with backend engineers and other engineering teams to design reliable and scalable system architectures
Troubleshoot complex production issues and implement long-term fixes
Continuously improve infrastructure (Infrastructure as Code, automation, etc.)

What we offer

A fast-growing SaaS company with a strong mission and an impact-driven team
A flexible work environment with flexible hours and hybrid working
A green office with a great atmosphere and lots of initiatives
A role with a lot of responsibility, ownership, and tangible impact
The opportunity to grow with us and shape both your career and our platform

Full-time

Site Reliability Engineer

We are looking for a Site Reliability Engineer (SRE) to support reliable, high-p...

Location

United States, Novi

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Information Technology, Computer Science, Computer Engineering, or comparable practical experience
At least 5 years of experience supporting production environments in a corporate, startup, or similarly fast-paced technical setting
Hands-on expertise with infrastructure as code, including Terraform, along with experience in cloud platforms and related services
Working knowledge of container technologies such as Docker and orchestration platforms like Kubernetes
Experience supporting live systems, participating in on-call rotations, and contributing to incident reviews and corrective actions
Proficiency with automation and scripting using Bash and Python to reduce manual operational effort
Strong communication skills with the ability to explain technical decisions and tradeoffs to cross-functional or non-technical stakeholders
Willingness and ability to travel to customer or plant locations as business needs require

Job Responsibility

Maintain dependable and secure production environments across plant-edge and cloud-based systems, with a focus on uptime, responsiveness, and operational stability
Design, refine, and support monitoring dashboards, alerting frameworks, and operational runbooks using tools such as Prometheus, Grafana, and modern telemetry solutions
Build and manage infrastructure through code using Terraform, applying version control standards, peer reviews, and controlled deployment processes
Create automation scripts and lightweight tools in Bash and Python to streamline routine operations, recovery procedures, backup workflows, and environment setup
Take part in incident response and on-call coverage, troubleshoot service disruptions, coordinate initial communication, and document follow-up actions through blameless reviews
Establish and measure service reliability indicators and objectives, helping stakeholders balance system dependability with release speed and operational risk
Support secure connectivity between factory networks and cloud resources by configuring and maintaining VPNs, routing, private networking, and access controls
Administer and optimize relational or time-series databases, including backup planning, replication, performance tuning, and long-term storage health
Contribute to CI/CD delivery practices by improving deployment pipelines, supporting controlled release strategies, and preparing rollback procedures when needed
Partner with controls, software, and data teams to enable reliable data flow from industrial systems and ensure safe deployment to edge infrastructure

What we offer

Medical, vision, dental, life, and disability insurance
401(k) plan
