Staff Engineer, Site Reliability Engineer Job at General Motors (Dublin)

Staff Site Reliability Engineer - Incident Management & Reliability

We’re not just building better tech. We’re rewriting how data moves and what the...

Location

Canada

Salary:

225100.00 - 264500.00 CAD / Year

Confluent

Expiration Date

Until further notice

Requirements

10+ years of relevant experience in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure
Experience navigating reliability/incident programs at 500+ engineer organizations
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Strong understanding of distributed systems and failure modes at scale
Deep experience with observability: metrics, logging, tracing
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Strong written communication (design docs, runbooks, post-mortems)
Experience driving org-wide process and cultural changes

Job Responsibility

Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
Own standards, practices, and continuous improvement of incident response across engineering
Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
Develop and deliver training programs; coach teams through post-mortems
Partner with engineering leaders to elevate reliability practices org-wide

What we offer

Remote-First Work
Robust Insurance Benefits
Flexible Time Away
The Best Teammates
Experience Ambassadors
Open and Honest Culture
Well-Being and Growth
Offers Equity

Fulltime

Senior Staff Site Reliability Engineer

Fivetran is looking for a high-performance, experienced engineer to be a part of...

Location

India, Bengaluru

Salary:

Not provided

Fivetran

Expiration Date

Until further notice

Requirements

12+ years of experience working with SaaS products at scale
Working knowledge of managed Kubernetes (EKS, AKS and GKE)
Knowledge of Cloud Platforms and related tooling: AWS, Azure, Google Cloud (GCP), Terraform, Ansible, Buildkite, Pulumi and ArgoCD
Experience in Python/Shell scripting and Go; Java is a bonus
Experience with Linux operating systems internals and administration
Experience with cloud networking such as Site-to-Site VPNs, Privatelinks, and Private Service Connect (GCP)

Job Responsibility

Responsible for ongoing reliability and robustness of Fivetran’s production infrastructure by monitoring availability, capacity, and throughput
Evolve systems by adding reliability into our product roadmap
Coordinate the re-prioritization or fixing of critical bugs for support or sales requirements as needed
Recommend improvements to production infrastructure, working with engineering to ensure 100% availability
Ensure scalable artifact deployment to all environments through automation scripts
Constantly monitor infrastructure vulnerabilities and remedy them by working with the security team

What we offer

100% employer-paid medical insurance
Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
RSU stock grants
Professional development and training opportunities
Company virtual happy hours, free food, and fun team-building activities
Monthly cell phone stipend
Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents

Fulltime

Site Reliability Engineer Staff

Site Reliability Engineer Staff. This role has been designed as 'Hybrid' with an...

Location

United States, San Juan

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Minimum of 4 years of hands-on experience in Infra Ops, Dev Ops, or Site Reliability Engineering (SRE)
Proficiency with Linux systems, especially Debian-based distributions
Strong experience with cloud platforms such as AWS and GCP
Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
Solid programming skills in Python and/or Golang
Deep understanding of containerization (Docker, Container) and orchestration tools (AWS EKS, GCP GKE)
Experience with GitOps workflows
Proven track record in implementing and maintaining CI/CD pipelines
Strong background in security and familiarity with security programs
Experience with monitoring and logging tools (Prometheus, Grafana, ELK)

Job Responsibility

Enhance Infrastructure as Code (IaC) and enforce best practices
Optimize cloud infrastructure for scalability, security, and cost-effectiveness
Develop internal tools to support and streamline cloud platform operations
Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
Address container image vulnerabilities and standardize remediation processes
Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
Troubleshoot complex production issues to ensure system reliability and customer satisfaction
Fine-tune distributed systems such as Apache Kafka and Cassandra
Collaborate with development, security, and operations teams to align infrastructure with application needs

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Fulltime

Staff Site Reliability Engineer

Fivetran is building data pipelines to power the modern data stack for thousands...

Location

United States, Oakland

Salary:

196033.00 - 245041.50 USD / Year

Fivetran

Expiration Date

Until further notice

Requirements

Expertise in managed Kubernetes (EKS, AKS, and GKE)
Expertise in Cloud Platforms and related tooling: AWS, Azure, GCP, Terraform, Ansible, Buildkite, Pulumi, and ArgoCD
Expertise in Python/Shell scripting
Expertise with Linux operating systems, internals, and administration
Expertise with cloud networking such as VPNs, Privatelinks, and Private Service Connect (GCP)
Experience with databases such as PostgreSQL

Job Responsibility

Responsible for ongoing reliability and robustness of Fivetran's production infrastructure by monitoring availability, capacity, and throughput
Evolve systems by adding reliability into our product roadmap
Coordinate the re-prioritization or fixing of critical bugs for support or sales requirements as needed
Recommend improvements to production infrastructure, working with engineering to ensure 100% availability
Ensure scalable artifact deployment to all environments through automation scripts
Constantly monitor infrastructure vulnerabilities and remedy them by working with the security team

What we offer

100% employer-paid medical insurance
Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
RSU stock grants
Professional development and training opportunities
Company virtual happy hours, free food, and fun team-building activities
Monthly cell phone stipend
Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents

Fulltime

Staff Site Reliability Engineer

Fivetran is looking for a high-performance engineer to join a team of Site Relia...

Location

Serbia, Novi Sad

Salary:

Not provided

Fivetran

Expiration Date

Until further notice

Requirements

7+ years of experience working with SaaS platforms at scale
Expertise in managed Kubernetes (EKS, AKS, and GKE)
Knowledge of Cloud Platforms and related tooling: AWS, Azure, GCP, Terraform, Ansible, Buildkite, Pulumi, and ArgoCD
Experience in Python, Shell scripting, and Go
Experience with Linux operating systems, internals, and administration
Experience with cloud networking like Managed NAT Gateways, VPNs, Privatelinks, and Private Service Connect (GCP)
Experience with databases such as PostgreSQL

Job Responsibility

Responsible for the ongoing reliability and robustness of Fivetran’s production infrastructure by monitoring availability, capacity, and throughput
Collaborate with engineering teams to integrate reliability best practices into the product roadmap
Support the prioritization and resolution of critical bugs identified by support or sales
Contribute to maintaining the high reliability and availability of production infrastructure by collaborating with engineering to implement automation for scalable deployments
Ensure scalable artifact deployment to all environments through automation scripts
Proactively monitor infrastructure vulnerabilities and collaborate with the security team to promptly address them

What we offer

100% employer-paid medical insurance
Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
RSU stock grants
Professional development and training opportunities
Company virtual happy hours, free food, and fun team-building activities
Monthly cell phone stipend
Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents

Senior Staff Site Reliability Engineer

As a Site Reliability Engineer on the SASE Platform team, you will play a critic...

Location

Israel, Tel Aviv

Salary:

Not provided

Palo Alto Networks Italia

Expiration Date

Until further notice

Requirements

5+ years of experience working with Unix/Linux systems, including shell, tools, networking, and kernel concepts
2+ years of hands-on experience with microservices architectures running on Kubernetes and container platforms
Proven experience operating workloads in public cloud environments (e.g., AWS, GCP, Azure) at scale
Proficiency in building automation and tools in at least one scripting or programming language (e.g., Python, Go, Java)
Strong experience with Infrastructure as Code (IaC) tools such as Terraform or Ansible
Bachelor’s degree in Engineering, Computer Science, or a related technical field, or equivalent practical experience

Job Responsibility

Proactively collaborate with development teams to embed reliability, scalability, and operability into services from the earliest design stages
Design, review, and evolve cloud-native architectures to improve availability, performance, cost efficiency, and fault tolerance
Build and operate automation for provisioning, deploying, and managing global infrastructure using Infrastructure as Code (IaC)
Improve CI/CD pipelines and release processes to enable safe, fast, and repeatable deployments
Drive observability best practices, including metrics, logs, traces, and SLIs/SLOs to enable data-driven incident analysis
Participate in on-call rotations, reducing mean time to resolution (MTTR) through automation and proactive reliability improvements
Challenge existing processes by championing reliability, security, and operational maturity across the organization

Fulltime

Member of Technical Staff, Site Reliability Engineer (HPC)

As Microsoft continues to push the boundaries of AI, we are on the lookout for p...

Location

United States, Mountain View

Salary:

139900.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
OR equivalent experience
Strong proficiency in Kubernetes, Docker, and container orchestration
Knowledge of CI/CD pipelines for Inference and ML model deployment
Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
Expertise in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
Strong programming/scripting skills in Python, Go, or Bash
Solid knowledge of distributed systems, networking, and storage
Experience running large-scale GPU clusters for ML/AI workloads (preferred)

Job Responsibility

Reliability & Availability: Ensure uptime, resiliency, and fault tolerance of HPC clusters powering MAI model training and inference
Observability: Design and maintain monitoring, alerting, and logging systems to provide real-time visibility into all aspects of HPC systems including GPU, clusters, storage and networking
Automation & Tooling: Build automation for deployments, incident response, scaling, and failover in CPU+GPU environments
Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows

What we offer

Competitive compensation, equity options, and comprehensive benefits

Fulltime

Staff Site Reliability Engineer, Storage

At Crusoe Energy Systems, our SRE team plays a mission-critical role in maintain...

Location

United States, San Francisco, Sunnyvale

Salary:

204000.00 - 247000.00 USD / Year

Crusoe

Expiration Date

Until further notice

Requirements

8+ years of professional experience in Storage SRE, systems engineering, storage engineering, or similar roles
Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and deep understanding of object, block, and file storage paradigms
Proficiency in a programming language such as Go, Python, Java, or C
Experience with Infrastructure as Code and deployment tooling such as Terraform, Ansible, or Puppet
Deep knowledge of Linux internals with a focus on I/O subsystems, memory management, and storage scheduling
Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF
Strong experience working with containerized workloads and orchestration platforms (e.g., Kubernetes, Docker)
Excellent incident response, troubleshooting, and documentation practices
Experience with building and operating managed services at scale such as object, file and block storage (AWS, GCP, Azure)
Excellent communication skills

Job Responsibility

Build automation and self-healing tools to monitor and maintain Crusoe’s distributed cloud storage infrastructure, which includes block, file, and object storage systems
Drive reliability initiatives focused on data replication, encryption, backup and restore strategies, and robust failover mechanisms
Help implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters
Support user-facing storage services with a focus on availability, performance tuning, and adherence to error budgets
Investigate and resolve storage-related incidents using deep telemetry, logs, and performance profiling
Partner with hardware and kernel teams to diagnose low-level I/O issues and optimize I/O paths, cache policies, and file systems
Contribute to the architecture of fault-tolerant, scalable storage backends tailored for AI-first cloud environments

What we offer

Restricted Stock Units in a fast growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement

Fulltime
