Site Reliability Engineer

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...

Location

United States, Reston

Salary:

Not provided

Tier4 Group

Expiration Date

Until further notice

Requirements

5+ years hands-on operating and managing Kubernetes and OpenShift clusters
Strong experience with Microsoft Azure (compute, networking, storage, and data services)
Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
Proficiency with observability tooling (Datadog, Prometheus, Grafana)
Scripting/coding ability in Bash, Python, or Go

Job Responsibility

Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks
Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement

Fulltime

Senior Site Reliability Engineer Cloud Platform

Zilliz is a fast-growing startup developing the industry’s leading vector databa...

Location

Salary:

175000.00 - 225000.00 USD / Year

Zilliz

Expiration Date

Until further notice

Requirements

4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems
Proficiency in scripting languages such as Python, Go, or Java
Strong knowledge of container orchestration technologies like Kubernetes and Docker
Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools
Experience with infrastructure as code tools such as Terraform or Ansible
Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
Proven ability to troubleshoot complex distributed systems and resolve issues promptly
Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines
Ability to thrive in a fast-paced, startup environment and handle multiple projects simultaneously

Job Responsibility

Work at the intersection of development and site reliability, creating SRE tools and systems and supporting existing infrastructure and platforms
Ensure the reliability, availability, and performance of Zilliz’s distributed database systems
Develop and implement strategies for monitoring, incident management, and disaster recovery
Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention
Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness
Collaborate with software engineers to enhance system reliability, scalability, and performance
Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes
Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency

Fulltime

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...

Location

Ireland, Dublin

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
Proficiency in Python, Go, or Java, with strong code review and readability standards
Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
Ability to think and act under pressure
Strong communication skills

Job Responsibility

Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling

Fulltime

We are currently seeking a Site Reliability Engineer to join our team in Guadala...

Location

Mexico, Guadalajara

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Perform L1.5 activities such as monitoring, deployment, rollback
Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
Familiar with PaaS and IaaS - VMs, Storage, Event Hubs, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), Cosmos DB, SQL Server, IoT Hub, Databricks, Key Vault, Data Lake
Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
Understand the concepts behind tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
Understand the concept of container orchestration platforms (e.g. Kubernetes)
Understand the concept of scripting: PowerShell, Python
Understand the difference between NoSQL and SQL databases, and how to maintain them

Job Responsibility

Perform L1.5 activities such as monitoring, deployment, rollback
Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)

Fulltime

Site Reliability Engineer

Location

South Africa, Johannesburg

Salary:

Not provided

Nintex

Expiration Date

Until further notice

Requirements

You provide guidance on infrastructure architecture and contribute to high-quality and successful product releases.
You contribute to your team and domain through successfully leading and consistently delivering on projects of ambiguous scope, high complexity, and critical business impact.
You contribute to relevant guilds, practice forums and other initiatives to improve Nintex’s DevOps and SRE discipline.
You have an in-depth understanding of distributed systems architecture, as well as monitoring and observability practices and tools.
You quickly resolve priority infrastructure issues and help other technical team members or Product Managers understand how to avoid them in the future.
You provide detailed estimates for work items you propose or are assigned.
You assist in decision-making around tooling, automation practices, and testing solutions.
You stay up-to-date with technology trends and use this knowledge to help your team and the broader Engineering practice.
You run Nintex infrastructure with IaC tools (such as Terraform) and GitHub Actions for automation, containerize our environments (Kubernetes) and leverage cloud technologies to meet our goals.
You build monitoring that alerts on symptoms rather than outages using tools like Prometheus, Grafana, Alertmanager and PagerDuty.

Job Responsibility

You are highly skilled and sufficiently experienced in Nintex DevOps tools and processes to own a long-term program or technology such as Kubernetes.
You write scripts, tools and utilities that support and integrate with delivery pipelines and you integrate telemetry where appropriate.
You are called into incidents and bring trusted knowledge in your platform domain.
You debug and fix infrastructure issues on production environments quickly using the relevant tools and guidelines to prevent recurrence.
You build, promote and support infrastructure patterns and practices within Nintex.
You provide coaching/mentoring to other Engineers on the team.
You lead or contribute to post-mortems for incidents, including root cause analysis and identification of preventative and remedial actions.
You continuously monitor our platform performance and take immediate action to improve it.
You review and advise on appropriate design patterns to solve automation and infrastructure problems without creating technical debt.
You design and build complex infrastructure components for distributed systems such as Kubernetes.

What we offer

Global Gratitude and Recharge Days
Flexible, paid time off policy
Employee wellness programs and counseling resources
Meaningful peer recognition and awards
Paid parental leave
Invention/patenting assistance
Community impact, paid volunteer time, and opportunities
Intercultural learning and celebration
Multiple tools through which to learn and grow, and an incredible global community

Site Reliability Engineer

As a Staff Software Engineer, you will play a key role in designing, building, a...

Location

United States, San Jose

Salary:

120500.00 - 243000.00 USD / Year

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

Minimum of 5 years of hands-on experience in Infra Ops, Dev Ops, or Site Reliability Engineering (SRE)
Proficiency with Linux systems, especially Debian-based distributions
Strong experience with cloud platforms such as AWS and GCP
Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
Solid programming skills in Python and/or Golang
Deep understanding of containerization (Docker, containerd) and orchestration tools (AWS EKS, GCP GKE)
Experience with GitOps workflows
Proven track record in implementing and maintaining CI/CD pipelines
Strong background in security and familiarity with security programs
Experience with monitoring and logging tools (Prometheus, Grafana, ELK)

Job Responsibility

Enhance Infrastructure as Code (IaC) and enforce best practices
Optimize cloud infrastructure for scalability, security, and cost-effectiveness
Develop internal tools to support and streamline cloud platform operations
Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
Address container image vulnerabilities and standardize remediation processes
Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
Troubleshoot complex production issues to ensure system reliability and customer satisfaction
Fine-tune distributed systems such as Apache Kafka and Cassandra
Collaborate with development, security, and operations teams to align infrastructure with application needs

What we offer

Health & Wellbeing
Personal & Professional Development
Unconditional Inclusion

Fulltime

Trade Floor Site Reliability Engineer

Join us at Barclays as a Trade Floor Site Reliability Engineer, providing real‑t...

Location

United Kingdom, London

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

Experience in systems engineering, including Linux and Windows, networking, Kubernetes and cloud infrastructure
Proficiency in automation tools
Proficiency in implementing monitoring, alerting and observability for critical trading platforms
The ability to manage incidents effectively, troubleshoot issues swiftly, proactively communicate and perform root cause analysis to prevent future incidents
Prior experience supporting Credit or other IB asset classes such as Rates, Equities, or FX
Experience working with PaaS products, including some experience with virtualization, containerization, or orchestration of compute/network/storage

Job Responsibility

Providing real‑time support to Credit EMEA traders and sales teams to keep critical trading platforms stable and performant
Ensuring seamless client service as electronic and algo trading rapidly expand
Provision of technical support for the service management function to resolve more complex issues
Execution of preventative maintenance tasks on hardware and software and utilisation of monitoring tools/metrics
Maintenance of a knowledge base containing detailed documentation
Analysis of system logs, error messages and user reports to identify root causes
Automation, monitoring enhancements, capacity management, resiliency, business continuity management, front office specific support and stakeholder management
Identification and remediation of potential service impacting risks and issues
Proactive assessment of support activities, implementing automations where appropriate

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime

Site Reliability Engineer

Qargo is a cloud-based (SaaS) Transport Management Platform. We are a scale-up b...

Location

Belgium, Ghent

Salary:

Not provided

Qargo

Expiration Date

Until further notice

Requirements

Experience as a Software Engineer, with an interest in infrastructure, scalability, reliability
Strong programming skills (preferably Python or similar backend languages)
Experience working with cloud platforms, container orchestrators, serverless (preferably Google Cloud)
Familiarity with distributed systems and scalability challenges
Experience with CI/CD pipelines and automation
Solid understanding of databases and performance tuning (SQL and/or NoSQL)
Familiarity with monitoring and observability tools
A problem-solving mindset and the ability to think in systems
Strong collaboration skills and a proactive approach to improving systems

Job Responsibility

Build and maintain systems and tooling that improve the reliability, scalability, and performance of our platform
Improve the software delivery cycle, focusing on automation and developer experience
Develop internal tools and services to reduce manual operational work
Improve observability by implementing monitoring, logging, and alerting across systems
Optimize system performance, including databases such as PostgreSQL and Firestore
Collaborate with backend engineers and other engineering teams to design reliable and scalable system architectures
Troubleshoot complex production issues and implement long-term fixes
Continuously improve infrastructure (Infrastructure as Code, automation, etc.)

What we offer

A fast-growing SaaS company with a strong mission and an impact-driven team
A flexible work environment with flexible hours and hybrid working
A green office with a great atmosphere and lots of initiatives
A role with a lot of responsibility, ownership, and tangible impact
The opportunity to grow with us and shape both your career and our platform

Fulltime
