Site Reliability Engineer

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...

Location

United States, Reston

Salary:

Not provided

Tier4 Group

Expiration Date

Until further notice

Requirements

5+ years of hands-on experience operating and managing Kubernetes and OpenShift clusters
Strong experience with Microsoft Azure (compute, networking, storage, and data services)
Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
Proficiency with observability tooling (Datadog, Prometheus, Grafana)
Scripting/coding ability in Bash, Python, or Go

Job Responsibility

Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks
Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement

Full-time

Big Data/Data Platform Site Reliability Engineer

About PulsePoint: PulsePoint is a fast-growing healthcare technology company (wi...

Location

United Kingdom

Salary:

Not provided

PulsePoint

Expiration Date

Until further notice

Requirements

Strong hands-on experience operating large-scale Linux infrastructure in production (Rocky Linux or equivalent)
Deep practical knowledge of Apache Hadoop-based data platforms, including HDFS architecture and failure modes, Kerberos-based security models, and the operational lifecycle (upgrades, scaling, recovery)
Experience running Apache Kafka clusters in production, including KRaft-based setups
Proven ability to debug complex distributed system issues across storage, compute, and networking layers
Experience designing or improving automation, deployment, or GitOps-style workflows
Proficiency in scripting or automation (Python, Shell, etc.)
Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, basic network security concepts)
Comfortable taking technical ownership, driving reliability improvements, and participating in on-call / incident processes
Willing and able to work East Coast U.S. hours (9am–6pm EST)

Job Responsibility

Deploying, configuring, monitoring and maintaining multiple big data stores across multiple datacenters, with a strong focus on reliability, scalability, and operational excellence
Planning, configuring, deploying, and managing the lifecycle of critical data infrastructure
Managing large-scale Linux infrastructure to ensure maximum uptime and predictable performance
Developing and documenting system configuration standards, operational procedures, and best practices
Performance and reliability testing, including reviewing configuration, software choices, versions, and hardware specifications
Participating in incident response, root cause analysis, and driving long-term reliability improvements
Advancing our technology stack with innovative ideas and pragmatic solutions

Data Site Reliability Engineer

We are hiring Data Site Reliability Engineers (Data SREs) to join our Global Dat...

Location

China, Shanghai

Salary:

Not provided

Optiver

Expiration Date

Until further notice

Requirements

Experience operating or supporting production data systems, such as data pipelines, ingestion frameworks, or analytical platforms
Strong proficiency in Python, with experience using libraries like Pandas, Arrow, and Spark
Solid understanding of data modeling, normalization, and API development in support of large-scale analytical or trading systems
Experience working with lakehouse architectures (e.g., Delta Lake, Databricks, AWS) to manage large-scale, high-quality analytical datasets is preferred
An operations mindset with a strong sense of ownership, reliability, and continuous improvement
Ability to operate with a high degree of autonomy, making sound engineering and operational decisions for production systems
Strong communication skills and the ability to work effectively across teams and time zones
Experience with external data ingestion is a plus
Exposure to PCAP-based data workflows or market data capture pipelines is a plus
Willingness to contribute to setting best practices and mentoring engineers on operational excellence and production reliability

Job Responsibility

Configure, launch, and maintain robust data pipelines (ETL/ELT) for ingesting and transforming datasets critical to research and trading
Own the end-to-end reliability of critical production data pipelines, from ingestion through downstream consumption
Ensure data quality and consistency through validation, monitoring, and robust engineering practices
Act as the first point of contact for data incidents, investigating failures, data quality issues, and pipeline regressions, and driving them through to resolution
Participate in incident response, root cause analysis, and post-incident reviews, with a focus on preventing recurrence
Manage daily releases, backfills, and ad-hoc data runs, with a strong focus on safety, traceability, and environment segregation
Design and improve monitoring, alerting, data quality checks, and operational runbooks, ensuring issues are detected early and alerts are actionable
Build automation and tooling to reduce manual operational work and enable the platform to scale safely
Partner closely with data engineering, platform, and trading-facing teams to ensure data systems are reliable, well-understood, and fit for purpose

What we offer

A performance-based bonus structure unmatched anywhere in the industry
The chance to work alongside diverse and intelligent peers in a rewarding environment
Training, mentorship and personal development opportunities
Daily breakfast, lunch and snacks
Gym membership, sports and leisure activities, plus weekly in-house chair massages
Regular social events, clubs and Friday afternoon drinks

Senior Data Engineer - Data Platform

We are looking for a Senior Data Engineer - Data Platform to join our Data & AI ...

Location

France, Paris

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

More than 7 years of experience as a Site Reliability Engineer, Data Ops Engineer, Data Platform Engineer, or in a similar role, with a proven track record of building and maintaining complex data infrastructures
Strong proficiency in data engineering and infrastructure tools and technologies, such as stream and events processing (Kafka, PubSub, Firehose) and Kubernetes
Expertise in programming languages like Python
Familiarity with cloud infrastructure and services, preferably AWS, Azure, or GCP, and experience with infrastructure-as-code tools such as Terraform
Excellent problem-solving skills with a focus on identifying and resolving data infrastructure bottlenecks and performance issues

Job Responsibility

Design and implement a scalable and reliable data infrastructure that supports the collection, processing, storage, and analysis of large-scale datasets while pushing security and privacy best practices
Build and maintain data pipelines that efficiently extract, transform, and load data from various sources into our data warehouse
Implement automation and orchestration tools to streamline infrastructure provisioning and data workflows, reduce manual effort, and improve operational efficiency
Monitor data platform for performance and reliability, identify and troubleshoot issues, and implement proactive solutions to ensure data quality and availability
Streamline and monitor platform costs, identify optimizations and saving opportunities while collaborating with data engineers, data scientists, and other stakeholders

What we offer

Free comprehensive health insurance for you and your children
Parent Care Program: receive one additional month of leave on top of the legal parental leave
Free mental health and coaching services through our partner Moka.care
For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
Work from EU countries and the UK for up to 10 days per year, thanks to our flexibility days policy
Up to 14 days of RTT
A subsidy from the work council to refund part of the membership to a sport club or a creative class
Lunch voucher with Swile card

Full-time

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Full-time

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...

Location

Mexico, Guadalajara

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Understand the Microsoft Azure cloud, ideally with an Azure Fundamentals certification or a Computer Science/Information Systems Management degree
Be familiar with PaaS and IaaS: VMs, Storage, EventHub, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), CosmosDB, SQL Server, IoT Hub, Databricks, KeyVault, Datalake
Understand Internet of Things (IoT) concepts: telemetry, ingestion, processing, data storage, reporting
Understand the concepts behind tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, and Ansible
Understand the concept of container orchestration platforms (e.g. Kubernetes)
Understand scripting concepts (PowerShell, Python)
Understand the difference between NoSQL and SQL databases, and how to maintain them

Job Responsibility

Perform L1.5 activities such as monitoring, deployment, rollback
Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
Troubleshoot Azure resources and escalate to Level 3 (Software Development Team)

Full-time

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...

Location

United States

Salary:

116633.00 - 181243.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements

Job Responsibility

Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
Partner with engineering team members to embed reliability best practices early in the development lifecycle
Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
Reduce operational toil by identifying repetitive work and implementing automation-first solutions

Full-time

Site Reliability Engineer

Shape the Future of Intelligent Operations as a Site Reliability Engineer (AI Op...

Location

India, Chennai

Salary:

Not provided

Trimble Inc.

Expiration Date

Until further notice

Requirements

1 to 2 years of professional experience in a DevOps, MLOps, or systems engineering environment
Bachelor's degree in Computer Science, Engineering, Information Technology, or a closely related technical field
Direct experience with the Microsoft Azure cloud platform and its specialized ecosystem services (such as Azure ML and Azure DevOps)
Proficiency with Python or other scripting languages (Shell / Bash / PowerShell) for rapid system integration and task automation
Foundational understanding of containerization (Docker), basic orchestration concepts (Kubernetes fundamentals), and version control system workflows (Git)
Solid baseline knowledge of fundamental DevOps principles (CI/CD, system administration) and a basic understanding of the end-to-end machine learning model lifecycle

Job Responsibility

Assist in the deployment and maintenance of machine learning models in production under direct supervision, building skills in containerization and orchestration architectures
Support the development of robust continuous integration and deployment pipelines for ML workflows, including model versioning, automated testing, and release processes
Monitor production ML model performance, detect data drift, and track system health by implementing foundational logging, alerting, and metrics solutions
Contribute to infrastructure automation and configuration management for machine learning workloads, learning to treat infrastructure as software
Partner closely with ML engineers and data scientists to operationalize complex models, ensuring reliability, scale, and strict adherence to established operational patterns

What we offer

Structured environment to accelerate technical skills
Direct guidance from experienced engineering professionals
Projects that improve productivity, quality, safety, transparency and sustainability
Collaborative and supportive team
Entrepreneurial spirit empowering proactive doers
Flexible work arrangements

Full-time
