Site Reliability Engineer - Kubernetes - Data Platforms

Netherlands, Amsterdam · Job Posted June 15, 2026

Job Description

You will be building the rails of a self-service data platform inside Adyen, creating an ecosystem that is bigger than the sum of its parts. By blending Site Reliability Engineering, Software Engineering, Systems Engineering, and Data Engineering, you will power the many data, machine learning, and GenAI products running across Adyen.

Job Responsibility

  • Design & Build On-Premises (Kubernetes) Infrastructure
  • Cluster Provisioning & Reliability
  • Mixed Workload Balancing
  • Advanced Scheduling & Hardware Management
  • Storage & Network Optimization
  • FinOps & Security
  • Automation & Operations

Requirements

  • Experienced Platform/SRE Professional
  • Technical Expertise
  • Tooling & Ecosystems
  • Observability Mindset
  • Good to have: A background in Software Engineering, specialized networking, or GPU management. Familiarity with data ecosystem tools like Airflow and HDFS is highly appreciated.
  • Ambitious & Collaborative

Nice to have

  • Software Engineering background
  • Specialized networking
  • GPU management
  • Familiarity with Airflow and HDFS

Similar Jobs for Site Reliability Engineer - Kubernetes - Data Platforms

8 matching positions

Site Reliability Engineer - Data Platform Operation

Join our Data & AI Platform team as a Site Reliability Engineer (SRE) – Platform...
Location: Brazil, Sao Paulo
Salary: Not provided
Company: Amaris Consulting
Expiration Date: Until further notice
Requirements
  • Academic background: Bachelor’s or Master’s degree in Computer Science, Information Technology, or related field (minimum 3 years of experience)
  • Experience: 5+ years hands-on with cloud platforms (Azure, AWS, GCP), programming (Bash, PowerShell, Terraform, Python, Java), and Infrastructure as Code (IaC)
  • English language: Professional working proficiency in English and the local language
  • Tools / software: Deep expertise in Azure, Databricks, Unity Catalog, Kubernetes, Helm, Docker, Power BI, Datadog, Grafana, GitHub, Azure DevOps, ArgoCD, Airflow, SSIS, Power Query, and relational/NoSQL databases
  • AI experience: Experience supporting enterprise Data & AI platforms
  • Soft skills: analytical problem-solving; effective communication and active listening; team player with respect for others; strong troubleshooting and platform monitoring skills
  • Automation (Python, PowerShell, CLI, KQL, Terraform)
Job Responsibility
  • Support, manage, and maintain Azure resources: Azure SQL, Synapse, Data Factory, Databricks, Unity Catalog
  • Monitor Azure workloads, troubleshoot incidents, alerts, and performance bottlenecks
  • Implement and manage RBAC, identity & access policies, and compliance controls
  • Optimize Azure cost and performance using Azure Monitor, Datadog, and Cost Management tools
  • Automate tasks using PowerShell, Azure CLI, Terraform, and Python
  • Utilize Git, GitHub Actions, and Airflow for workflow automation
  • Provide L2/L3 support for data pipelines, reporting, and cloud services
  • Conduct incident response, root cause analysis (RCA), and proactive issue resolution
  • Collaborate with Cloud Engineering, Data Engineers, BI Developers, and Cloud Architects
  • Follow ITSM processes: Incident, Change, and Problem Management
What we offer
  • An international community bringing together 110+ different nationalities
  • An environment where trust has a central place: 70% of our key leaders started their careers at the first level of responsibility
  • A robust training system with our internal Academy and 250+ available modules
  • A vibrant workplace that frequently gathers for internal events (afterworks, team buildings, etc.)
  • Strong commitments to CSR, notably through participation in our WeCare Together program

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...
Location: Ireland, Dublin
Salary: Not provided
Company: General Motors
Expiration Date: Until further notice
Requirements
  • 8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
  • Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
  • Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
  • Proficiency in Python, Go, or Java, with strong code review and readability standards
  • Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
  • Ability to think and act under pressure
  • Strong communication skills
Job Responsibility
  • Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
  • Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
  • Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
  • Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
  • Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
  • Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
  • Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
  • Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling
Employment type: Full-time

Site Reliability Engineer Platform Engineer

Join a mission-driven, national financial services organization at the heart of ...
Location: United States, Reston
Salary: Not provided
Company: Tier4 Group
Expiration Date: Until further notice
Requirements
  • 5+ years hands-on operating and managing Kubernetes and OpenShift clusters
  • Strong experience with Microsoft Azure (compute, networking, storage, and data services)
  • Proven skills in automation and Infrastructure-as-Code (Terraform, Ansible, GitOps)
  • Proficiency with observability tooling (Datadog, Prometheus, Grafana)
  • Scripting/coding ability in Bash, Python, or Go
Job Responsibility
  • Operate, tune, and optimize OpenShift/Kubernetes clusters (scheduling, ingress, upgrades, quotas, policies)
  • Stand up and/or refine observability (Datadog, Prometheus, Grafana)—dashboards, alerts, SLOs, runbooks
  • Map current hybrid topology and critical delivery pipelines; identify toil and prioritize automation (Terraform/Ansible)
  • Begin supporting Azure environments (compute, networking, storage, data services) used by analytics teams
  • Drive GitOps-first workflows; harden CI/CD with ArgoCD/Jenkins/GitHub Actions and policy-as-code guardrails
  • Implement or enhance platform services (Vault, Kafka/AMQ, ingress, service mesh) for dev and data teams
  • Lead incident response and postmortems; institutionalize RCA, blameless learning, and continuous improvement
Employment type: Full-time

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location: Canada, Mississauga
Salary: 115000.00 - 128000.00 CAD / Year
Company: PointClickCare
Expiration Date: Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1!
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more!
Employment type: Full-time

Lead Site Reliability Engineer

We're building a Site Reliability Engineering center in Mexico City, and we're h...
Location: Mexico, Mexico City
Salary: Not provided
Company: Capital One
Expiration Date: Until further notice
Requirements
  • Professional English fluency
  • Bachelor's degree
  • At least 6 years of experience in SRE, production operations, or reliability engineering
  • Experience in DevOps Engineering (internship experience does not apply)
  • 5+ years of experience in at least one of the following: Java, Python, Go
  • At least 4 years of experience with Cloud Native technologies (Amazon Web Services, Microsoft Azure, Google Cloud Platform)
  • 3+ years of experience with container orchestration services including Docker or Kubernetes
  • Experience with Shell or Bash scripting
  • At least 3 years of Unix or Linux system administration experience
Job Responsibility
  • Own reliability for batch settlement systems - ensure cycle completion windows are met, data integrity is maintained, and failures are detected before they reach downstream consumers
  • Build and improve observability for settlement pipelines - dashboards, alerts, and anomaly detection that make system health legible and reduce reliance on tribal knowledge
  • Drive automation of operational toil - certificate rotation, environment provisioning, compliance artifact generation, and manual validation steps that currently require human intervention
  • Partner with UK-based settlement engineers - acquire domain expertise on Durbin compliance windows, cross-border DCI routing, and acquirer/issuer SLA adherence
  • Participate in incident management - respond to settlement failures, drive root cause analysis, and implement durable fixes that prevent recurrence
  • Contribute to regulatory readiness - ensure SRE practices produce audit-ready artifacts for SOX and PCI-DSS exams without manual toil
What we offer
  • Healthy Body, Healthy Mind
  • Save Money, Make Money
  • Time, Family and Advice
Employment type: Full-time

Staff Site Reliability Engineer - Cloud

Elevate Global Operations as our Next Cloud Site Reliability Engineer (OpenTelem...
Location: United Kingdom
Salary: Not provided
Company: Trimble Inc.
Expiration Date: Until further notice
Requirements
  • Hands-on experience with the OpenTelemetry Collector, APIs, and SDKs
  • Extensive experience with observability tools like New Relic, Datadog, or Splunk
  • Strong proficiency in Infrastructure as Code (Terraform, Ansible) and cloud platforms (AWS, GCP, or Azure)
  • Deep understanding of containerization and orchestration using Docker and Kubernetes
  • Advanced coding skills in Python, Go, or Java for building robust automation and monitoring tools
  • Experience leveraging AI coding assistants like GitHub Copilot to accelerate development
Job Responsibility
  • Lead a global "OTel First" strategy, implementing OpenTelemetry at scale across a diverse technological landscape
  • Spearhead the development of automation scripts and Infrastructure as Code using Terraform to ensure seamless, reproducible platform delivery
  • Optimize platform performance and cost-efficiency, ensuring our observability tools scale economically as our data grows
  • Collaborate with engineering teams to embed reliability and security standards into new features from the ground up
  • Drive root cause analysis and problem management to proactively prevent incidents and improve the customer experience

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
  • Familiar with PaaS and IaaS - VMs, Storage, Event Hubs, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), Cosmos DB, SQL Server, IoT Hub, Databricks, Key Vault, Data Lake
  • Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
  • Understand the concept of tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
  • Understand the concept of container orchestration platforms (e.g. Kubernetes)
  • Understand the concept of scripting with PowerShell and Python
  • Understand the difference between NoSQL and SQL databases, and how to maintain them
Job Responsibility
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
Employment type: Full-time

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...
Location: United States
Salary: 116633.00 - 181243.00 USD / Year
Company: Wikimedia Foundation
Expiration Date: Until further notice
Requirements
  • Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
  • Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
  • CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
  • Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
  • SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing e.g., Prometheus, OpenTelemetry)
  • Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
  • Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
  • Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
  • Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
  • Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements
Job Responsibility
  • Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
  • Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
  • Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
  • Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
  • Partner with engineering team members to embed reliability best practices early in the development lifecycle
  • Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
  • Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
  • Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
  • Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
  • Reduce operational toil by identifying repetitive work and implementing automation-first solutions
Employment type: Full-time