Observability Infrastructure Engineer Job at Adyen (Amsterdam)

Senior Observability Infrastructure Engineer

We are looking for an experienced Observability Infrastructure Engineer to join ...

Location

Netherlands, Amsterdam

Salary:

Not provided

Adyen

Expiration Date

Until further notice

Requirements

10+ years of experience in the observability domain or in a relevant platform/infrastructure domain.
Observability Stack Expertise: You have hands-on experience operating core telemetry data stores at scale, e.g. Elasticsearch/OpenSearch/VictoriaLogs/ClickHouse for logging, Prometheus/VictoriaMetrics for metrics, and Grafana Tempo for distributed tracing.
Linux Experience: You understand the operating system at a kernel level and can debug complex networking, file system, and performance issues on both bare-metal and virtualized hardware.
Production Kubernetes Experience: Proven hands-on experience operating and troubleshooting production workloads on Kubernetes (on-prem and/or cloud), including strong day-to-day use of kubectl and Kubernetes primitives (e.g. Namespaces, Pods, Deployments/StatefulSets, Services, Ingress, ConfigMaps/Secrets).
Software Engineering Mindset: You are proficient in Go or Python and do not just write scripts; you build tools and automation platforms that treat infrastructure as code.

Job Responsibility

Build the next generation of our platform: Design and implement the future architecture of our logging and metrics systems.
Own infrastructure operations: You will take full ownership of our hybrid infrastructure, managing the lifecycle of over 1,500 servers across both bare-metal and Kubernetes environments.
Automate to reduce toil: You will write code in Go or Python to eliminate manual operational tasks.
Optimize for scale and performance: You will dive deep into performance bottlenecks within our distributed tracing and logging pipelines.
Reliability and Engineering: You will participate in on-call rotations, but your primary focus will be engineering solutions that stop alerts from firing in the first place.

Fulltime

Senior Infrastructure Engineer / Observability Specialist

Location: Remote - Anywhere in Australia (Will be required to travel to Canberra...

Location

Australia, Sydney

Salary:

Not provided

FinXL

Expiration Date

Until further notice

Requirements

Must be an Australian citizen and be able to obtain a Baseline security clearance
Cloud Expertise: Proficiency in AWS, Azure, or Google Cloud platforms
Observability Concepts: Deep understanding of metrics, logs, and traces, including the design of alerting systems
Automation: Experience in scripting with Python, Bash, or PowerShell
Containerisation: Knowledge of Kubernetes and Docker
Soft Skills: Strong negotiation and communication skills to assist with project planning and problem resolution

Job Responsibility

Configure and support observability tools including Dynatrace, Amazon CloudWatch, AWS CloudTrail, AWS Config, and Azure Monitor
Take ownership of observability monitoring policies, standards, and documentation
Perform fault diagnosis and root cause analysis with timely remedial action
Drive change and uplift IT teams through education and "evangelising" monitoring concepts
Provide support for AWS S3, cloud backups, and AWS RDS databases as needed
Lead incident response through to conclusion and manage assigned service queues

Senior Software Engineer - Cloud Infrastructure & Observability

Location

India, Bengaluru

Salary:

Not provided

Roku

Expiration Date

Until further notice

Requirements

15+ years in software engineering with a track record of architecting distributed systems or platforms at scale
Strong hands‑on experience in Golang and one scripting language (e.g., Python or Shell)
Experience operating observability at PB-scale ingestion and hundreds of millions of series
Expertise in observability platforms and tooling (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenMetrics)
Deep experience building systems of scale and operating cloud infrastructure with Kubernetes
Strong proficiency with service mesh technologies (Istio/Envoy) and infrastructure-as-code (Terraform), and experience in multi-cloud environments (AWS, GCP)
Demonstrated ability to evolve storage and query architectures for cost, scale, and latency (e.g., TSDB, Parquet, distributed processing)
Proven experience integrating security as part of infrastructure and platform development
Exceptional cross‑functional communication
Effective collaboration with both technical and non-technical stakeholders

Job Responsibility

Architect and lead Roku’s observability platform across metrics, logs, and traces
Evolve data pipelines and storage layers optimized for high throughput, performance, and cost at Roku scale (TSDBs, Parquet, distributed processing)
Extend and harden open-source observability systems
Overhaul core components (e.g., storage layers, query paths) to improve performance, reliability, and usability at scale
Implement features such as pre-aggregation, down-sampling, and sampling to reduce load and accelerate queries across the platform
Collaborate across platform, SRE, and product teams to migrate hundreds of workloads to our common platform
Augment and automate CI/CD flows and onboarding
Integrate security into infrastructure and platform services
Ensure robust multi-tenant, multi-cluster, and multi-cloud designs
Contribute improvements back to open source and CNCF‑aligned projects

What we offer

Global access to mental health and financial wellness support and resources
Healthcare (medical, dental, and vision)
Life, accident, disability, commuter, and retirement options (401(k)/pension)
Time off in accordance with local leave policies

Fulltime

Staff Observability Data Infrastructure Engineer

CVS Health is seeking a highly skilled Observability Data Infrastructure Enginee...

Location

United States, Work at Home, Maryland

Salary:

130295.00 - 260590.00 USD / Year

CVS Health

Expiration Date

June 30, 2026

Requirements

7+ years of experience building and operating log, metric, and trace pipelines in Data, Security Data, or Observability Engineering roles
5+ years of hands-on experience with Databricks, Apache Spark, or other large-scale distributed data platforms
5+ years of experience working across cloud platforms (AWS, Azure, or GCP), including storage, compute, and event-driven services
5+ years of production experience using SQL and Python in data-intensive environments
3+ years of experience with enterprise observability platforms (Splunk, Datadog, Elastic, or equivalent)
3+ years of experience with high-throughput ingestion and streaming technologies such as Cribl, Vector, or Kafka
3+ years of experience designing telemetry systems aligned to OpenTelemetry (OTEL) or similar standards
Bachelor's degree from an accredited university or equivalent work experience (HS diploma + 4 years relevant experience)

Job Responsibility

Design, build, and operate high-volume log, metric, and trace pipelines using Databricks, cloud data lakes, and distributed processing engines
Architect and evolve an Observability Lakehouse aligned with OpenTelemetry (OTEL) data models and standards
Implement ingestion and transformation workflows using technologies such as Cribl, Vector, Jenkins, GitHub Actions, or equivalent tools
Normalize, model, and enrich telemetry data to support detection engineering, forensics, and operational analytics
Develop scalable ETL/ELT frameworks, Delta Lake architectures, and automated data quality validation for unstructured and semi-structured data
Partner with Security Engineering, SRE, Cloud, and SOC teams to improve enterprise visibility and detection accuracy
Build and maintain CI/CD pipelines and reusable Infrastructure-as-Code (IaC) patterns for observability platform deployment
Identify and resolve performance, latency, cost, and reliability issues across telemetry pipelines
Contribute to engineering standards, documentation, and knowledge sharing across observability and security platforms

What we offer

Medical, dental, and vision coverage
Paid time off
Retirement savings options
Wellness programs
Bonus, commission or short-term incentive program
Equity award program

Fulltime

Senior Software Engineer - Cloud Infrastructure & Observability

We are building a next-generation observability and cloud platform that is high-...

Location

United Kingdom, Cambridge

Salary:

Not provided

Roku

Expiration Date

Until further notice

Requirements

Extensive experience with software engineering with a track record of architecting distributed systems or platforms at scale
Strong hands-on experience in Golang and one scripting language (e.g., Python or Shell)
Experience operating observability at PB-scale ingestion and hundreds of millions of series
Expertise in observability platforms and tooling (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenMetrics)
Deep experience building systems of scale and operating cloud infrastructure with Kubernetes
Strong proficiency with service mesh technologies (Istio/Envoy) and infrastructure-as-code (Terraform), and experience in multi-cloud environments (AWS, GCP)
Demonstrated ability to evolve storage and query architectures for cost, scale, and latency (e.g., TSDB, Parquet, distributed processing)
Proven experience integrating security as part of infrastructure and platform development
Exceptional cross-functional communication
Effective collaboration with both technical and non-technical stakeholders

Job Responsibility

Architect and lead Roku’s observability platform across metrics, logs, and traces
Evolve data pipelines and storage layers optimized for high throughput, performance, and cost at Roku scale (TSDBs, Parquet, distributed processing)
Extend and harden open-source observability systems
Overhaul core components (e.g., storage layers, query paths) to improve performance, reliability, and usability at scale
Implement features such as pre-aggregation, down-sampling, and sampling to reduce load and accelerate queries across the platform
Collaborate across platform, SRE, and product teams to migrate hundreds of workloads to our common platform
Augment and automate CI/CD flows and onboarding
Integrate security into infrastructure and platform services
Ensure robust multi-tenant, multi-cluster, and multi-cloud designs
Contribute improvements back to open source and CNCF-aligned projects

What we offer

Global access to mental health and financial wellness support and resources
Healthcare (medical, dental, and vision)
Life, accident, disability, commuter, and retirement options (401(k)/pension)
Time off work for vacation and other personal reasons

Fulltime

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...

Location

Egypt, Giza

Salary:

Not provided

Rackspace

Expiration Date

Until further notice

Requirements

Bachelor’s degree in engineering/computer science or equivalent
Senior-level experience with Site Reliability Engineering; DevOps; code-level application support and troubleshooting; AWS infrastructure design, implementation, and optimization; and automation for deployment, scaling, and reliability
Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
Proactive approach to identifying problems and solutions
Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
Experience with Terraform or CloudFormation scripting
Experience with configuration management tools like Ansible, Chef or Puppet
Experience with standard software development best practices and tools such as code repositories (Git preferred)
Experience executing in an agile software development environment

Job Responsibility

Work with customers and implement Observability solutions
Build and maintain scalable systems and robust automation that supports engineering goals
Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
Collaborate with team members to document and share solutions
Maintain a deep understanding of the customer’s business as well as their technical environment
Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues

Fulltime

AI Infrastructure Engineer, Core Infrastructure

As a Software Engineer on the ML Infrastructure team, you will design and build ...

Location

United States, San Francisco; Seattle; New York

Salary:

179400.00 - 310500.00 USD / Year

Scale

Expiration Date

Until further notice

Requirements

4+ years of experience building large-scale backend or distributed systems
Strong programming skills in Python, Go, or Rust, and familiarity with modern cloud-native architecture
Experience with containers and orchestration tools (Kubernetes, Docker) and Infrastructure as Code (Terraform)
Familiarity with schedulers or workload management systems (e.g., Kubernetes controllers, Slurm, Ray, internal job queues)
Understanding of observability and reliability practices (metrics, tracing, alerting, SLOs)
A track record of improving system efficiency, reliability, or developer velocity in production environments

Job Responsibility

Design and maintain fault-tolerant, cost-efficient systems that manage compute allocation, scheduling, and autoscaling across clusters and clouds
Build common abstractions and APIs that unify job submission, telemetry, and observability across serving and training workloads
Develop systems for usage metering, cost attribution, and quota management, enabling transparency and control over compute budgets
Improve reliability and efficiency of large-scale GPU workloads through better scheduling, bin-packing, preemption, and resource sharing
Partner with ML engineers and API teams to identify bottlenecks and define long-term architectural standards
Lead projects end-to-end — from requirements gathering and design to rollout and monitoring — in a cross-functional environment

What we offer

Comprehensive health, dental and vision coverage
Retirement benefits
A learning and development stipend
Generous PTO
Equity-based compensation

Fulltime

Research Engineer / Software Engineer (platform/core infrastructure)

Build the future of offensive security with XBOW. Attackers are already using AI...

Location

United States

Salary:

150000.00 - 350000.00 USD / Year

XBOW

Expiration Date

Until further notice

Requirements

Strong experience building and operating scalable, distributed systems on cloud infrastructure such as AWS or similar
Comfortable working with infrastructure as code (e.g., Terraform, CDK)
A track record of performance tuning across cloud services, databases, and compute layers
Eager to learn new tools, languages, and technologies as needed
A thoughtful communicator who values clarity and simplicity and is comfortable working in a fast-paced startup and navigating ambiguity
Strong problem-solving skills and the ability to work with incomplete information
Curious, practical, and eager to work across layers of the stack when needed
You think proactively about failure modes and bring experience implementing disaster recovery and business continuity plans that keep critical systems running

Job Responsibility

Design and implement infrastructure systems that scale reliably and securely, and can be deployed across multiple cloud environments (AWS, Azure, OCI, etc.) and contexts (SaaS, on-prem)
Tune and optimize cloud services across compute, storage, networking, and observability to drive performance, reliability and maintainability of core services
Develop our core services, written in TypeScript, Kotlin and Go
Support large-scale systems with event-driven architectures
Own problems end-to-end—from design through deployment to production support
Navigate ambiguity and help define how we build as much as what we build
Partner closely with other engineers, AI researchers, and security researchers to enable high-quality, high-velocity product development
Design for resilience by implementing disaster recovery and business continuity strategies that ensure uptime, even when things break
Improve how we build, deploy, and monitor services at scale

What we offer

Competitive salary and a generous equity package
Career Growth: Shape your role, lead the function, and grow with the company
Meaningful Work: You will tackle technically complex challenges and play a pivotal role in the growth of our business

Fulltime
