Observability Infrastructure Engineer

Netherlands, Amsterdam · Job Posted March 05, 2026

Job Description

We are looking for an experienced Observability Infrastructure Engineer to join our Platform Engineering organization. You will be part of the team responsible for building and running the observability pillars on-premises and on Kubernetes. Our systems collect, process, and store the logs, metrics, and traces that allow hundreds of product teams to monitor their services in real time. This is a role for a builder and a problem solver who enjoys deep technical troubleshooting across distributed systems and then turns recurring issues into automated, repeatable solutions. You will work in a large-scale environment where we manage petabytes of data and thousands of servers. We are currently in the middle of a major transformation focused on automating operations and enabling self-service for our users.

Job Responsibility

  • Build the next generation of our platform: Design and implement the future architecture of our logging and metrics systems. You will play a key role in redesigning our infrastructure to support new global regions, ensuring data isolation and regulatory compliance in different geographies, and more
  • Own infrastructure operations: You will take full ownership of our hybrid infrastructure, managing the lifecycle of over 1,500 servers across both bare-metal and Kubernetes environments
  • Automate to reduce toil: You will write code in Go or Python to eliminate manual operational tasks. Your goal is to build self-healing systems that do not require manual intervention during the night. You will improve our CI pipelines to ensure that changes to our clusters are safe, predictable, and automated
  • Optimize for scale and performance: You will dive deep into performance bottlenecks within our distributed tracing and logging pipelines. We deal with high-volume data streams that can overwhelm standard configurations. You will tune our Elasticsearch clusters, optimize Prometheus and VictoriaMetrics storage, and ensure our OpenTelemetry implementation can handle peak traffic without missing a beat
  • Reliability and Engineering: You will participate in on-call rotations, but your primary focus will be engineering solutions that stop alerts from firing in the first place. You will help us upgrade our stack to the latest versions and ensure our platform remains secure and performant. You will improve the self-service experience by implementing automated guardrails and quota management to prevent noisy tenants from destabilizing the platform, while designing safer API access patterns for our users

Requirements

  • 4+ years of experience in the observability domain or in a relevant platform/infrastructure domain
  • Observability Stack Expertise: hands-on experience operating core telemetry data stores at scale, e.g. Elasticsearch/OpenSearch/VictoriaLogs/ClickHouse for logging, Prometheus/VictoriaMetrics for metrics, and Grafana Tempo for distributed tracing
  • Linux Experience: you understand the operating system at a kernel level and can debug complex networking, file system, and performance issues on both bare metal and virtualized hardware
  • Production Kubernetes Experience: proven hands-on experience operating and troubleshooting production workloads on Kubernetes (on-prem and/or cloud), including strong day-to-day use of kubectl and Kubernetes primitives (e.g. Namespaces, Pods, Deployments/StatefulSets, Services, Ingress, ConfigMaps/Secrets)
  • Software Engineering Mindset: you are proficient in Go or Python and do not just write scripts; you build tools and automation platforms that treat infrastructure as code

Nice to have

  • Experience with large-scale, multi-tenant isolation and quota or cost governance approaches for telemetry platforms
  • Familiarity with regulated environments where security, auditability, and data handling requirements shape platform design decisions

Similar Jobs for Observability Infrastructure Engineer

8 matching positions

Senior Observability Infrastructure Engineer

We are looking for an experienced Observability Infrastructure Engineer to join ...
Location: Netherlands, Amsterdam
Salary: Not provided
Company: Adyen
Expiration Date: Until further notice
Requirements
  • 10+ years of experience in the observability domain or in a relevant platform/infrastructure domain.
  • Observability Stack Expertise: You have hands-on experience operating core telemetry data stores at scale, e.g. Elasticsearch/OpenSearch/VictoriaLogs/ClickHouse for logging, Prometheus/VictoriaMetrics for metrics, and Grafana Tempo for distributed tracing.
  • Linux Experience: You understand the operating system at a kernel level and can debug complex networking, file system, and performance issues on both bare metal and virtualized hardware.
  • Production Kubernetes Experience: Proven hands-on experience operating and troubleshooting production workloads on Kubernetes (on-prem and/or cloud), including strong day-to-day use of kubectl and Kubernetes primitives (e.g. Namespaces, Pods, Deployments/StatefulSets, Services, Ingress, ConfigMaps/Secrets).
  • Software Engineering Mindset: You are proficient in Go or Python and do not just write scripts; you build tools and automation platforms that treat infrastructure as code.
Job Responsibility
  • Build the next generation of our platform: Design and implement the future architecture of our logging and metrics systems.
  • Own infrastructure operations: You will take full ownership of our hybrid infrastructure, managing the lifecycle of over 1,500 servers across both bare-metal and Kubernetes environments.
  • Automate to reduce toil: You will write code in Go or Python to eliminate manual operational tasks.
  • Optimize for scale and performance: You will dive deep into performance bottlenecks within our distributed tracing and logging pipelines.
  • Reliability and Engineering: You will participate in on-call rotations, but your primary focus will be engineering solutions that stop alerts from firing in the first place.
Employment type: Full-time

Senior Infrastructure Engineer / Observability Specialist

Location: Remote - Anywhere in Australia (Will be required to travel to Canberra...
Location: Australia, Sydney
Salary: Not provided
Company: FinXL
Expiration Date: Until further notice
Requirements
  • Must be Australian Citizen and be able to obtain Baseline Security Clearance
  • Cloud Expertise: Proficiency in AWS, Azure, or Google Cloud platforms
  • Observability Concepts: Deep understanding of metrics, logs, and traces, including the design of alerting systems
  • Automation: Experience in scripting with Python, Bash, or PowerShell
  • Containerisation: Knowledge of Kubernetes and Docker
  • Soft Skills: Strong negotiation and communication skills to assist with project planning and problem resolution
Job Responsibility
  • Configure and support observability tools including Dynatrace, Amazon CloudWatch, Amazon CloudTrail, AWS Config, and Azure Monitor
  • Take ownership of observability monitoring policies, standards, and documentation
  • Perform fault diagnosis and root cause analysis with timely remedial action
  • Drive change and uplift IT teams through education and "evangelising" monitoring concepts
  • Provide support for AWS S3, cloud backups, and AWS RDS databases as needed
  • Lead incident response through to conclusion and manage assigned service queues

Senior Software Engineer - Cloud Infrastructure & Observability

Location: India, Bengaluru
Salary: Not provided
Company: Roku
Expiration Date: Until further notice
Requirements
  • 15+ years in software engineering with a track record of architecting distributed systems or platforms at scale
  • Strong hands‑on experience in Golang and one scripting language (e.g., Python or Shell)
  • Experience operating observability at PB-scale ingestion and hundreds of millions of series
  • Expertise in observability platforms and tooling (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenMetrics)
  • Deep experience building systems of scale and operating cloud infrastructure with Kubernetes; strong proficiency with service mesh technologies (Istio/Envoy), infrastructure‑as‑code (Terraform) and experience in multi‑cloud (AWS, GCP)
  • Demonstrated ability to evolve storage and query architectures for cost, scale, and latency (e.g., TSDB, Parquet, distributed processing)
  • Proven experience integrating security as part of infrastructure and platform development
  • Exceptional cross‑functional communication; effective collaboration with both technical and non‑technical stakeholders
Job Responsibility
  • Architect and lead Roku’s observability platform across metrics, logs, and traces; evolve data pipelines and storage layers optimized for high throughput, performance, and cost at Roku scale (TSDBs, Parquet, distributed processing)
  • Extend and harden open‑source observability systems; overhaul core components (e.g., storage layers, query paths) to improve performance, reliability, and usability at scale
  • Implement features such as pre‑aggregation, down-sampling, and sampling to reduce load and accelerate queries across the platform
  • Collaborate across platform, SRE, and product teams to migrate hundreds of workloads to our common platform; augment and automate CI/CD flows and onboarding
  • Integrate security into infrastructure and platform services; ensure robust multi‑tenant, multi‑cluster, and multi‑cloud designs
  • Contribute improvements back to open source and CNCF‑aligned projects
What we offer
  • Global access to mental health and financial wellness support and resources
  • Healthcare (medical, dental, and vision)
  • Life, accident, disability, commuter, and retirement options (401(k)/pension)
  • Time off in accordance with local leave policies
Employment type: Full-time

Staff Observability Data Infrastructure Engineer

CVS Health is seeking a highly skilled Observability Data Infrastructure Enginee...
Location: United States, Work at Home, Maryland
Salary: 130295.00 - 260590.00 USD / Year
Company: CVS Health
Expiration Date: June 30, 2026
Requirements
  • 7+ years of experience building and operating log, metric, and trace pipelines in Data, Security Data, or Observability Engineering roles
  • 5+ years of hands-on experience with Databricks, Apache Spark, or other large-scale distributed data platforms
  • 5+ years of experience working across cloud platforms (AWS, Azure, or GCP), including storage, compute, and event-driven services
  • 5+ years of production experience using SQL and Python in data-intensive environments
  • 3+ years of experience with enterprise observability platforms (Splunk, Datadog, Elastic, or equivalent)
  • 3+ years of experience with high-throughput ingestion and streaming technologies such as Cribl, Vector, or Kafka
  • 3+ years of experience designing telemetry systems aligned to OpenTelemetry (OTEL) or similar standards
  • Bachelor's degree from accredited university or equivalent work experience (HS diploma + 4 years relevant experience)
Job Responsibility
  • Design, build, and operate high-volume log, metric, and trace pipelines using Databricks, cloud data lakes, and distributed processing engines
  • Architect and evolve an Observability Lakehouse aligned with OpenTelemetry (OTEL) data models and standards
  • Implement ingestion and transformation workflows using technologies such as Cribl, Vector, Jenkins, GitHub Actions, or equivalent tools
  • Normalize, model, and enrich telemetry data to support detection engineering, forensics, and operational analytics
  • Develop scalable ETL/ELT frameworks, Delta Lake architectures, and automated data quality validation for unstructured and semi-structured data
  • Partner with Security Engineering, SRE, Cloud, and SOC teams to improve enterprise visibility and detection accuracy
  • Build and maintain CI/CD pipelines and reusable Infrastructure-as-Code (IaC) patterns for observability platform deployment
  • Identify and resolve performance, latency, cost, and reliability issues across telemetry pipelines
  • Contribute to engineering standards, documentation, and knowledge sharing across observability and security platforms
What we offer
  • Medical, dental, and vision coverage
  • Paid time off
  • Retirement savings options
  • Wellness programs
  • Bonus, commission or short-term incentive program
  • Equity award program
Employment type: Full-time

Senior Software Engineer - Cloud Infrastructure & Observability

We are building a next-generation observability and cloud platform that is high-...
Location: United Kingdom, Cambridge
Salary: Not provided
Company: Roku
Expiration Date: Until further notice
Requirements
  • Extensive experience with software engineering with a track record of architecting distributed systems or platforms at scale
  • Strong hands-on experience in Golang and one scripting language (e.g., Python or Shell)
  • Experience operating observability at PB-scale ingestion and hundreds of millions of series
  • Expertise in observability platforms and tooling (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenMetrics)
  • Deep experience building systems of scale and operating cloud infrastructure with Kubernetes; strong proficiency with service mesh technologies (Istio/Envoy), infrastructure-as-code (Terraform) and experience in multi-cloud (AWS, GCP)
  • Demonstrated ability to evolve storage and query architectures for cost, scale, and latency (e.g., TSDB, Parquet, distributed processing)
  • Proven experience integrating security as part of infrastructure and platform development
  • Exceptional cross-functional communication; effective collaboration with both technical and non-technical stakeholders
Job Responsibility
  • Architect and lead Roku’s observability platform across metrics, logs, and traces; evolve data pipelines and storage layers optimized for high throughput, performance, and cost at Roku scale (TSDBs, Parquet, distributed processing)
  • Extend and harden open-source observability systems; overhaul core components (e.g., storage layers, query paths) to improve performance, reliability, and usability at scale
  • Implement features such as pre-aggregation, down-sampling, and sampling to reduce load and accelerate queries across the platform
  • Collaborate across platform, SRE, and product teams to migrate hundreds of workloads to our common platform; augment and automate CI/CD flows and onboarding
  • Integrate security into infrastructure and platform services; ensure robust multi-tenant, multi-cluster, and multi-cloud designs
  • Contribute improvements back to open source and CNCF-aligned projects
What we offer
  • Global access to mental health and financial wellness support and resources
  • Healthcare (medical, dental, and vision)
  • Life, accident, disability, commuter, and retirement options (401(k)/pension)
  • Time off work for vacation and other personal reasons
Employment type: Full-time

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...
Location: Egypt, Giza
Salary: Not provided
Company: Rackspace
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering/computer science or equivalent
  • Senior-level experience with Site Reliability Engineering, DevOps, code-level application support and troubleshooting, AWS infrastructure design, implementation and optimization, and automation for deployment, scaling and reliability
  • Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
  • Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
  • Proactive approach to identifying problems and solutions
  • Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
  • Experience with Terraform or CloudFormation scripting
  • Experience with configuration management tools like Ansible, Chef or Puppet
  • Experience with standard software development best practices and tools such as code repositories (Git preferred)
  • Experience executing in an agile software development environment
Job Responsibility
  • Work with customers and implement Observability solutions
  • Build and maintain scalable systems and robust automation that supports engineering goals
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
  • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
  • Collaborate with team members to document and share solutions
  • Maintain a deep understanding of the customer’s business as well as their technical environment
  • Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues
Employment type: Full-time

AI Infrastructure Engineer, Core Infrastructure

As a Software Engineer on the ML Infrastructure team, you will design and build ...
Location: United States, San Francisco; Seattle; New York
Salary: 179400.00 - 310500.00 USD / Year
Company: Scale
Expiration Date: Until further notice
Requirements
  • 4+ years of experience building large-scale backend or distributed systems
  • Strong programming skills in Python, Go, or Rust, and familiarity with modern cloud-native architecture
  • Experience with containers and orchestration tools (Kubernetes, Docker) and Infrastructure as Code (Terraform)
  • Familiarity with schedulers or workload management systems (e.g., Kubernetes controllers, Slurm, Ray, internal job queues)
  • Understanding of observability and reliability practices (metrics, tracing, alerting, SLOs)
  • A track record of improving system efficiency, reliability, or developer velocity in production environments
Job Responsibility
  • Design and maintain fault-tolerant, cost-efficient systems that manage compute allocation, scheduling, and autoscaling across clusters and clouds
  • Build common abstractions and APIs that unify job submission, telemetry, and observability across serving and training workloads
  • Develop systems for usage metering, cost attribution, and quota management, enabling transparency and control over compute budgets
  • Improve reliability and efficiency of large-scale GPU workloads through better scheduling, bin-packing, preemption, and resource sharing
  • Partner with ML engineers and API teams to identify bottlenecks and define long-term architectural standards
  • Lead projects end-to-end — from requirements gathering and design to rollout and monitoring — in a cross-functional environment
What we offer
  • Comprehensive health, dental and vision coverage
  • Retirement benefits
  • A learning and development stipend
  • Generous PTO
  • Equity-based compensation
Employment type: Full-time

Research Engineer / Software Engineer (platform/core infrastructure)

Build the future of offensive security with XBOW. Attackers are already using AI...
Location: United States
Salary: 150000.00 - 350000.00 USD / Year
Company: XBOW
Expiration Date: Until further notice
Requirements
  • Strong experience building and operating scalable, distributed systems on cloud infrastructure such as AWS or similar
  • Comfortable working with infrastructure as code (e.g., Terraform, CDK)
  • A track record of performance tuning across cloud services, databases, and compute layers
  • Eager to learn new tools, languages, and technologies as needed
  • A thoughtful communicator who values clarity and simplicity and is comfortable working in a fast-paced startup and navigating ambiguity
  • Strong problem-solving skills and the ability to work with incomplete information
  • Curious, practical, and eager to work across layers of the stack when needed
  • You think proactively about failure modes and bring experience implementing disaster recovery and business continuity plans that keep critical systems running
Job Responsibility
  • Design and implement infrastructure systems that scale reliably and securely and can be deployed across multiple cloud environments (AWS, Azure, OCI, etc.) and contexts (SaaS, on-prem)
  • Tune and optimize cloud services across compute, storage, networking, and observability to drive performance, reliability and maintainability of core services
  • Develop our core services, written in TypeScript, Kotlin and Go
  • Support large-scale systems with event-driven architectures
  • Own problems end-to-end—from design through deployment to production support
  • Navigate ambiguity and help define how we build as much as what we build
  • Partner closely with other engineers, AI researchers, and security researchers to enable high-quality, high-velocity product development
  • Design for resilience by implementing disaster recovery and business continuity strategies that ensure uptime, even when things break
  • Improve how we build, deploy, and monitor services at scale
What we offer
  • Competitive salary and a generous equity package
  • Career Growth: Shape your role, lead the function, and grow with the company
  • Meaningful Work: You will tackle technically complex challenges and play a pivotal role in the growth of our business
Employment type: Full-time