We are looking for an experienced Observability Infrastructure Engineer to join our Platform Engineering organization. You will be part of the team responsible for building and running our observability pillars on premises and on Kubernetes. Our systems collect, process, and store the logs, metrics, and traces that allow hundreds of product teams to monitor their services in real time. This is a role for a builder and problem solver who enjoys deep technical troubleshooting across distributed systems and then turns recurring issues into automated, repeatable solutions. You will work in a large-scale environment where we manage petabytes of data and thousands of servers. We are currently in the middle of a major transformation focused on automating operations and enabling self-service for our users.
Job Responsibilities:
Build the next generation of our platform: Design and implement the future architecture of our logging and metrics systems.
Own infrastructure operations: You will take full ownership of our hybrid infrastructure, managing the lifecycle of over 1,500 servers across both bare-metal and Kubernetes environments.
Automate to reduce toil: You will write code in Go or Python to eliminate manual operational tasks.
Optimize for scale and performance: You will dive deep into performance bottlenecks within our distributed tracing and logging pipelines.
Reliability and Engineering: You will participate in on-call rotations, but your primary focus will be engineering solutions that stop alerts from firing in the first place.
Requirements:
10+ years of experience in the observability domain or in a relevant platform/infrastructure domain.
Observability Stack Expertise: You have hands-on experience operating core telemetry data stores at scale, e.g. Elasticsearch/OpenSearch/VictoriaLogs/ClickHouse for logging, Prometheus/VictoriaMetrics for metrics, and Grafana Tempo for distributed tracing.
Linux Experience: You understand the operating system at the kernel level and can debug complex networking, file system, and performance issues on both bare-metal and virtualized hardware.
Production Kubernetes Experience: Proven hands-on experience operating and troubleshooting production workloads on Kubernetes (on-prem and/or cloud), including strong day-to-day use of kubectl and Kubernetes primitives (e.g. Namespaces, Pods, Deployments/StatefulSets, Services, Ingress, ConfigMaps/Secrets).
Software Engineering Mindset: You are proficient in Go or Python and do not just write scripts; you build tools and automation platforms that treat infrastructure as code.
Nice to have:
Experience with large-scale, multi-tenant isolation and with quota or cost governance approaches for telemetry platforms.
Familiarity with regulated environments where security, auditability, and data handling requirements shape platform design decisions.