CrawlJobs Logo

Senior Software Engineer - Cloud Infrastructure & Observability

United Kingdom, Cambridge · Job Posted April 23, 2026
Apply Position
Job Link Share

Job Description

We are building a next-generation observability and cloud platform that is high-performance, cost-efficient, secure, and scalable across multi-region, multi-cloud clusters. You will lead the architecture and evolution of Roku’s observability and cloud infrastructure stack. This includes metrics, logs, traces, telemetry pipelines, service mesh, developer experience, and reliability of systems that power thousands of services and millions of devices. You will drive a vision where developers gain deep visibility with minimal overhead, onboarding is seamless, and insights are available in real time. Your work will directly help Roku scale efficiently while maintaining reliability, cost control, and performance.

Job Responsibility

  • Architect and lead Roku’s observability platform across metrics, logs, and traces
  • evolve data pipelines and storage layers optimized for high throughput, performance, and cost at Roku scale (TSDBs, Parquet, distributed processing)
  • Extend and harden open-source observability systems
  • overhaul core components (e.g., storage layers, query paths) to improve performance, reliability, and usability at scale
  • Implement features such as pre-aggregation, down-sampling, and sampling to reduce load and accelerate queries across the platform
  • Collaborate across platform, SRE, and product teams to migrate hundreds of workloads to our common platform
  • augment and automate CI/CD flows and onboarding
  • Integrate security into infrastructure and platform services
  • ensure robust multi-tenant, multi-cluster, and multi-cloud designs
  • Contribute improvements back to open source and CNCF-aligned projects
  • shape standards adoption (OpenTelemetry, OpenMetrics) across the company
  • Mentor engineers
  • establish best practices for reliability, efficiency, and cost management across service mesh and observability domains

Requirements

  • Extensive experience with software engineering with a track record of architecting distributed systems or platforms at scale
  • Strong hands-on experience in Golang and one scripting language (e.g., Python or Shell)
  • Experience operating observability at pb-scale ingestion and hundreds of millions of series
  • Expertise in observability platforms and tooling (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenMetrics)
  • Deep experience building systems of scale and operating cloud infrastructure with Kubernetes
  • strong proficiency with service mesh technologies (Istio/Envoy), infrastructure-as-code (Terraform) and experience in multi-cloud (AWS, GCP)
  • Demonstrated ability to evolve storage and query architectures for cost, scale, and latency (e.g., TSDB, Parquet, distributed processing)
  • Proven experience integrating security as part of infrastructure and platform development
  • Exceptional cross-functional communication
  • effective collaboration with both technical and non-technical stakeholders
  • Culture fit: independent thinker, pragmatic problem-solver, low-ego collaborator who moves fast and focuses on company success
  • Experience integrating AI tools to improve processes and reduce toil
  • Open-source contributions in CNCF projects is preferred but not mandatory

Nice to have

Open-source contributions in CNCF projects

What we offer

  • Global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life, accident, disability, commuter, and retirement options (401(k)/pension)
  • time off work for vacation and other personal reasons

Looking for more opportunities?

Search for other job offers that match your skills and interests.

Similar Jobs for

Senior Software Engineer - Cloud Infrastructure & Observability

8 matching positions

Senior Software Engineer - Cloud Infrastructure & Observability

Location
Location
India , Bengaluru
Salary
Salary:
Not provided
roku.com Logo
Roku
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 15+ years in software engineering with a track record of architecting distributed systems or platforms at scale
  • Strong hands‑on experience in Golang and one scripting language (e.g., Python or Shell)
  • Experience operating observability at pb-scale ingestion and hundreds of millions of series
  • Expertise in observability platforms and tooling (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenMetrics)
  • Deep experience building systems of scale and operating cloud infrastructure with Kubernetes
  • strong proficiency with service mesh technologies (Istio/Envoy), infrastructure‑as‑code (Terraform) and experience in multi‑cloud (AWS, GCP)
  • Demonstrated ability to evolve storage and query architectures for cost, scale, and latency (e.g., TSDB, Parquet, distributed processing)
  • Proven experience integrating security as part of infrastructure and platform development
  • Exceptional cross‑functional communication
  • effective collaboration with both technical and non‑technical stakeholders
Job Responsibility
Job Responsibility
  • Architect and lead Roku’s observability platform across metrics, logs, and traces
  • evolve data pipelines and storage layers optimized for high throughput, performance, and cost at Roku scale (TSDBs, Parquet, distributed processing)
  • Extend and harden open‑source observability systems
  • overhaul core components (e.g., storage layers, query paths) to improve performance, reliability, and usability at scale
  • Implement features such as pre‑aggregation, down-sampling, and sampling to reduce load and accelerate queries across the platform
  • Collaborate across platform, SRE, and product teams to migrate hundreds of workloads to our common platform
  • augment and automate CI/CD flows and onboarding
  • Integrate security into infrastructure and platform services
  • ensure robust multi‑tenant, multi‑cluster, and multi‑cloud designs
  • Contribute improvements back to open source and CNCF‑aligned projects
What we offer
What we offer
  • Global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life, accident, disability, commuter, and retirement options (401(k)/pension)
  • time off in accordance with local leave policies
  • Fulltime
Read More
Arrow Right

Senior Software Engineer - Together Cloud Infrastructure

Together AI is building the AI Acceleration Cloud, an end-to-end platform for th...
Location
Location
United States , San Francisco
Salary
Salary:
160000.00 - 230000.00 USD / Year
together.ai Logo
Together AI
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
  • 5+ years experience writing high-performance, well-tested, production quality code
  • Demonstrated experience with building and operating high-performance and/or globally distributed micro-service architectures across one or more cloud providers (AWS, Azure, GCP)
  • Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
  • Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches thereon or Kubernetes itself
  • Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIE passthrough, Kubevirt, SR-IOV
  • Deep experience with DC networking tech + solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
  • Experience with Cluster API or similar a big plus
  • Experience working on high-performance compute, networking, and/or storage a big plus
  • Experience virtualizing GPUs and/or Infiniband a big plus
Job Responsibility
Job Responsibility
  • Design, build, and maintain performant, secure, and highly-available backend services/operators that run in our data centers and automate hardware management, such as Infiniband partitioning, in-DC parallel storage provisioning, and VM provisioning
  • Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs
  • Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining
  • Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining
  • Perform architecture and research work for decentralized AI workloads
  • Work on the core, open-source Together AI platform
  • Create services, tools, and developer documentation
  • Create testing frameworks for robustness and fault-tolerance
What we offer
What we offer
  • competitive compensation
  • startup equity
  • health insurance
  • other benefits
  • flexibility in terms of remote work
  • Fulltime
Read More
Arrow Right

Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

We are looking for a highly skilled engineer with deep expertise in building and...
Location
Location
United States , San Francisco
Salary
Salary:
166000.00 - 201000.00 USD / Year
crusoe.ai Logo
Crusoe
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems
  • Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/Opensearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
  • Strong programming skills in Go or Python for automation, operators, and custom integrations
  • Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
  • Proven ability to design, optimize, and scale telemetry pipelines handling high cardinality and high throughput data
  • Solid understanding of distributed systems, performance engineering, and debugging complex workloads
  • Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices
Job Responsibility
Job Responsibility
  • Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
  • Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
  • Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
  • Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/Opensearch stacks
  • Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
  • Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
  • Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
  • Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
  • Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
  • Partnering with engineering teams to embed observability into applications, services, and infrastructure
What we offer
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime
Read More
Arrow Right

Senior Observability Infrastructure Engineer

We are looking for an experienced Observability Infrastructure Engineer to join ...
Location
Location
Netherlands , Amsterdam
Salary
Salary:
Not provided
adyen.com Logo
Adyen
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 10+ years of experience in the observability domain or in a relevant platform/infrastructure domain.
  • Observability Stack Expertise: You have hands-on experience operating core telemetry data stores at scale e.g. Elasticsearch/Opensearch/VictoriaLogs/Clickhouse for logging, Prometheus/ VictoriaMetrics for metrics and Grafana Tempo for distributed tracing.
  • Linux Experience: You understand the operating system at a kernel level and can debug complex networking, file system, and performance issues on both bare metal and virtualized hardware .
  • Production Kubernetes Experience: Proven hands-on experience operating, and troubleshooting production workloads on Kubernetes (on-prem and/or cloud), including strong day-to-day use of kubectl and Kubernetes primitives (e.g. Namespaces, Pods, Deployments/StatefulSets, Services, Ingress, ConfigMaps/Secrets)
  • Software Engineering Mindset: You are proficient in Go or Python and do not just write scripts
  • you build tools and automation platforms that treat infrastructure as code.
Job Responsibility
Job Responsibility
  • Build the next generation of our platform: Design and implement the future architecture of our logging and metrics systems.
  • Own infrastructure operations: You will take full ownership of our hybrid infrastructure, managing the lifecycle of over 1,500 servers across both bare-metal and Kubernetes environments.
  • Automate to reduce toil: You will write code in Go or Python to eliminate manual operational tasks.
  • Optimize for scale and performance: You will dive deep into performance bottlenecks within our distributed tracing and logging pipelines.
  • Reliability and Engineering: You will participate in on-call rotations, but your primary focus will be engineering solutions that stop alerts from firing in the first place.
  • Fulltime
Read More
Arrow Right

Senior Software Engineer (Cloud & DevOps)

At 3Shape, we use cloud platforms to deliver secure, reliable services to both i...
Location
Location
Denmark , Copenhagen
Salary
Salary:
Not provided
3shape.com Logo
3Shape
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5 years of experience, of which Minimum 3 years of professional C#/.NET backend development experience, ideally in a cloud environment.
  • Minimum 3 years of hands-on DevOps/SRE experience, ideally in a role combining software development and operations.
  • Strong backend engineering fundamentals (design, performance, security, and maintainability).
  • Experience with API design, automated testing, code reviews, and building maintainable systems.
  • Experience with containerized workloads and Kubernetes (e.g., Azure Kubernetes Service).
  • Curiosity for modern engineering practices and a strong understanding of core Azure concepts (networking, compute, storage, identity, and databases).
  • Experience with monitoring/observability in Azure (e.g., Azure Monitor, Application Insights, Log Analytics) and incident handling is a plus.
  • A strong ownership mindset: automation-first, focus on reliability, and continuous improvement of quality and stability.
Job Responsibility
Job Responsibility
  • Design, implement, and maintain backend services in the Account domain, delivering features end-to-end from implementation and testing to deployment readiness.
  • Be the team's primary point of contact for DevOps topics and drive improvements across CI/CD, AKS/Kubernetes, Infrastructure as Code, observability, and platform stability.
  • Collaborate with platform and product teams across 3Shape to align on Azure standards and best practices especially around Infrastructure as Code, observability, and operational readiness.
  • Help define actionable alerts and dashboards, improve runbooks, and build safe automation so incidents can be detected, triaged, and mitigated quickly even outside normal working hours.
What we offer
What we offer
  • Central Copenhagen location
  • An attractive healthcare package to keep you fit and well.
  • Breakfast every day, and a delicious and healthy lunch cooked by our private chefs.
  • A joint purpose: to enable dentists to provide superior dental care to every patient, every time.
  • Fulltime
Read More
Arrow Right
New

Senior Software Engineer, Observability

You will work on core observability systems (metrics, logs, traces) while also d...
Location
Location
India , Bengaluru
Salary
Salary:
Not provided
roku.com Logo
Roku
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8+ years in software engineering, building distributed, high-throughput systems or observability platforms
  • 4+ years of Go/Golang experience
  • our observability ecosystem is built on Go, making it the most effective language for this role
  • Experience with, or strong interest in, observability tools (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, Clickhouse) and standards (OpenTelemetry, OpenTracing, OpenMetrics)
  • Deep understanding of distributed systems and data models
  • Hands-on experience with Kubernetes and cloud platforms (AWS, GCP, Azure)
Job Responsibility
Job Responsibility
  • Extend and integrate open-source observability systems, and when necessary, structurally overhaul core components, such as storage layers and query paths, to enhance the performance, reliability, and usability of these tools at scale
  • Build services to improve performance, usability, reliability, and cost efficiency
  • Implement features like pre-aggregation, downsampling, and sampling to reduce load and accelerate queries
  • Create developer-facing capabilities for metrics, logs, and traces usage, data quality, and cost management
  • Automate onboarding, dashboards, alerting, and tracing
  • Collaborate across platform and infrastructure teams to integrate observability into Roku’s cloud-native stack
What we offer
What we offer
  • global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life, accident, disability, commuter, and retirement options (401(k)/pension)
  • Fulltime
Read More
Arrow Right

Senior Software Engineer - Infrastructure Reliability

We are seeking a Senior Software Engineer to join our Security Product team, foc...
Location
Location
India , Bangalore
Salary
Salary:
Not provided
jfrog.com Logo
JFrog
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 7+ years of experience in software engineering, with at least 3+ years focused on debugging and solving infrastructure-level problems in distributed systems
  • Strong proficiency in Go
  • familiarity with Python and Helm is a plus
  • Deep hands-on experience with RabbitMQ or similar message brokers (Kafka, ActiveMQ) - including queue management, clustering, monitoring, and production troubleshooting
  • Solid working knowledge of Kubernetes (pod lifecycle, resource management, networking, debugging CrashLoopBackOff / OOMKilled scenarios) and Docker
  • Experience investigating production incidents and conducting post-incident reviews with clear root cause analysis and follow-through
  • Strong understanding of Linux systems, networking fundamentals, and cloud infrastructure (AWS, Azure, or GCP)
  • Ability to read and interpret logs, thread dumps, heap dumps, and system metrics to isolate root causes under time pressure
  • Excellent analytical and problem-solving skills with a methodical approach to debugging
  • Strong written and verbal communication skills - ability to produce clear incident reports, root cause analyses, and playbooks, and to communicate effectively across engineering, SRE, and customer-facing teams
Job Responsibility
Job Responsibility
  • Investigate system outages and production failures across customer environments (SaaS and self-hosted), spanning RabbitMQ, Kubernetes, Docker, Postgres, and cloud infrastructure (AWS, Azure, GCP)
  • Identify recurring failure patterns and systemic weaknesses from incident data, and drive them to resolution - whether by writing Go code yourself (resilience features, infrastructure fixes, observability) or by collaborating with service owners to prioritize and address reliability gaps
  • Lead and participate in post-incident reviews - document root causes, corrective actions, and follow through to ensure issues are properly resolved
  • Collaborate with production engineering and SRE teams to develop and maintain operational playbooks and runbooks that reduce time-to-resolution
  • Diagnose root causes across the full stack - message queue failures, container lifecycle issues, cloud networking, disk and memory pressure, and deployment topology mismatches
  • Design and implement data migrations and lifecycle management for infrastructure components such as queue management and vhost operations
  • Emit and monitor operational metrics to proactively detect infrastructure degradation and measure service reliability
Read More
Arrow Right

Senior Software Engineer – Infrastructure as Code (IaC)

Coralogix is a modern, full-stack observability platform transforming how busine...
Location
Location
Salary
Salary:
Not provided
coralogix.com Logo
Coralogix
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5 years (or more) of passion for Go and writing maintainable code
  • Committed to long-term API design, stability, and backward compatibility
  • Systemic thinker with experience in Terraform, Kubernetes, and cloud-native ecosystems
  • Strong sense of ownership and ability to work independently
Job Responsibility
Job Responsibility
  • Design, implement, and evolve a high quality public API directly used by customers
  • Own core infrastructure tooling like our Terraform provider and Kubernetes operator
  • Investigate and integrate AI/ML-driven approaches to enhance our core infrastructure tooling, platform stability, and operational efficiency
  • Build SDKs that promote stability and an excellent developer experience
  • Guide internal teams to create and maintain a stable API
  • Design, and implement critical backend services for the Coralogix platform
  • Lead technical design discussions and internal RFCs
  • Align with teams across the company: PMs, backend teams, and platform maintainers
  • Contribute to emerging platform surfaces (e.g., MCP server)
  • Fulltime
Read More
Arrow Right