Senior Software Engineer - Cloud Infrastructure & Observability Job at Roku (Cambridge)

Senior Software Engineer - Cloud Infrastructure & Observability

Location

India, Bengaluru

Salary:

Not provided

Roku

Expiration Date

Until further notice

Requirements

15+ years in software engineering with a track record of architecting distributed systems or platforms at scale
Strong hands‑on experience in Golang and one scripting language (e.g., Python or Shell)
Experience operating observability at petabyte-scale ingestion and hundreds of millions of series
Expertise in observability platforms and tooling (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenMetrics)
Deep experience building systems of scale and operating cloud infrastructure with Kubernetes
Strong proficiency with service mesh technologies (Istio/Envoy) and infrastructure‑as‑code (Terraform), plus experience in multi‑cloud environments (AWS, GCP)
Demonstrated ability to evolve storage and query architectures for cost, scale, and latency (e.g., TSDB, Parquet, distributed processing)
Proven experience integrating security as part of infrastructure and platform development
Exceptional cross‑functional communication and effective collaboration with both technical and non‑technical stakeholders

Job Responsibility

Architect and lead Roku’s observability platform across metrics, logs, and traces
Evolve data pipelines and storage layers optimized for high throughput, performance, and cost at Roku scale (TSDBs, Parquet, distributed processing)
Extend and harden open‑source observability systems
Overhaul core components (e.g., storage layers, query paths) to improve performance, reliability, and usability at scale
Implement features such as pre‑aggregation, down-sampling, and sampling to reduce load and accelerate queries across the platform
Collaborate across platform, SRE, and product teams to migrate hundreds of workloads to our common platform
Augment and automate CI/CD flows and onboarding
Integrate security into infrastructure and platform services
Ensure robust multi‑tenant, multi‑cluster, and multi‑cloud designs
Contribute improvements back to open source and CNCF‑aligned projects

What we offer

Global access to mental health and financial wellness support and resources
Healthcare (medical, dental, and vision)
Life, accident, disability, commuter, and retirement options (401(k)/pension)
Time off in accordance with local leave policies

Fulltime

Senior Software Engineer - Together Cloud Infrastructure

Together AI is building the AI Acceleration Cloud, an end-to-end platform for th...

Location

United States, San Francisco

Salary:

160,000 - 230,000 USD / Year

Together AI

Expiration Date

Until further notice

Requirements

5+ years of professional software development experience and proficiency in at least one backend programming language (Golang desired)
5+ years experience writing high-performance, well-tested, production quality code
Demonstrated experience building and operating high-performance and/or globally distributed microservice architectures across one or more cloud providers (AWS, Azure, GCP)
Excellent communication skills – able to write clear design docs and work effectively with both technical and non-technical team members
Deep experience with Kubernetes internals a big plus, such as implementing non-trivial Kubernetes operators, device/storage/network plugins, custom schedulers, or patches to these or to Kubernetes itself
Deep experience with VMs/hypervisors a big plus, such as QEMU/KVM, cloud-hypervisor, VFIO, virtio, PCIe passthrough, KubeVirt, SR-IOV
Deep experience with data center networking technologies and solutions a big plus, such as VLAN, VXLAN, VPN, VPC, OVS/OVN
Experience with Cluster API or similar a big plus
Experience working on high-performance compute, networking, and/or storage a big plus
Experience virtualizing GPUs and/or InfiniBand a big plus

Job Responsibility

Design, build, and maintain performant, secure, and highly available backend services/operators that run in our data centers and automate hardware management, such as InfiniBand partitioning, in-DC parallel storage provisioning, and VM provisioning
Design and build out the IaaS software layer for a new GB200 data center with thousands of GPUs
Work on a global multi-exabyte high-performance object store, serving massive datasets for pretraining
Build advanced observability stacks for our customers with automated node lifecycle management for fault-tolerant distributed pretraining
Perform architecture and research work for decentralized AI workloads
Work on the core, open-source Together AI platform
Create services, tools, and developer documentation
Create testing frameworks for robustness and fault-tolerance

What we offer

Competitive compensation
Startup equity
Health insurance
Other benefits
Flexibility in terms of remote work

Fulltime

Senior+ Software Engineer - Cloud Availability Platform Engineering (Observability)

We are looking for a highly skilled engineer with deep expertise in building and...

Location

United States, San Francisco

Salary:

166,000 - 201,000 USD / Year

Crusoe

Expiration Date

Until further notice

Requirements

7+ years of experience in infrastructure or platform engineering, with a focus on observability and monitoring systems
Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
Strong programming skills in Go or Python for automation, operators, and custom integrations
Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
Proven ability to design, optimize, and scale telemetry pipelines handling high cardinality and high throughput data
Solid understanding of distributed systems, performance engineering, and debugging complex workloads
Strong collaboration skills and the ability to influence engineering teams to adopt observability best practices

Job Responsibility

Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
Architecting end-to-end telemetry pipelines, including ingestion, storage, querying, and visualization
Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Cortex, Grafana, and OpenTelemetry
Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks
Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)
Ensuring reliability and cost efficiency of telemetry pipelines while supporting high-volume workloads (AI/ML, HPC clusters, GPU infrastructure)
Embedding security best practices into observability platforms, including RBAC, TLS, secret management, and multi-tenant access controls
Partnering with engineering teams to embed observability into applications, services, and infrastructure

What we offer

Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSA accounts
Paid Parental Leave
Paid life insurance, short-term and long-term disability
Teladoc
401(k) with a 100% match up to 4% of salary
Generous paid time off and holiday schedule
Cell phone reimbursement
Tuition reimbursement

Fulltime

Senior Observability Infrastructure Engineer

We are looking for an experienced Observability Infrastructure Engineer to join ...

Location

Netherlands, Amsterdam

Salary:

Not provided

Adyen

Expiration Date

Until further notice

Requirements

10+ years of experience in the observability domain or in a relevant platform/infrastructure domain.
Observability Stack Expertise: You have hands-on experience operating core telemetry data stores at scale, e.g. Elasticsearch/OpenSearch/VictoriaLogs/ClickHouse for logging, Prometheus/VictoriaMetrics for metrics, and Grafana Tempo for distributed tracing.
Linux Experience: You understand the operating system at a kernel level and can debug complex networking, file system, and performance issues on both bare metal and virtualized hardware.
Production Kubernetes Experience: Proven hands-on experience operating and troubleshooting production workloads on Kubernetes (on-prem and/or cloud), including strong day-to-day use of kubectl and Kubernetes primitives (e.g. Namespaces, Pods, Deployments/StatefulSets, Services, Ingress, ConfigMaps/Secrets).
Software Engineering Mindset: You are proficient in Go or Python and do not just write scripts; you build tools and automation platforms that treat infrastructure as code.

Job Responsibility

Build the next generation of our platform: Design and implement the future architecture of our logging and metrics systems.
Own infrastructure operations: You will take full ownership of our hybrid infrastructure, managing the lifecycle of over 1,500 servers across both bare-metal and Kubernetes environments.
Automate to reduce toil: You will write code in Go or Python to eliminate manual operational tasks.
Optimize for scale and performance: You will dive deep into performance bottlenecks within our distributed tracing and logging pipelines.
Reliability and Engineering: You will participate in on-call rotations, but your primary focus will be engineering solutions that stop alerts from firing in the first place.

Fulltime

Senior Software Engineer (Cloud & DevOps)

At 3Shape, we use cloud platforms to deliver secure, reliable services to both i...

Location

Denmark, Copenhagen

Salary:

Not provided

3Shape

Expiration Date

Until further notice

Requirements

At least 5 years of experience, including a minimum of 3 years of professional C#/.NET backend development, ideally in a cloud environment.
Minimum 3 years of hands-on DevOps/SRE experience, ideally in a role combining software development and operations.
Strong backend engineering fundamentals (design, performance, security, and maintainability).
Experience with API design, automated testing, code reviews, and building maintainable systems.
Experience with containerized workloads and Kubernetes (e.g., Azure Kubernetes Service).
Curiosity for modern engineering practices and a strong understanding of core Azure concepts (networking, compute, storage, identity, and databases).
Experience with monitoring/observability in Azure (e.g., Azure Monitor, Application Insights, Log Analytics) and incident handling is a plus.
A strong ownership mindset: automation-first, focus on reliability, and continuous improvement of quality and stability.

Job Responsibility

Design, implement, and maintain backend services in the Account domain, delivering features end-to-end from implementation and testing to deployment readiness.
Be the team's primary point of contact for DevOps topics and drive improvements across CI/CD, AKS/Kubernetes, Infrastructure as Code, observability, and platform stability.
Collaborate with platform and product teams across 3Shape to align on Azure standards and best practices, especially around Infrastructure as Code, observability, and operational readiness.
Help define actionable alerts and dashboards, improve runbooks, and build safe automation so incidents can be detected, triaged, and mitigated quickly, even outside normal working hours.

What we offer

Central Copenhagen location
An attractive healthcare package to keep you fit and well.
Breakfast every day, and a delicious and healthy lunch cooked by our private chefs.
A joint purpose: to enable dentists to provide superior dental care to every patient, every time.

Fulltime

Senior Software Engineer, Observability

You will work on core observability systems (metrics, logs, traces) while also d...

Location

India, Bengaluru

Salary:

Not provided

Roku

Expiration Date

Until further notice

Requirements

8+ years in software engineering, building distributed, high-throughput systems or observability platforms
4+ years of Go/Golang experience; our observability ecosystem is built on Go, making it the most effective language for this role
Experience with, or strong interest in, observability tools (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenTracing, OpenMetrics)
Deep understanding of distributed systems and data models
Hands-on experience with Kubernetes and cloud platforms (AWS, GCP, Azure)

Job Responsibility

Extend and integrate open-source observability systems, and when necessary, structurally overhaul core components, such as storage layers and query paths, to enhance the performance, reliability, and usability of these tools at scale
Build services to improve performance, usability, reliability, and cost efficiency
Implement features like pre-aggregation, downsampling, and sampling to reduce load and accelerate queries
Create developer-facing capabilities for metrics, logs, and traces usage, data quality, and cost management
Automate onboarding, dashboards, alerting, and tracing
Collaborate across platform and infrastructure teams to integrate observability into Roku’s cloud-native stack

What we offer

Global access to mental health and financial wellness support and resources
Healthcare (medical, dental, and vision)
Life, accident, disability, commuter, and retirement options (401(k)/pension)

Fulltime

Senior Software Engineer - Infrastructure Reliability

We are seeking a Senior Software Engineer to join our Security Product team, foc...

Location

India, Bangalore

Salary:

Not provided

JFrog

Expiration Date

Until further notice

Requirements

7+ years of experience in software engineering, with at least 3+ years focused on debugging and solving infrastructure-level problems in distributed systems
Strong proficiency in Go; familiarity with Python and Helm is a plus
Deep hands-on experience with RabbitMQ or similar message brokers (Kafka, ActiveMQ) - including queue management, clustering, monitoring, and production troubleshooting
Solid working knowledge of Kubernetes (pod lifecycle, resource management, networking, debugging CrashLoopBackOff / OOMKilled scenarios) and Docker
Experience investigating production incidents and conducting post-incident reviews with clear root cause analysis and follow-through
Strong understanding of Linux systems, networking fundamentals, and cloud infrastructure (AWS, Azure, or GCP)
Ability to read and interpret logs, thread dumps, heap dumps, and system metrics to isolate root causes under time pressure
Excellent analytical and problem-solving skills with a methodical approach to debugging
Strong written and verbal communication skills - ability to produce clear incident reports, root cause analyses, and playbooks, and to communicate effectively across engineering, SRE, and customer-facing teams

Job Responsibility

Investigate system outages and production failures across customer environments (SaaS and self-hosted), spanning RabbitMQ, Kubernetes, Docker, Postgres, and cloud infrastructure (AWS, Azure, GCP)
Identify recurring failure patterns and systemic weaknesses from incident data, and drive them to resolution - whether by writing Go code yourself (resilience features, infrastructure fixes, observability) or by collaborating with service owners to prioritize and address reliability gaps
Lead and participate in post-incident reviews - document root causes, corrective actions, and follow through to ensure issues are properly resolved
Collaborate with production engineering and SRE teams to develop and maintain operational playbooks and runbooks that reduce time-to-resolution
Diagnose root causes across the full stack - message queue failures, container lifecycle issues, cloud networking, disk and memory pressure, and deployment topology mismatches
Design and implement data migrations and lifecycle management for infrastructure components such as queue management and vhost operations
Emit and monitor operational metrics to proactively detect infrastructure degradation and measure service reliability

Senior Software Engineer – Infrastructure as Code (IaC)

Coralogix is a modern, full-stack observability platform transforming how busine...

Location

Salary:

Not provided

Coralogix

Expiration Date

Until further notice

Requirements

5 years (or more) of passion for Go and writing maintainable code
Committed to long-term API design, stability, and backward compatibility
Systemic thinker with experience in Terraform, Kubernetes, and cloud-native ecosystems
Strong sense of ownership and ability to work independently

Job Responsibility

Design, implement, and evolve a high quality public API directly used by customers
Own core infrastructure tooling like our Terraform provider and Kubernetes operator
Investigate and integrate AI/ML-driven approaches to enhance our core infrastructure tooling, platform stability, and operational efficiency
Build SDKs that promote stability and an excellent developer experience
Guide internal teams to create and maintain a stable API
Design and implement critical backend services for the Coralogix platform
Lead technical design discussions and internal RFCs
Align with teams across the company: PMs, backend teams, and platform maintainers
Contribute to emerging platform surfaces (e.g., MCP server)

Fulltime
