Senior Software Engineer

Job Description

Roku is changing how the world watches TV. Roku is the #1 TV streaming platform in the U.S., Canada, and Mexico, and we've set our sights on powering every television in the world. Roku pioneered streaming to the TV. Our mission is to be the TV streaming platform that connects the entire TV ecosystem. We connect consumers to the content they love, enable content publishers to build and monetize large audiences, and provide advertisers unique capabilities to engage consumers.

From your first day at Roku, you'll make a valuable - and valued - contribution. We're a fast-growing public company where no one is a bystander. We offer you the opportunity to delight millions of TV streamers around the world while gaining meaningful experience across a variety of disciplines.

The Platform Infrastructure team ensures that all Roku systems run smoothly. These systems support more than 100M users and billions in transaction revenue per year. We are a group of highly skilled infrastructure and software engineers who help build and operate systems at internet scale, including Platform (Kubernetes, Istio, Envoy, operators, and more) and Observability (OSS/CNCF-supported observability projects). We engage with multiple teams to achieve company-impacting results.

We are seeking a talented and experienced SRE (Site Reliability Engineering) Senior Software Engineer to join our dynamic team. The ideal candidate will have a strong background in SRE practices, cloud infrastructure management, and automation.

Job Responsibility

Design & Infrastructure
Contribute to postmortem culture by facilitating comprehensive, blameless post-incident reviews that identify root causes, contributing factors, and actionable remediation items. Track incident trends to identify systemic issues and prioritize reliability improvements
Implement chaos engineering practices to proactively identify failure modes, validate system resilience, and build confidence in recovery procedures. Conduct game days and disaster recovery exercises
SRE Process & Principles Implementation
Deploy and evolve SRE practices across the organization by establishing core SRE principles, frameworks, and methodologies. Define and implement service reliability practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets, to balance innovation velocity with system reliability
Manage Error Budgets as a mechanism for making data-driven decisions about feature velocity vs. reliability. Track, report, and enforce error budget policies, facilitating conversations between engineering and product teams about risk tolerance and release decisions
Reliability Engineering & Infrastructure
Reduce toil through automation by identifying repetitive operational work and systematically eliminating it through infrastructure-as-code, automation frameworks, and intelligent tooling. Measure and track toil reduction efforts, aiming to keep toil below 50% of team time
Implement capacity planning processes that ensure systems have adequate headroom to meet SLOs during peak traffic, unexpected load spikes, and degraded states. Develop predictive models and automated scaling mechanisms
Observability, Monitoring & Reporting
Build comprehensive observability systems that provide deep visibility into service health, performance, and user experience. Implement monitoring strategies based on the Four Golden Signals (latency, traffic, errors, saturation) and USE/RED methodologies
Create SRE dashboards and reporting mechanisms that provide real-time visibility into SLO compliance, error budget consumption, and system reliability metrics. Develop executive-level reporting on reliability trends, incident impact, and improvement initiatives
Establish alerting strategies that are actionable, symptom-based, and aligned with SLOs. Reduce alert fatigue by tuning thresholds and eliminating noise while ensuring critical issues trigger appropriate responses
Collaboration and Leadership
Partner with development teams to implement reliability from the design phase using SRE principles. Conduct design reviews focused on failure modes, scalability, observability, and operational concerns. Guide teams in building services that meet SLO requirements
Collaborate through code reviews and design reviews, ensuring infrastructure-as-code, automation scripts, and reliability improvements follow best practices, are well-documented, and maintain high-quality standards
Manage project priorities using error budgets as a decision-making framework. Leverage agile methodologies while ensuring reliability work gets appropriate prioritization alongside feature development
Operational Excellence & Continuous Improvement
Identify and eliminate performance bottlenecks through detailed analysis of metrics, traces, and profiles. Optimize system resources, tune configurations, and implement auto-scaling to ensure SLO compliance during varying load conditions
Drive continuous improvement through SRE feedback loops by analyzing SLO violations, incident trends, and toil metrics to identify systemic improvements. Champion the reliability roadmap and advocate for technical debt reduction
Maintain a culture of documentation and knowledge sharing by creating comprehensive runbooks, operational guides, system architecture documentation, and disaster recovery procedures. Ensure operational knowledge is distributed across the team
Track and report on SRE metrics, including SLO compliance rates, error budget consumption, mean time to detection (MTTD), mean time to resolution (MTTR), toil percentage, and reliability improvement velocity
On-Call & Reliability
Participate in a 12x7 on-call rotation and be available to work with global teams in the event of critical outages

Requirements

Preferably 8+ years of experience in DevOps/SRE roles, with demonstrated expertise in implementing SRE principles, SLO/SLI frameworks, and error budget policies in production environments
Deep experience with observability and monitoring platforms such as Prometheus, Grafana, Datadog, New Relic, or equivalent, including experience building custom dashboards, alerts, and SLO-based monitoring
Strong background in incident management, including experience as an Incident Commander, conducting blameless postmortems, and implementing systematic reliability improvements based on incident learnings
Strong understanding of distributed systems and reliability engineering, including failure modes, fault tolerance patterns, circuit breakers, bulkheads, rate limiting, and graceful degradation strategies
Experience with several of the following: Kubernetes, Docker, ECS, and service mesh technologies such as Istio, Envoy, Linkerd, or Solo
Experience in cloud-focused software development, preferably in Go, Python, or other object-oriented programming languages
Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or CloudFormation
Experience with CI/CD automation, including GitLab pipelines and other related tools
Strong hands-on experience with cloud platforms such as AWS, GCP or Azure
Proven track record of implementing scalable, high-performance infrastructure solutions in fast-paced, dynamic environments
Demonstrated ability to communicate clearly with both technical and non-technical project stakeholders, with the ability to work effectively in a cross-functional team environment
Self-driven and detail-oriented with the ability to understand complex distributed systems and identify reliability risks proactively
BS degree in Computer Science or equivalent

Nice to have

Certifications in relevant technologies, such as Certified Kubernetes Administrator (CKA), AWS Certified DevOps Engineer, or Certified Information Systems Security Professional (CISSP), are preferred

What we offer

global access to mental health and financial wellness support and resources
healthcare (medical, dental, and vision)
life, accident, disability, commuter, and retirement options (401(k)/pension)
time off in accordance with local leave policies
