Senior Software Engineer - SRE

India, Bengaluru · Job Posted June 09, 2026

Job Description

Roku is changing how the world watches TV. Roku is the #1 TV streaming platform in the U.S., Canada, and Mexico, and we've set our sights on powering every television in the world. Roku pioneered streaming to the TV. Our mission is to be the TV streaming platform that connects the entire TV ecosystem. We connect consumers to the content they love, enable content publishers to build and monetize large audiences, and provide advertisers unique capabilities to engage consumers.

From your first day at Roku, you'll make a valuable - and valued - contribution. We're a fast-growing public company where no one is a bystander. We offer you the opportunity to delight millions of TV streamers around the world while gaining meaningful experience across a variety of disciplines.

The Platform Infrastructure team ensures that all Roku systems run smoothly. These systems support 100M+ users and billions in transaction revenue per year. We are a group of highly skilled infrastructure and software engineers who help build and operate systems at internet scale, including Platform (Kubernetes, Istio, Envoy, operators, and more) and Observability (OSS/CNCF-supported observability projects). We engage with multiple teams to achieve company-impacting results.

We are seeking a talented and experienced SRE (Site Reliability Engineering) Senior Software Engineer to join our dynamic team. The ideal candidate will have a strong background in SRE practices, cloud infrastructure management, and automation.

Job Responsibility

  • Design & Infrastructure
  • Contribute to postmortem culture by facilitating comprehensive, blameless post-incident reviews that identify root causes, contributing factors, and actionable remediation items. Track incident trends to identify systemic issues and prioritize reliability improvements
  • Implement chaos engineering practices to proactively identify failure modes, validate system resilience, and build confidence in recovery procedures. Conduct game days and disaster recovery exercises
  • SRE Process & Principles Implementation
  • Deploy and evolve SRE practices across the organization by establishing core SRE principles, frameworks, and methodologies. Define and implement service reliability practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets, to balance innovation velocity with system reliability
  • Manage Error Budgets as a mechanism for making data-driven decisions about feature velocity vs. reliability. Track, report, and enforce error budget policies, facilitating conversations between engineering and product teams about risk tolerance and release decisions
  • Reliability Engineering & Infrastructure
  • Reduce toil through automation by identifying repetitive operational work and systematically eliminating it through infrastructure-as-code, automation frameworks, and intelligent tooling. Measure and track toil reduction efforts, aiming to keep toil below 50% of team time
  • Implement capacity planning processes that ensure systems have adequate headroom to meet SLOs during peak traffic, unexpected load spikes, and degraded states. Develop predictive models and automated scaling mechanisms
  • Observability, Monitoring & Reporting
  • Build comprehensive observability systems that provide deep visibility into service health, performance, and user experience. Implement monitoring strategies based on the Four Golden Signals (latency, traffic, errors, saturation) and USE/RED methodologies
  • Create SRE dashboards and reporting mechanisms that provide real-time visibility into SLO compliance, error budget consumption, and system reliability metrics. Develop executive-level reporting on reliability trends, incident impact, and improvement initiatives
  • Establish alerting strategies that are actionable, symptom-based, and aligned with SLOs. Reduce alert fatigue by tuning thresholds and eliminating noise while ensuring critical issues trigger appropriate responses
  • Collaboration and Leadership
  • Partner with development teams to implement reliability from the design phase using SRE principles. Conduct design reviews focused on failure modes, scalability, observability, and operational concerns. Guide teams in building services that meet SLO requirements
  • Collaborate through code reviews and design reviews, ensuring infrastructure-as-code, automation scripts, and reliability improvements follow best practices, are well-documented, and maintain high-quality standards
  • Manage project priorities using error budgets as a decision-making framework. Leverage agile methodologies while ensuring reliability work gets appropriate prioritization alongside feature development
  • Operational Excellence & Continuous Improvement
  • Identify and eliminate performance bottlenecks through detailed analysis of metrics, traces, and profiles. Optimize system resources, tune configurations, and implement auto-scaling to ensure SLO compliance during varying load conditions
  • Drive continuous improvement through SRE feedback loops by analyzing SLO violations, incident trends, and toil metrics to identify systemic improvements. Champion the reliability roadmap and advocate for technical debt reduction
  • Maintain a culture of documentation and knowledge sharing by creating comprehensive runbooks, operational guides, system architecture documentation, and disaster recovery procedures. Ensure operational knowledge is distributed across the team
  • Track and report on SRE metrics, including SLO compliance rates, error budget consumption, mean time to detection (MTTD), mean time to resolution (MTTR), toil percentage, and reliability improvement velocity
  • On-call & reliability
  • Participate in a 12x7 on-call rotation and be available to work with global teams in the event of critical outages

Requirements

  • Preferably 8+ years of experience in DevOps/SRE roles, with demonstrated expertise in implementing SRE principles, SLO/SLI frameworks, and error budget policies in production environments
  • Deep experience with observability and monitoring platforms such as Prometheus, Grafana, Datadog, New Relic, or equivalent, including experience building custom dashboards, alerts, and SLO-based monitoring
  • Strong background in incident management, including experience as an Incident Commander, conducting blameless postmortems, and implementing systematic reliability improvements based on incident learnings
  • Strong understanding of distributed systems and reliability engineering, including failure modes, fault tolerance patterns, circuit breakers, bulkheads, rate limiting, and graceful degradation strategies
  • Experience with a number of the following: Kubernetes, Docker, Service Mesh such as Istio, Envoy, Linkerd, Solo & ECS
  • Experience in cloud-focused software development, preferably in Go, Python, or other object-oriented programming languages
  • Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or CloudFormation
  • Experience with CI/CD automation, including GitLab pipelines and other related tools
  • Strong hands-on experience with cloud platforms such as AWS, GCP or Azure
  • Proven track record of implementing scalable, high-performance infrastructure solutions in fast-paced, dynamic environments
  • Demonstrated ability to communicate clearly with both technical and non-technical project stakeholders, with the ability to work effectively in a cross-functional team environment
  • Self-driven and detail-oriented with the ability to understand complex distributed systems and identify reliability risks proactively
  • BS Degree in Computer Science or Equivalent

Nice to have

Certifications in relevant technologies, such as Certified Kubernetes Administrator (CKA), AWS Certified DevOps Engineer, or Certified Information Systems Security Professional (CISSP), are preferred

What we offer

  • global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life, accident, disability, commuter, and retirement options (401(k)/pension)
  • time off in accordance with local leave policies


Similar Jobs for Senior Software Engineer - SRE

8 matching positions

Senior Software Engineer - SRE

Hybrid: This role is categorized as hybrid and is expected to report to Austin ...
Location: United States, Austin; Warren
Salary: Not provided
Company: General Motors
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in computer science or a related field, or equivalent work experience
  • 7-10 years of software experience with strong proficiency in PostgreSQL and at least one other database technology (Oracle, SQL Server)
  • Proficiency in at least one programming language (e.g., Python, Go, Java) and familiarity with multiple language ecosystems
  • Solid understanding of operating systems, networking, distributed systems, databases, and storage architectures
  • Deep understanding of how code runs on underlying hardware, including operating systems, algorithms, and data structures
  • Ability to optimize or troubleshoot code by understanding its execution and the impact on system resources
  • Experience handling production incidents, including root cause analysis, mitigation, and working through complex system failures
  • Strong communication skills, with an ability to explain technical concepts to both engineering and business stakeholders
  • Commitment to collaborative problem-solving and shared ownership of services
  • Proven experience in automating manual processes, building deployment pipelines, or managing configuration systems
Job Responsibility
  • Develop tools and software to automate operational processes, improve system reliability, and reduce manual intervention
  • Lead, implement, and improve monitoring and observability frameworks, enabling proactive detection and resolution of incidents
  • Participate in an on-call rotation to diagnose, troubleshoot, and mitigate production incidents, ensuring minimal downtime and swift resolution
  • Work alongside developers to ensure the quality, scalability, and reliability of our database services
  • Practice shared ownership of services in production, fostering a "You build it, you run it" culture
  • Manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to set reliability expectations effectively
  • Conduct deep-dive analyses of incidents and collaborate on post-incident reviews to derive learnings and prevent recurrence
  • Champion a culture of continuous improvement
  • Evaluate system performance and advocate for optimizations that reduce infrastructure costs while maintaining service reliability
  • Fulltime

Senior Software Engineer and Principal Software Engineer

We are building a planet-scale multi-modal database and infrastructure for execu...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, or Java
  • OR Equivalent experience
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Java
  • OR equivalent experience
  • Experience in shipping products and scalable, reliable services
  • Currently programming/coding in your current or most recent role
  • Hands on experience with asynchronous programming and concurrency (threads, tasks, futures, async/await)
  • Experience with Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and/or Google Kubernetes Engine (GKE)
  • Experience in building database engines, query engines, indexing solutions (columnar, full-text, vector), at scale
  • Experience with programming CUDA, AI systems at scale
Job Responsibility
  • Independently execute in the face of ambiguity
  • Leads identification of dependencies and the development of design documents for a product, application, service, or platform
  • Writes efficient systems code and is able to debug distributed systems
  • Holds accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions
  • Fulltime

Senior Software Engineer, SRE

Abridge’s services and engineering team are in hyperscale mode. We are looking f...
Location: United States, SF Office, NYC Office
Salary: 210800.00 - 248000.00 USD / Year
Company: Abridge
Expiration Date: Until further notice
Requirements
  • 8+ years of software engineering experience focused on distributed systems or tooling, with an interest in engineering enablement and software scaling
  • At least 2 years experience as a back-end engineer focused on system performance and scalability
  • Experience reducing latency in software by multiples through leveraging observability and profiling tools
  • Experience building on Kubernetes and scaling compute services on Kubernetes; experience with related cloud native technologies including ArgoCD, Argo Rollouts, Istio, etc.
  • Comfortable implementing and securing services in Google Cloud Platform with Infrastructure as Code, including GCP Projects, VPC Networks, Google Kubernetes Engine, and IAM Roles, Groups and policies
  • Experience building software with backend languages (e.g. Python, GoLang, Node, and Rust)
  • Experience monitoring distributed systems with Prometheus, OpenTelemetry Collector, and Grafana (or something similar), including metrics collection, visualization, alerting, and using observability data to drive performance optimizations
  • Passion for engineering enablement and solving software and distributed systems scaling challenges under pressure
  • Must be willing to travel up to 10%
Job Responsibility
  • Leverage load testing, chaos engineering, and other test practices to identify performance and latency bottlenecks across all of our systems, and make changes to application code to resolve them
  • Drive software changes that can rehome applications at the code level onto new infrastructure (run times, event driven infrastructure, databases, and more) in order to dramatically improve scalability as well as enable multi-tenant deployments
  • Identify and implement software configuration changes and performance tuning parameters that will dramatically improve performance and scalability
  • Build developer tools and software modules that help engineers build code faster and more effectively with more enablements to the entire engineering organization
  • Work with the Platform team to develop, and application teams to adopt, emerging elements of our internal developer platform, such as service templates and self-serve infrastructure
  • Work with application teams to establish and adopt SLOs and error budgets, and drive better metrics for application health that can drive automated canary releases, improved health monitoring, and better engineering practices
  • Uplevel our ability to respond to incidents by improving observability, runbooks, and incident response muscle across the organization
  • Evangelize, document, and train the engineering team on the solutions being built and uplevel them on cloud native design strategies and tools
  • Be a public evangelist for Abridge in the global platform engineering community, including conferences, open source, and research as we pioneer new AI-first cloud-native-first security-first implementations at scale
What we offer
  • Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
  • Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
  • Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
  • Paid Parental Leave: Generous paid parental leave for all full-time employees
  • Family Forming Benefits: Resources and financial support to help you build your family
  • 401(k) Matching: Contribution matching to help invest in your future
  • Personal Device Allowance: Tax free funds for personal device usage
  • Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
  • Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
  • Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals
  • Fulltime

Senior Software Engineer/ SE II (DevOps/ SRE)

We are looking for DevOps/SRE Engineers to join the Optimizely team in Dhaka.
Location: Bangladesh, Dhaka
Salary: Not provided
Company: Optimizely
Expiration Date: Until further notice
Requirements
  • AWS & GCP experience (multi-account, multi-region)
  • Kubernetes & container orchestration (EKS, Helm, Docker)
  • Terraform / Infrastructure-as-Code at scale
  • Automation scripting (Python, Bash, Fabric)
  • Experience managing scalable, fault-tolerant distributed infrastructure
  • Others: Datadog, Atlantis, Karpenter, Spark/EMR
  • Should be comfortable contributing code to service repositories if necessary (e.g. Node/Python/Golang)
  • Minimum experience 3+ years
  • Bachelor’s Degree (Computer Science or engineering preferred) or equivalent work experience
Job Responsibility
  • Multi-cloud infrastructure spanning multiple AWS accounts and GCP projects
  • 50+ microservices running on both EKS and GKE with auto-scaling
  • 36+ Terraform modules, 149+ Ansible roles, and more
  • Real-time data pipelines with Kinesis, Redshift, OpenSearch, and MongoDB Atlas
  • Self-managed OpenSearch, RabbitMQ, and other services
  • GitOps workflows powered by Atlantis with automated plan/apply cycles
  • CI/CD across 250+ Jenkins pipelines and GitHub Actions
  • Fulltime

Senior Software Engineer - Kubernetes & ServiceMesh

Join us in building Roku’s next-generation cloud-agnostic platform that powers K...
Location: India, Bengaluru
Salary: Not provided
Company: Roku
Expiration Date: Until further notice
Requirements
  • Strong hands-on experience with cloud technologies (AWS preferred; GCP or Azure is a plus), specifically in architecting and managing performant, large-scale systems handling significant traffic/data
  • Deep knowledge of Kubernetes (EKS, GKE, AKS, or similar) and service mesh technologies
  • Proficiency in Go or another programming language, and in Python or another scripting language
  • Experience designing infrastructure and building automation tools, while collaborating with internal team members and external stakeholders
  • Experience building CI/CD pipelines and following modern deployment practices
  • Familiarity with observability tools (Prometheus, Thanos, Loki, Grafana, etc.)
  • Ability to work independently and communicate effectively with technical and non-technical stakeholders
  • Passion for learning and solving complex infrastructure challenges
  • Experience integrating AI tools to improve processes and reduce operational toil (a plus)
Job Responsibility
  • Architect, design, and deploy Roku’s next-generation cloud platform and service mesh
  • Build and own solutions to Roku's compute problems using Docker, Kubernetes, Istio/Envoy, Terraform and scripting to evolve our tech stack and deployments
  • Proactively drive the research and implementation of new technologies to enhance scalability, reliability, and developer experience
  • Integrate security best practices into infrastructure design and automation
  • Build tooling to visualize inefficiencies and optimize costs across shared-tenancy clusters, including network traffic insights, cross-cluster communication efficiency, and cost attribution
  • Collaborate with internal teams to migrate workloads to Kubernetes + Istio, leveraging open-source observability tools
  • Work closely with the Observability team to scale monitoring and logging solutions for a holistic view of the platform
  • Leverage SRE principles to maintain high availability and streamline onboarding workflows
  • Mentor team members and help define best practices for infrastructure and automation
What we offer
  • global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life insurance
  • accident insurance
  • disability insurance
  • commuter benefits
  • retirement options (401(k)/pension)
  • time off
  • Fulltime

Senior Software Engineer

In Microsoft’s CoreAI division, the Azure SRE Agent Platform team builds and run...
Location: India, Hyderabad
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science, or equivalent practical experience
  • 7+ years of experience building production software using one or more modern programming languages such as C#, C++, Go, Java or Python
  • Strong understanding of Generative AI & software engineering fundamentals, data structures, and problem-solving
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Ability to pass the Microsoft Cloud background check upon hire/transfer and every two years
Job Responsibility
  • Take ownership of important areas of the Azure SRE Agent Platform, including agent capabilities, orchestration, evaluation, user experiences on different form factors and supporting platform services
  • Build and iterate on agentic systems, including tools, planning and execution loops, evaluations, and safety mechanisms
  • Design and ship reliable capabilities that improve incident detection, diagnosis, mitigation, and operational learning
  • Use telemetry, experiments, evaluations, and user feedback to guide iteration and investment
  • Contribute to resilient, observable systems that operate safely and effectively in production
  • Partner closely with engineers, SREs, and product counterparts to turn ambiguous problems into high-quality shipped solutions
  • Participate in debugging, live-site learning, and post-incident hardening to continuously improve system quality
  • Contribute to architecture, engineering standards, and development practices across the team
  • Fulltime

Senior Software Engineer

The Firefox Monitor Engineering Team builds tools that help people understand an...
Location: United States
Salary: Not provided
Company: Mozilla
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in software development with a strong focus on backend technologies
  • Deep expertise in Node.js and TypeScript, with experience building and leading backend engineering projects
  • Proficiency with PostgreSQL and SQL query optimization; experience with query builders such as Knex is a plus
  • Experience deploying and operating applications on Kubernetes
  • Experience with GCP (Pub/Sub, Cloud Logging) with a solid understanding of DevOps and SRE collaboration
  • Experience with Infrastructure as Code tools such as Terraform
  • Experience with AWS (S3) or similar cloud storage services
  • Hands-on experience with observability tooling including OpenTelemetry, Sentry, Prometheus, and Grafana
  • Familiarity with Redis for caching and session management
Job Responsibility
  • Lead backend development in Node.js and TypeScript, building and maintaining server-side logic within a Next.js full-stack architecture
  • Design, implement, and maintain integrations with external data sources such as Have I Been Pwned (HIBP) and other breach intelligence providers, with a focus on data privacy and security
  • Build and maintain event-driven systems using Google Cloud Pub/Sub, and own cloud infrastructure on GCP (GKE) and AWS (S3, SES)
  • Own and evolve the data layer, including PostgreSQL schema design and query optimization using Knex, and Redis caching strategies
  • Work closely with our SRE team to maintain and improve production environments, including monitoring and alerting with OpenTelemetry, Sentry, Prometheus, and Grafana
  • Triage and resolve production issues, partnering with SRE and support teams to investigate incidents, address bug reports, and keep the application running reliably
  • Periodically rotate into a Base Load Engineer (BLE) role, handling releases, dependency updates, and incoming work requests from customer support and other stakeholders
  • Partner with and support the frontend team in their work with React, TypeScript, Next.js, and SCSS, ensuring backend systems, APIs, and data contracts meet their needs
  • Partner with cross-functional teams to align on project goals, ensure seamless frontend-backend integration, and contribute to API design and evaluations
  • Participate in code reviews to maintain high standards of code quality and system reliability
What we offer
  • Generous performance-based bonus plans
  • Rich medical, dental, and vision coverage
  • Generous retirement contributions with 100% immediate vesting
  • Quarterly all-company wellness days
  • Country specific holidays plus a day off for your birthday
  • One-time home office stipend
  • Annual professional development budget
  • Quarterly well-being stipend
  • Considerable paid parental leave
  • Employee referral bonus program
  • Fulltime

Senior Software Engineer - Cloud Infrastructure & Observability

Location: India, Bengaluru
Salary: Not provided
Company: Roku
Expiration Date: Until further notice
Requirements
  • 15+ years in software engineering with a track record of architecting distributed systems or platforms at scale
  • Strong hands‑on experience in Golang and one scripting language (e.g., Python or Shell)
  • Experience operating observability at PB-scale ingestion and hundreds of millions of series
  • Expertise in observability platforms and tooling (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenMetrics)
  • Deep experience building systems of scale and operating cloud infrastructure with Kubernetes; strong proficiency with service mesh technologies (Istio/Envoy), infrastructure‑as‑code (Terraform) and experience in multi‑cloud (AWS, GCP)
  • Demonstrated ability to evolve storage and query architectures for cost, scale, and latency (e.g., TSDB, Parquet, distributed processing)
  • Proven experience integrating security as part of infrastructure and platform development
  • Exceptional cross‑functional communication; effective collaboration with both technical and non‑technical stakeholders
Job Responsibility
  • Architect and lead Roku’s observability platform across metrics, logs, and traces; evolve data pipelines and storage layers optimized for high throughput, performance, and cost at Roku scale (TSDBs, Parquet, distributed processing)
  • Extend and harden open‑source observability systems; overhaul core components (e.g., storage layers, query paths) to improve performance, reliability, and usability at scale
  • Implement features such as pre‑aggregation, down-sampling, and sampling to reduce load and accelerate queries across the platform
  • Collaborate across platform, SRE, and product teams to migrate hundreds of workloads to our common platform; augment and automate CI/CD flows and onboarding
  • Integrate security into infrastructure and platform services; ensure robust multi‑tenant, multi‑cluster, and multi‑cloud designs
  • Contribute improvements back to open source and CNCF‑aligned projects
What we offer
  • Global access to mental health and financial wellness support and resources
  • healthcare (medical, dental, and vision)
  • life, accident, disability, commuter, and retirement options (401(k)/pension)
  • time off in accordance with local leave policies
  • Fulltime