Senior Site Reliability Engineer

United States, San Francisco · 129600.00 - 232200.00 USD / Year · Job Posted February 17, 2026

Job Description

We're looking for a Senior Site Reliability Engineer for our Currents team, responsible for building, maintaining, and evolving Currents, our data export system at scale. The Currents system is a robust Kafka-based event pipeline handling tens of billions of messages daily that our customers leverage to analyze user behavior in near real-time. You’ll be a key engineer on a highly collaborative and skilled team, responsible for bringing projects from concept to production and improving our existing high-scale systems. You will be leveraging your experience, your skills, and a strong sense of teamwork to tackle the significant engineering challenges of running a critical data streaming system. As a Senior Site Reliability Engineer, you will specifically focus on the observability, scalability, and reliability strategy aspects of every project.

Job Responsibility

  • Solve live performance and reliability issues and prevent their recurrence
  • Write and review code, educating engineers and building a culture of reliability
  • Practice sustainable incident response and blameless postmortems
  • Define and enable standards for monitoring, reliability, and performance
  • Bridge the gap between infrastructure and platform engineering teams
  • Support and improve services by planning for scale and reliability
  • Guide junior engineers in SRE best practices, software engineering, and agile project leadership

Requirements

  • Bachelor’s in Computer Science, Software Engineering, or a related STEM field
  • Five (5) years of experience in any role/occupation/position involving software engineering or site reliability engineering
  • Experience using distributed systems such as Kubernetes or Docker Swarm to deploy and monitor live applications
  • Experience working with alerting software (Sentry, Datadog, and/or PagerDuty)
  • Experience utilizing programming languages (Java, Kotlin, and/or Ruby) to understand and contribute to the codebase
  • Experience storing data in relational and non-relational databases such as Postgres and MongoDB
  • Experience building data pipelines with data streaming or queuing systems such as Kafka, Sidekiq, or SQS and SNS
  • Experience leveraging continuous integration tools such as Jenkins or Buildkite
  • Experience collaborating with engineers through pull requests and code reviews in version control software such as GitHub or GitLab

What we offer

  • Competitive compensation that may include equity
  • Retirement and Employee Stock Purchase Plans
  • Flexible paid time off
  • Comprehensive benefit plans covering medical, dental, vision, life, and disability
  • Family services that include fertility benefits and equal paid parental leave
  • Professional development supported by formal career pathing, learning platforms, and a yearly learning stipend
  • A curated in-office employee experience, designed to foster community, team connections, and innovation
  • Opportunities to give back to your community, including an annual company-wide Volunteer Week and donation matching
  • Employee Resource Groups that provide supportive communities within Braze

Similar Jobs for Senior Site Reliability Engineer

8 matching positions

Senior Site Reliability Engineer

We are seeking a Senior Site Reliability Engineer with deep expertise in Kuberne...
Location: Denmark, Copenhagen
Salary: Not provided
Company: Keepit
Expiration Date: Until further notice
Requirements
  • 5+ years in a Site Reliability, Platform, or DevOps Engineering role
  • Hands-on Kubernetes experience, including storage (Rook-Ceph or equivalent)
  • Solid Linux fundamentals
  • Proactive mindset
  • Clear communicator
Job Responsibility
  • Participate in the daily operation of our existing stack
  • Evolve and take part in designing our next generation infrastructure setup
  • Define and enforce reliability standards, runbooks, and operational best practices across the platform
  • Collaborate with Development and Operations teams to identify and resolve bottlenecks before they become incidents
  • Champion automation: if something is done twice, it should be scripted the third time
What we offer
  • Competitive salary
  • Pension scheme
  • A modern, energetic global work environment
  • Flexible work-life balance supported by a hybrid working model
  • Regular team-building activities
  • Opportunities for professional development and career advancement
  • Compensation is based on experience and skill set
Employment type: Full-time

Senior Site Reliability Engineer

The Business Operations team is seeking a highly motivated and experienced Senio...
Location: Norway, Oslo
Salary: Not provided
Company: Mastercard
Expiration Date: Until further notice
Requirements
  • Observability
  • Programming and Scripting
  • Systems and Network Administration
  • Cloud Computing and Infrastructure
  • Reliability and Scalability
  • DevOps Practices
  • Troubleshooting
  • Capacity Planning and Performance Optimization
  • IT Service Management
  • Proactive Monitoring and Improvement (SRE Applications)
Job Responsibility
  • Independently execute key elements of projects/processes within the Site Reliability Engineering area by applying in-depth knowledge of their discipline and area best practices to effectively resolve problems and roadblocks as they occur
  • Assist in evaluating operational requirements and developing technical solutions within existing frameworks
  • Support automation and scripting efforts to improve operational workflows and incident response processes
  • Troubleshoot and resolve routine and some complex system issues, escalating when necessary to maintain system health
  • Contribute to documentation, knowledge sharing, and best practices to enhance team operational procedures
  • Collaborate with development teams and stakeholders to ensure reliability solutions align with technical and business needs
  • Participate in reviews and quality assurance activities to uphold system stability standards
  • May contribute to solution development for new products/services and/or manage smaller projects/initiatives as an experienced individual contributor with specialized knowledge within the Site Reliability Engineering area
Employment type: Full-time

Senior Site Reliability Engineer

The Senior Site Reliability Engineer establishes and maintains the infrastructur...
Location: United Kingdom; United States; Canada
Salary: Not provided
Company: Mozilla
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
  • Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
  • Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
  • Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
  • Excellent async written communication skills; comfortable working with a geographically distributed team
  • Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
  • Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes
Job Responsibility
  • Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
  • Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
  • Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
  • Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
  • Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
  • Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
  • Participate in on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding
  • Contribute to runbooks, architecture documentation, and team processes
What we offer
  • Fully remote work & schedule flexibility
  • Company-provided laptop
  • Annual bonus program
  • Monthly remote work stipend
  • Annual professional development stipend
  • Industry conferences
  • Company all-hands and team gatherings
  • 24 days PTO per year (prorated)
  • Birthday
  • Year-end company shutdown
Employment type: Full-time

Senior Site Reliability Engineer

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...
Location: United States
Salary: 113082.00 - 175725.00 USD / Year
Company: Wikimedia Foundation
Expiration Date: Until further notice
Requirements
  • 6+ years experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting language used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience with distributed caching systems, including their underlying algorithms and how to optimize their performance
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
  • Experience leading and participating in incident response and post-incident review rituals, with the goal of conducting root cause analysis and implementing preventive measures
Job Responsibility
  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Working closely with product teams helping them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure.
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
Employment type: Full-time

Senior Site Reliability Engineer

At bsport, the Senior Site Reliability Engineer is a role for someone who doesn’...
Location: Spain; France, Barcelona; Paris
Salary: Not provided
Company: Bsport
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in SRE, Platform Engineering, Infrastructure or Backend Engineering
  • Strong experience with cloud infrastructure (AWS preferred), Kubernetes and CI/CD
  • Experience building or maintaining high-availability, scalable systems
  • Solid Python experience (bonus points for Django)
  • Experience working with SQL databases, ideally PostgreSQL
  • A proactive mindset: you enjoy taking ownership and solving complex technical challenges
  • Strong communication skills and fluency in English
Job Responsibility
  • Scale infrastructure and design resilient systems supporting international growth
  • Improve deployment speed, CI/CD pipelines and developer experience
  • Shape platform architecture through modularisation and scalable deployment strategies
  • Enhance observability, reliability and incident response capabilities
  • Influence engineering practices and collaborate across teams to improve how we build and ship
What we offer
  • Competitive salary packages based on your experience and role
  • Hybrid model with 3 days in the office per week
  • Work from anywhere: up to 15 days of remote work from abroad each year
  • Exclusive fitness perks: discounted access to Wellhub for Spain and HelloCSE membership for France
  • Private health insurance and flexible remuneration for Spain
  • Diverse, fun-loving team: multicultural colleagues, after-work events, team-building & more
Employment type: Full-time

Senior Site Reliability Engineer

Embark on a transformative journey as a Senior Site Reliability Engineer - AVP. ...
Location: United States, Whippany
Salary: 120000.00 - 175000.00 USD / Year
Company: Barclays
Expiration Date: Until further notice
Requirements
  • Considerable programming expertise in languages such as Python, Java, and others
  • Practical experience with Infrastructure as Code (IaC) tools, including Ansible, Chef, and Terraform
  • Validated experience with observability and monitoring platforms such as Observe, Elastic, InfluxDB, and Grafana
  • Solid understanding of containerization technologies and Unix/Linux environments
  • Demonstrates a Site Reliability Engineering (SRE) mindset, with good analytical skills, ownership, and a forward-thinking approach to problem-solving
Job Responsibility
  • Build and maintain infrastructure platforms and products that support applications and data systems
  • Ensure the reliability, availability, and scalability of the systems, platforms, and technology
  • Development, delivery, and maintenance of high-quality infrastructure solutions
  • Monitoring of IT infrastructure and system performance to measure, identify, address, and resolve any potential issues, vulnerabilities, or outages
  • Development and implementation of automated tasks and processes to improve efficiency and reduce manual intervention
  • Implementation of a secure configuration and measures to protect infrastructure against cyber-attacks, vulnerabilities, and other security threats
  • Cross-functional collaboration with product managers, architects, and other engineers to define IT Infrastructure requirements
  • Stay informed of industry technology trends and innovations
What we offer
  • Medical, dental, and vision coverage
  • 401(k)
  • Life insurance
  • Other paid leave for qualifying circumstances
Employment type: Full-time

Senior Site Reliability Engineer

Are you interested in working on cutting-edge cloud security products? Would you ...
Location: United States, Redmond
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Candidates must have an active TS and be willing and eligible to upgrade to TS/SCI (with polygraph) or have an active TS/SCI and be willing and eligible to upgrade to TS/SCI (with polygraph). This role will require candidates to maintain the TS/SCI (with polygraph) clearance
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required pre-offer and post-hire for this role
  • Failure to maintain or obtain the appropriate clearance and/or customer screening requirements may result in employment action up to and including termination
  • This position requires successful verification of the stated security clearance to meet federal government customer requirements
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Ensure 24x7 Service Reliability: Act as a Designated Responsible Individual (DRI) in an on-call rotation, leading incident response and resolution to maintain uptime and performance for Microsoft's most critical services
  • Support and Automate Deployments: Execute and improve manual operations and deployments for our products, while designing automation to scale and streamline those processes across environments
  • Build Scalable Systems: Develop automation for monitoring, alerting, debugging, and deployment to reduce manual effort and accelerate safe, reliable delivery
  • Drive Compliance and Security: Ensure systems meet Microsoft's standards for security, privacy, and accessibility, especially when onboarding new technologies
  • Lead Post-Incident Learning: Conduct postmortems, share insights, and implement solutions that prevent recurrence—fostering a culture of learning and continuous improvement
  • Collaborate Across Teams: Partner with engineering and product teams to align reliability goals with customer needs and deliver seamless user experiences
  • Stay Ahead Technically: Continuously invest in your technical growth to improve system availability, observability, and performance at scale
  • Embody our company's Culture and Values
Employment type: Full-time

Senior Site Reliability Engineer

Do you want to be at the heart of cloud computing? The Compute team is at the co...
Location: United States, Reston
Salary: 119800.00 - 234700.00 USD / Year
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Active U.S. Government Top Secret Clearance with access to Sensitive Compartmented Information (SCI) based on a Single Scope Background Investigation (SSBI) with Polygraph
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Verification of U.S. citizenship due to citizenship-based legal restrictions
Job Responsibility
  • Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate
  • Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of service fabric services while also driving consistency in monitoring and operations at scale
  • Drives development of design documents for a product, application, service, or platform
  • Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance and maintainability, effectiveness, and return on investment (ROI)
  • Leverages subject-matter expertise of product features and partners with appropriate stakeholders to drive a workgroup's project plans, release plans, and work items
  • Take full ownership of assigned services, actively contributing to their enhancement across all cloud environments
  • Identify opportunities for automation and optimization within the cloud to better support customers
What we offer
  • Certain roles may be eligible for benefits and other compensation
Employment type: Full-time