Senior Site Reliability Engineer Manager Job at Remotestar (London)

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...

Location

United States, San Francisco

Salary:

230000.00 - 345000.00 USD / Year

Lambda

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
Strong understanding of Linux-based systems in a distributed environment
Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Passion for continuous improvement and innovation

Job Responsibility

Define Fleet Health metrics and indicators to objectively measure and improve system availability
Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
Create runbooks and automated remediations for common failure scenarios
Build in automation and auditing to ensure compliance and improve efficiency and productivity
Participate in on-call rotations and provide support for incident response and resolution
Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc.

What we offer

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible paid time off plan

Fulltime

Digital Site Reliability Senior Engineer

We are seeking a Senior Site Reliability Engineer to drive reliability, performa...

Location

United States, Memphis

Salary:

87120.00 - 151250.00 USD / Year

NTT DATA

Expiration Date

Until further notice

Requirements

8+ years in DevOps/SRE roles with strong experience in CI/CD, containerization, and Kubernetes deployments
5+ years troubleshooting with APM tools (Datadog, Dynatrace)
5+ years log analysis using Splunk
5+ years working with Nexus Repository
5+ years diagnosing and resolving complex technical issues
5+ years demonstrating advanced expertise in core technical areas and commitment to operational excellence
5+ years experience with ServiceNow and Jira for change management
4+ years Unix/Linux shell scripting and supporting NoSQL databases (e.g., Couchbase)
3+ years working with static code analysis tools (Checkmarx, SonarQube)
3+ years applying OWASP security principles

Job Responsibility

Implement, maintain, and optimize CI/CD pipelines and tooling (GitLab preferred; Bamboo/Jenkins experience valuable)
Support and enhance platform architecture, including EKS, Docker, and containerized application environments
Monitor production and non-production systems using Datadog, Splunk, and related APM tools; troubleshoot and resolve complex issues
Collaborate with product and engineering teams to scale applications and infrastructure while ensuring performance and availability
Define and refine SLIs/SLOs to measure service health and drive reliability improvements
Build automation and tooling using Groovy, Shell, Python, Terraform, Java, or JavaScript
Contribute to SRE best practices and help mature SRE capabilities across the Digital organization
Enhance CI/CD pipelines with automated testing, quality gates, and improved observability

What we offer

Medical, dental, and vision insurance with an employer contribution
Flexible spending or health savings account
Life and AD&D insurance
Short- and long-term disability coverage
Paid time off
Employee assistance
Participation in a 401k program with company match
Additional voluntary or legally required benefits

Fulltime

Senior Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to join our growing ...

Location

Italy, Milan

Salary:

50000.00 - 70000.00 EUR / Year

iGenius

Expiration Date

Until further notice

Requirements

Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field
At least 6 years of experience as a Site Reliability Engineer or in similar roles
Strong experience with observability and monitoring systems such as Prometheus, Thanos, Grafana, and OpenTelemetry
Experience with low-level system instrumentation and performance visibility using technologies such as eBPF
Experience with security monitoring and threat detection tools such as Zeek, Wazuh, or equivalent SIEM / security observability platforms
Strong experience with containerized and cloud-native environments, particularly Kubernetes
Strong software development skills, particularly in Python, with the ability to build automation, integrations, and custom tooling
Experience integrating heterogeneous infrastructure systems across multiple vendors, APIs, and evolving tool ecosystems
Familiarity with modern infrastructure automation and emerging agent-based frameworks such as MCP / A2A (or equivalent technologies)
Exposure to digital twin technologies and simulation platforms such as NVIDIA Omniverse or equivalent

Job Responsibility

Design and implement observability and control mechanisms that extract operational data from infrastructure and feed it into automated systems to enable continuous optimization, including key system budgets such as power, cooling, and service-level and security-level objectives
Actively guard and maintain these operational budgets as part of day-to-day system reliability and performance management
Contribute to operational excellence through blameless post-mortem analysis and structured incident learning, ensuring continuous improvement of system behavior and resilience
Work closely with Platform Engineering in a shared cybersecurity model, where SRE focuses on detection and monitoring, while Platform Engineering ensures the secure design and operation of the underlying infrastructure

What we offer

Learning Friday
Training budget for books, online courses or other training materials
Smart Working (remote work opportunities)
Opportunity to receive company equity
Stock options

Fulltime

Senior Site Reliability Engineer

Our client, a leader in the HCM space is in need of a Senior Site Reliability En...

Location

United States, Reston

Salary:

67.50 - 97.50 USD / Hour

ClearBridge Technology Group

Expiration Date

Until further notice

Requirements

5+ years of experience supporting large-scale cloud infrastructure, automation, and DevOps, preferably in an AWS environment
Ability to build, maintain, and consume CI/CD pipelines and tools
Proficient with Terraform to automate critical infrastructure
Experience supporting Kubernetes-based platforms to ensure high availability
Active TS/SCI with CI Poly

Job Responsibility

Ensuring the Kubernetes-based platform is maintained and healthy, with high availability, scalability, and security
Automating infrastructure provisioning, configuration management, and application deployments using Terraform and Argo CD
Handling troubleshooting and documentation associated with the platform
Collaborating with multiple cross-functional teams
Building, maintaining, and consuming CI/CD pipelines

Fulltime

Senior Site Reliability Engineer

The Senior Site Reliability Engineer establishes and maintains the infrastructur...

Location

United Kingdom; United States; Canada

Salary:

Not provided

Mozilla

Expiration Date

Until further notice

Requirements

7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
Excellent async written communication skills; comfortable working with a geographically distributed team
Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes

Job Responsibility

Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
Participate in on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding
Contribute to runbooks, architecture documentation, and team processes

What we offer

Fully remote work & schedule flexibility
Company-provided laptop
Annual bonus program
Monthly remote work stipend
Annual professional development stipend
Industry conferences
Company all-hands and team gatherings
24 days PTO per year (prorated)
Birthday
Year-end company shutdown

Fulltime

Senior Site Reliability Engineer - Edge Computing

We are seeking a Senior Site Reliability Engineer to be part of the Edge Computi...

Location

Switzerland, Lausanne

Salary:

140000.00 - 155000.00 CHF / Year

Genius Sports

Expiration Date

Until further notice

Requirements

Swiss/EU/EFTA citizen or residency permit in Switzerland
5+ years experience in SRE with Linux
Strong understanding of the entire Linux server stack: OS boot and installation, systemd, networking, container deployment, logging, metrics and monitoring, out-of-band management, etc.
Strong experience designing robust automation processes for a large inventory of on-premises servers
Strong experience contributing to a large automation code base using Ansible or a similar platform
Proficient with Python programming and Bash scripting
Ability to communicate efficiently and articulate concepts based on the audience

Job Responsibility

Design and code end-to-end processes enabling untrained staff to autonomously prepare, install and monitor all Linux servers and networking devices installed in 300+ sport venues
Design and code end-to-end processes enabling developers to autonomously deploy and monitor our player tracking and augmentation applications
Take ownership of long-term technical efforts and articulate design choices to technical and non-technical people
Collaborate closely with teammates to solve problems, share knowledge and provide actionable feedback
Participate in an on-call rotation that emphasizes eliminating repeating escalations
Visit the Lausanne Jordils office 4 times per week, with flexible hours

What we offer

Become part of the world of elite sports, build systems that support data analytics and augmentation for the best leagues
Enjoy an innovative and dynamic environment that encourages self-development
Develop automation solutions that directly improve the daily work of dozens of operational staff and developers
Competitive salary and range of benefits
Committed to supporting employee wellbeing and helping you grow your skills, experience and career

Fulltime

Senior Site Reliability Engineer

The Business Operations team is seeking a highly motivated and experienced Senio...

Location

Norway, Oslo

Salary:

Not provided

Mastercard

Expiration Date

Until further notice

Requirements

Observability
Programming and Scripting
Systems and Network Administration
Cloud Computing and Infrastructure
Reliability and Scalability
DevOps Practices
Troubleshooting
Capacity Planning and Performance Optimization
IT Service Management
Proactive Monitoring and Improvement (SRE Applications)

Job Responsibility

Independently execute key elements of projects/processes within the Site Reliability Engineering area by applying in-depth knowledge of their discipline and area best practices to effectively resolve problems and roadblocks as they occur
Assist in evaluating operational requirements and developing technical solutions within existing frameworks
Support automation and scripting efforts to improve operational workflows and incident response processes
Troubleshoot and resolve routine and some complex system issues, escalating when necessary to maintain system health
Contribute to documentation, knowledge sharing, and best practices to enhance team operational procedures
Collaborate with development teams and stakeholders to ensure reliability solutions align with technical and business needs
Participate in reviews and quality assurance activities to uphold system stability standards
May contribute to solution development for new products/services and/or manage smaller projects/initiatives as an experienced individual contributor with specialized knowledge within the Site Reliability Engineering area

Fulltime

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...

Location

United States

Salary:

116633.00 - 181243.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity in progressive delivery approaches such as canary and blue-green deployments
Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements

Job Responsibility

Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
Partner with engineering team members to embed reliability best practices early in the development lifecycle
Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
Reduce operational toil by identifying repetitive work and implementing automation-first solutions

Fulltime
