Senior Site Reliability Engineer

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...

Location

United States

Salary:

113082.00 - 175725.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

6+ years of experience in an SRE/Operations/DevOps role as part of a team
Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
Experience designing and managing infrastructure security for large fleets of diverse services
Experience with technical response during security incidents
Experience with package management on Linux systems (we use Debian)
Strong Linux system-level troubleshooting skills
History of automating tasks and processes, identifying process gaps, and finding automation opportunities
Strong English language skills (verbal and written) and ability to work both independently and as an effective part of a globally distributed team working across multiple time zones

Job Responsibility

Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure (deployment, maintenance, configuration, troubleshooting)
Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
Leading continuous improvement by automating the installation, configuration and maintenance of services on our platform
Working closely with product teams to help them bring scalable functionality to our users, assisting in the architectural design of new services and making them operate at scale
Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
Collaborating with a global, cross-functional team in an asynchronous communication environment
Mentoring peers in your areas of technical and operational strength
Ability and willingness to travel 1-2 times a year for in-person events and team meetings
Most importantly, sharing our values and working in accordance with them

Fulltime

It's fun to work in a company where people truly believe in what they're doing! ...

Location

India, Bengaluru

Salary:

Not provided

BlackLine

Expiration Date

Until further notice

Requirements

5–10+ years in SRE, DevOps, or systems engineering in production cloud environments
B.Tech/B.E. in Computer Science or a related field
Expertise in automation, observability & monitoring, CI/CD pipelines, and incident management
Experience with SRE principles (SLIs, SLOs, error budgets, postmortems, etc.)
Proficient in IaC tools like Terraform, Ansible, Chef
Experience working with HashiCorp tools (Consul, Vault, Nomad, Packer)
Strong cloud knowledge (GCP preferred, AWS/Azure a plus)
Experience with containerization and orchestration (Docker, Kubernetes, ArgoCD, etc.)
Advanced scripting and automation (Python, Go, PowerShell)
Familiarity with cloud cost monitoring and optimization techniques

Job Responsibility

Own performance, scalability, and operational excellence across critical services
Blend software engineering and systems engineering to build and run large-scale, fault-tolerant, distributed systems—focusing on performance, capacity, availability, and security
Own service reliability across the stack and collaborate closely with developers, architects, and infrastructure teams to ensure services are resilient by design and self-healing by default
Automate operational tasks to reduce toil and increase team velocity
Lead timely and reliable deployments, with emphasis on progressive delivery techniques (canary, blue/green, feature flags, zero-outage releases, etc.)
Partner in blameless postmortems and ensure incident reviews lead to systemic fixes
Automate secure lifecycle of certificates, secrets, and credentials
Build and maintain cloud-native security stacks and compliance guardrails
Execute infrastructure rotation and automated rehydration to maintain fleet hygiene
Create and manage highly reproducible environment provisioning via Infrastructure as Code

What we offer

A technology-based company with a sense of adventure and a vision for the future
A culture that is kind, open, and accepting
A culture where BlackLiners' continued growth and learning are empowered
BlackLine offers a wide variety of professional development seminars and inclusive affinity groups to celebrate and support our diversity

Senior Reliability Engineer - AV Labs

We are looking for a hardware focused Senior Reliability Engineer to focus on se...

Location

United States, Sunnyvale

Salary:

180000.00 - 200000.00 USD / Year

Uber

Expiration Date

Until further notice

Requirements

5+ years of relevant industry experience in software engineering, site reliability, or systems engineering
Experience with modern observability platforms (e.g., Prometheus, Grafana, ELK) in edge, IoT, or hardware-integrated environments
Coding skills in one or more of Go, Python, or C++, with experience building and operating production systems
Proficiency in Linux internals and shell scripting for triaging and debugging edge devices or hardware-adjacent systems
Ability to debug across services, containers (Docker), and networking stacks
Proven track record owning reliability, infrastructure, or platform systems for large-scale production workloads
Experience designing and operating observability systems (metrics, logging, alerting, and dashboards)
Experience defining and implementing SLIs and SLOs for system availability or data yield
Deep understanding of networking protocols (TCP/IP, gRPC, or MQTT) and data handling in bandwidth-constrained environments
Experience driving complex technical projects and architectural reviews across multiple teams from design through production

Job Responsibility

Architect Observability Systems: Design and scale an observability platform capable of ingesting and analyzing real-time health telemetry from thousands of distributed vehicle nodes
Build for Edge Constraints: Develop systems that remain performant despite hardware diversity, intermittent connectivity, and rapid fleet scaling
Define Criticality Models: Establish alerting strategies that distinguish transient anomalies from systemic issues impacting sensor uptime and data yield
Detect Complex Failure Modes: Design detection logic for 'silent' failures, such as sensor degradation, compute saturation, or recording pipeline stalls
Scale Through Automation: Design automated detection, triage, and mitigation mechanisms to eliminate manual intervention as the fleet grows
Partner on Mitigation: Collaborate with Operations and Engineering to build safe, automated responses to recurring hardware and software failure scenarios
Drive Operational Efficiency: Build technical interfaces to help Operations surface issues and Engineering diagnose and deploy mitigations rapidly (TTD/TTM)
Lead Technical Strategy: Drive reliability-focused design reviews and translate operational pain points into concrete technical requirements and roadmaps
Uncover Proactive Insights: Apply advanced data analytics to identify latent patterns in fleet telemetry, enabling the proactive detection of systemic regressions and hardware degradation before they impact operations

What we offer

Uber's bonus program
Equity award and other types of compensation
401(k) plan
Various benefits

Fulltime

Senior Software Engineer - Stateful Platform

Engineering at Uber means building for real-world impact under real-world constr...

Location

Denmark, Aarhus

Salary:

Not provided

Uber

Expiration Date

Until further notice

Requirements

5+ years of professional software development experience
Proficiency in at least one backend language (e.g., Go, Java, Python)
Experience with software engineering fundamentals, including testing methodologies and quality documentation
Technical depth in building or managing distributed systems at scale
Demonstrated ability to lead technical direction and help a team navigate through ambiguity and complex organizational changes

Job Responsibility

Design, build, and maintain the automation frameworks that deploy and run all database engines globally, ensuring high availability across on-prem and multiple cloud environments
Solve high-impact infrastructure problems where the solution isn't always obvious - such as automating kernel upgrades or storage cluster expansions across millions of containers
Build and own services written in Go, focusing on system reliability, resource forecasting, and intelligent placement to maximize utilization
Navigate the messiness of technical debt and shifting priorities, ensuring every code change is backed by rigorous testing while maintaining a bias for action
Unblock fleet-wide efficiency by optimizing scheduling, with the goal of eliminating manual on-call operations through automation
Lead through influence by mentoring peers and collaborating across global engineering sites (SF, Amsterdam, Seattle, Bangalore) to ship practical solutions at speed
Engage with the local community by participating in or leading tech meetups hosted at our Aarhus office to share knowledge and drive engineering excellence locally

What we offer

Monthly Uber Credits: Credits to use on Uber Rides and Uber Eats every month
Equity Compensation: Opportunity to be awarded stock options (RSUs) to ensure you own a piece of the mission you’re building
Culture & Socials: Frequent local social events and office clubs (chess, board games, running, crossfit, creative club and more)
Tech Community: We host regular local tech meetups to stay connected with the Aarhus engineering scene, and give our engineers the opportunity to sometimes share what they’re working on with the community
Well-being & Fertility: Global support programs for mental health, wellness, and family planning/fertility
Parental Leave: Generous, gender-neutral parental leave to support your life outside of work
Modern Aarhus Hub: Work in a center of technical excellence featuring catered lunches and top-tier collaboration spaces

Fulltime

Senior Manager, Hybrid Services & Reliability (SRE)

As the Senior Engineering Manager for Hybrid Services & Reliability (HSR) within...

Location

United States, Austin, Texas; Sunnyvale, California

Salary:

201600.00 - 302000.00 USD / Year

General Motors

Expiration Date

Until further notice

Requirements

Extensive background in Site Reliability Engineering (SRE) and defining SLO/SLI frameworks for hybrid cloud environments
Technical proficiency in managing on-prem Linux utilities (DHCP/PXE/NTP) and core development services
Opinionated view on automated observability, incident response, and MTTR reduction
Proven leadership experience

Job Responsibility

Reliability Engineering: Define, measure, and enforce strict SLOs/SLIs for critical hybrid cloud services, including network connectivity and compute readiness
Foundational Utilities: Own and manage core on-prem utilities, such as DHCP, PXE, and CDN, to ensure seamless server auto-provisioning across the global fleet
Environment Integrity: Manage the entire data flow path, from initial ingestion at the test bench through the secure cloud network into production staging
HIL Readiness: Guarantee the 99%+ availability and stability of remote CI-based Hardware-in-the-Loop (HIL) benches required for AV safety validation
Organization Growth: Actively lead the recruitment and technical mentorship of Senior and Staff ICs as part of the team's expansion

What we offer

Medical, dental, vision, Health Savings Account, Flexible Spending Accounts, retirement savings plan, sickness and accident benefits, life insurance, paid vacation and holidays, tuition assistance programs, employee assistance program, GM vehicle discounts
Relocation benefits

Fulltime

Principal Software Engineering Manager

M365 Copilot Inference is a high-impact engineering team advancing applied AI an...

Location

United States, Redmond

Salary:

142800.00 - 274800.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Job Responsibility

Lead and grow a team of software engineers building control plane services and automations across the capacity buildout area
Drive technical design and execution for capacity automation — intake, planning, deployment, fleet health, and control plane components — prioritizing the highest-impact work for Copilot capacity
Replace manual, ticket-driven capacity workflows with automated, data-driven systems; reduce time from capacity request to production traffic for priority workloads
Own live-site, reliability, and operational excellence for the services your team builds; establish SLAs, metrics, and on-call practices
Partner with peer engineering managers on adjacent capacity areas, and with partner teams across M365 Core, AI Core, Azure, and Microsoft Research to align on dependencies and unblock execution
Coach and grow senior and mid-level engineers; raise the engineering bar; recruit strong platform talent into the team

Fulltime

Principal Group Software Engineering Manager

M365 Copilot inference is a high-impact engineering team advancing applied AI an...

Location

United States, Redmond

Salary:

165600.00 - 296400.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Job Responsibility

Build and lead a high-performing organization of engineering managers and senior engineers across capacity buildouts/automation, capacity planning, and the control plane.
Set the strategy and roadmap for Copilot capacity management and the control plane.
Drive execution across existing teams today, with a clear plan to grow the org as control plane scope expands.
Partner deeply with Copilot, AI Core, and Azure to align demand, supply, and COGS for Copilot workloads.
Own live-site, reliability, and operational excellence for the capacity surface area.
Establish metrics and SLAs for intake latency, fleet utilization, automation coverage, and time-to-deploy; use them to guide investment decisions.
Coach and grow managers and senior ICs; raise the engineering bar; recruit experienced platform leaders into the team.

Fulltime

Senior Network Technician

As Senior Network Technician, you would help support the rollout of GeniusIQ, ou...

Location

United Kingdom, London

Salary:

Not provided

Genius Sports

Expiration Date

Until further notice

Requirements

5 years’ experience with system and network administration on infrastructure with 100+ Linux servers
Strong understanding of the entire Linux server stack: OS boot and installation, system, networking, container deployment, logging, metrics & monitoring, out-of-band management, etc.
Strong understanding of OSI network layers 2-3-4 and network configuration: switching, VLANs, routing, firewall rules, ARP, DHCP, DNS, TCP, switch command-line, etc.
Proficiency in Bash scripting
Ability to communicate efficiently and articulate concepts based on the audience, including remote hands, engineering and customers

Job Responsibility

Supervise IT issue tracking and resolution for a large fleet of bare-metal Linux servers and network equipment in hundreds of sport venues in Europe
Assist venue operations coordinators with preparation of equipment and installation, based on automation processes developed by site reliability engineers
Communicate kindly with external venue IT and management staff
Partner with software engineers to eliminate common issues

Fulltime
