Senior Site Reliability Engineer Job at Heidi (Sydney, Melbourne)

Job Description

This role sits in the core Platform/SRE team that owns production. You’ll work directly on incident response, on-call, system reliability, and day-to-day operations for Heidi’s platform. We’re open to candidates who are strong mid-level SREs ready to take on more ownership, as well as senior SREs who enjoy being hands-on in operations. The role is intentionally ops-heavy and focused on keeping real systems healthy in production.

Job Responsibilities

Participate in on-call and incident response: Respond to production incidents, contribute to service restoration, and support clear communication during incidents. Over time, take increasing responsibility for leading incidents end-to-end.
Improve operational reliability: Identify recurring issues and reliability risks, and drive fixes through better alerting, automation, system changes, or process improvements
Own parts of the production environment: Operate and improve Kubernetes clusters, cloud infrastructure, and core platform services, with growing ownership as familiarity increases
Strengthen observability: Improve dashboards, alerts, logs, and traces so issues are detected earlier and diagnosed faster, with a strong focus on actionable signals
Reduce operational toil: Automate repetitive tasks, simplify runbooks, and improve tooling to make on-call and day-to-day operations easier and safer
Support safe change: Improve deployments, rollback mechanisms, and operational readiness to reduce the risk of incidents caused by change
Contribute to operational practices: Write and maintain runbooks, participate in blameless post-mortems, and help improve incident response processes over time
Collaborate closely with engineers: Work with product and feature teams to improve production readiness, service ownership, and reliability expectations

Requirements

3–6+ years in SRE, DevOps, Platform, or operations-heavy engineering roles
Experience supporting production systems and participating in on-call rotations
Comfortable debugging live systems under pressure
Experience operating cloud infrastructure (AWS preferred)
Working knowledge of Kubernetes and containerised workloads
Infrastructure as Code experience (Terraform or similar)
Familiarity with monitoring and alerting tools (Datadog, Prometheus, etc.)
Scripting or automation experience (Python, Bash, or similar)

Nice to have

Experience leading incidents or mentoring others during on-call
Experience in regulated or security-sensitive environments
Familiarity with databases, queues, and caches in production
Interest in reliability practices such as SLOs, error budgets, and capacity planning

What we offer

Equity from day one
Personal development budget
Work from anywhere for a month
Dedicated wellness days
Birthday off
Hybrid environment, with 3 days in the office
