We’re looking for a Site Reliability Engineer (SRE) to join our Global SRE team at Resmed. In this role, you’ll blend software engineering and systems engineering to help ensure our large-scale, distributed digital products are reliable, scalable, and efficient. You’ll work closely with software, platform, and product teams to design, build, and operate systems that support Resmed’s customers worldwide.
Job Responsibilities:
Ensure the reliability, availability, and resiliency of Resmed’s digital products by designing and operating fault-tolerant systems
Partner with product and platform teams to define and improve service health using operational and customer-experience metrics
Design, implement, and maintain monitoring, alerting, logging, and tracing solutions that provide real-time visibility into system behavior and customer experience
Analyze system performance, scalability, and capacity, and drive optimizations to improve efficiency and stability in cloud environments
Build automation and tooling to support deployments, scaling, incident response, and operational workflows
Participate in an on-call rotation as part of a globally distributed team, lead incident response efforts, troubleshoot production issues, conduct postmortems, and drive continuous improvement initiatives
Collaborate with security and compliance partners to support secure, privacy-aware, and compliant operations
Work closely with engineering teams to improve developer experience, operational maturity, and overall customer experience
Requirements:
Experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
Experience operating Kubernetes-based production systems
Hands-on experience with AWS and infrastructure-as-code tools
Experience designing and supporting CI/CD pipelines and automated deployments
Proficiency in Python for automation, tooling, or backend services
Solid understanding of distributed systems and networking concepts
Experience with monitoring and observability platforms such as Datadog and CloudWatch