Beacon Biosignals is seeking a skilled Site Reliability Engineer to join our Platform team and help ensure the reliability, availability, and security of the cloud infrastructure that supports Beacon's large-scale machine learning on terabytes of biosignal data. In this role, you'll be responsible for building and maintaining critical systems, such as the Kubernetes clusters that power our data scientists' hefty distributed numerical workloads, and the observability infrastructure that makes it easy for users to monitor and trace their workloads and to identify bugs and resource utilization issues.
Job Responsibilities:
Design and implement infrastructure as code solutions that improve reliability, security, and maintainability of our cloud infrastructure
Lead and execute major infrastructure initiatives including cluster upgrades, security improvements, and architectural changes
Develop and maintain CI/CD pipelines that enable teams to deploy safely and efficiently
Improve observability across our systems through enhanced monitoring, logging, and alerting
Participate in an on-call rotation and lead incident response efforts when issues arise
Collaborate with development teams to improve application reliability and performance
Maintain and enhance our security posture through infrastructure hardening and automation
Create and maintain documentation for infrastructure, deployment processes, and incident response procedures
Requirements:
Strong experience with Kubernetes administration, including cluster management, security, and troubleshooting
Proven track record implementing infrastructure as code using Terraform or similar tools
Experience building and maintaining CI/CD pipelines, particularly with GitHub Actions, Azure DevOps, or ArgoCD
Solid understanding of container technologies and build processes, especially Docker
Strong cloud provider (e.g. AWS) knowledge including networking, security, and infrastructure services
Experience with incident response and on-call responsibilities in a production environment
Deep experience with Linux systems administration and debugging
Proficiency in at least one programming language (Python, Go, TypeScript, etc.)
Understanding of security and networking concepts, including OAuth2/OIDC, DNS, TLS, and TCP/UDP
Approximate experience: Bachelor's degree plus 5-8 years of professional experience in SRE, DevOps, or a similar field
Nice to have:
Experience with Azure
Familiarity with Windows Server environments