As a Site Reliability Engineer (SRE), you will be a key player in ensuring our production systems are highly available, scalable, and performant. You will bridge the gap between development and operations, applying a software engineering mindset to system administration topics. You'll be responsible for building and maintaining large-scale, fault-tolerant distributed systems, with a strong focus on automation, operational excellence, and reliability under real-time, high-throughput constraints. The ideal candidate has a strong background in software engineering and systems administration, with a passion for solving operational problems with code.
Job Responsibilities:
Design, build, and maintain highly available, scalable infrastructure for distributed and stateful workloads, supporting real-time data ingestion, AI inference pipelines, and hybrid cloud/edge deployments
Automate repetitive manual tasks, infrastructure provisioning, and operational workflows to reduce toil and improve system efficiency
Implement and manage robust monitoring, logging, and alerting solutions to proactively detect and address issues
Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs)
Participate in an on-call rotation to respond to production incidents
Lead blameless post-mortem analyses for incidents in complex distributed systems, identifying root causes and systemic weaknesses and implementing long-term preventative measures
Manage and provision cloud and on-premise infrastructure using IaC principles and tools like Terraform and Ansible
Conduct performance analysis, system tuning, and capacity planning to ensure our services meet performance and cost-efficiency goals
Develop, test, and maintain disaster recovery plans and business continuity strategies to ensure service resilience
Work closely with software development teams to consult on system design, platform choices, and reliability best practices for new features and services
Create and maintain comprehensive documentation for system architecture, runbooks, and operational procedures
Requirements:
Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field
3+ years of experience in Site Reliability Engineering, DevOps, or a related software/systems engineering role
Proficiency in one or more programming languages such as Python, Go, or Bash for automation and tooling
Deep understanding of Linux/Unix operating systems and networking fundamentals (TCP/IP, DNS, HTTP, load balancing)
Experience with cloud platforms such as AWS, Azure, or Google Cloud, preferably Google Cloud
Strong knowledge of CI/CD tools like Jenkins, GitLab CI, or CircleCI
Strong hands-on experience operating Kubernetes in production, including troubleshooting of networking, storage, scheduling, autoscaling, and stateful workloads
Experience with Infrastructure as Code (IaC) tools such as Terraform and Ansible
Understanding of version control systems (e.g., Git) and of CI/CD principles and tools (e.g., GitLab CI, Jenkins)
Knowledge of monitoring, logging and tracing tools (e.g., Prometheus, Grafana, ELK stack)
Strong analytical and problem-solving skills, with an ability to diagnose and resolve complex issues in distributed systems
Excellent verbal and written communication skills, with the ability to effectively collaborate with technical and non-technical stakeholders
High attention to detail and a commitment to ensuring the accuracy and quality of work
Ability to thrive in a fast-paced, dynamic environment and manage multiple projects simultaneously
What we offer:
An excellent work environment and an opportunity to create a real impact in the world
A truly high-tech, state-of-the-art engineering company with a flat structure and no politics
Work with the very latest technologies in Data & AI, including Edge AI and swarming, both within our software platforms and within our embedded on-board systems
Flexible work arrangements
Professional development opportunities
Collaborative and inclusive work environment
Salary commensurate with proven experience