The Site Reliability Engineer is responsible for designing, developing, and maintaining software applications and solutions that meet business needs, and for ensuring the availability and performance of critical systems and applications. The role involves working closely with product managers, designers, and other engineers to create high-quality, scalable software solutions, as well as automating operations, monitoring system health, and responding to incidents to minimize downtime.
Job Responsibilities:
Design and implement systems and processes to improve the reliability, scalability, and performance of applications
Automate routine operational tasks, such as deployments, monitoring, and incident response, to improve efficiency and reduce human error
Develop and maintain monitoring tools and dashboards to track system health, performance, and availability
Respond to and resolve incidents promptly, conducting root cause analysis and implementing preventive measures
Provide ongoing maintenance and support for existing systems, ensuring that they are secure, efficient, and reliable
Work on integrating various software applications and platforms to ensure seamless operation across the organization
Implement and maintain security measures to protect systems from unauthorized access and other threats
Requirements:
Doctorate degree OR 6 to 10 years of Computer Science, IT or related field experience OR
Master’s degree and 7 to 10 years of Computer Science, IT or related field experience OR
Bachelor’s degree and 8 to 12 years of Computer Science, IT or related field experience
Working experience with cloud services on AWS (or Azure, GCP) and with containerization technologies (Docker, Kubernetes)
Strong programming skills in languages such as Python
Working experience with infrastructure as code (IaC) tools (Terraform, CloudFormation)
Working experience with monitoring and alerting tools (Prometheus, Grafana, etc.)
Working experience with DevOps/MLOps practices and CI/CD pipelines
Proficiency in automated testing tools and frameworks (e.g., Selenium, JUnit, pytest), incident management, root cause analysis of production issues, and system quality improvement
Nice to have:
Strong understanding of cloud platforms (e.g., AWS, GCP, Azure) and containerization technologies (e.g., Docker, Kubernetes)
Experience with monitoring and logging tools (e.g., Prometheus, Grafana, Splunk)
Experience with data processing tools like Hadoop, Spark, or similar
Experience with SAP integration technologies
AWS Developer certification (preferred)
Excellent analytical and troubleshooting skills
Strong verbal and written communication skills
Ability to work effectively with global, virtual teams
High degree of initiative and self-motivation
Ability to manage multiple priorities successfully
Team-oriented, with a focus on achieving team goals
Strong presentation and public speaking skills
What we offer:
Competitive and comprehensive Total Rewards Plans that are aligned with local industry standards