Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to infrastructure and operations problems. Its main goal is to create scalable and highly reliable systems. Our SREs ensure the reliability, performance, and scalability of our production systems while enabling rapid development and deployment of new features and services. SREs at OutSystems work closely with development teams, acting as an extension of each team, to adopt the reliability tenets with the shared goal of meeting Service Level Objectives (SLOs) and thus delivering a smooth and frictionless Customer Experience.
Job Responsibilities:
Lead and onboard services and teams to the reliability tenets
Establish and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs)
Design and implement scalable, reliable, and secure infrastructure, while ensuring cloud-native best practices
Collaborate with software development teams to ensure systems are resilient (observable, fault-tolerant, recoverable, scalable) and performant
Implement monitoring, alerting, logging, and tracing solutions to detect and respond to incidents
Lead incident response efforts, ensuring quick resolution and minimal downtime, and conduct RCA/post-mortems
Automate every operational task, with a special focus on fast incident detection & recovery
Foster a culture of continuous improvement and knowledge sharing
Communicate effectively with stakeholders, providing updates on system reliability and performance
Participate in on-call rotation to provide 24/7 support for production systems
Requirements:
STEM degree (BSc or MSc in Software Engineering, Computer Science, or a related field)
5+ years of experience in software development and/or operations
Proficiency in at least one high-level programming language (C++, Python, Java, C#, etc.)
Proficiency in monitoring and troubleshooting complex distributed systems
Experience with containerization technologies and orchestration platforms, mainly Kubernetes (CKA, CKAD, and CKS certifications are valued)
Experience with automation and Infrastructure as Code (IaC) tools such as AWS CloudFormation, Terraform, Puppet, Chef, or Spacelift
Experience with Python, Go, Bash/Shell scripting, or other automation tools/languages
Familiarity with AWS services such as EC2, RDS, ELB, CloudFront, and Lambda
Experience with monitoring and observability tools such as Grafana, the ELK stack, or Prometheus
Strong understanding of designing resilient and fault-tolerant systems