There are NO limits to your career: come shape the future and be part of a truly unique global culture at OutSystems!
Job Responsibilities:
Define and execute the M&O strategic vision and roadmap within Platform Engineering
Lead and mentor a team of M&O engineers, fostering innovation and operational excellence
Treat the M&O platform as an internal product: actively engage with engineering 'customers' (R&D) to understand their needs, gather feedback, and define the platform's roadmap
Manage and optimize cloud infrastructure costs for M&O tools and services
Own the full lifecycle of the M&O platform itself, using Infrastructure as Code, CI/CD, and SRE principles to ensure the platform is reliable, scalable, and cost-effective
Act as the primary evangelist for observability, developing 'golden paths,' documentation, and training to help teams effectively monitor their own services
Partner with development teams throughout the product lifecycle to ensure resilient, performant systems
Drive the enablement of Service Level Objectives (SLOs) by providing the tools, templates, and training for teams to define and measure their own SLOs
Develop, manage, and promote a self-service, company-wide observability platform for use by all engineering teams
Analyze and report on company-wide reliability trends (such as aggregate MTTR and SLO compliance) to measure the effectiveness and adoption of the observability platform
Automate operational tasks, with a focus on fast incident detection & recovery
Foster continuous improvement and knowledge sharing
Communicate system reliability and performance updates to stakeholders
Requirements:
STEM degree (BSc or MSc in Software Engineering, Computer Science, or a related field)
7+ years of experience in SRE, DevOps, or Software Engineering roles
Proven track record in building, scaling, and maintaining highly available, distributed systems
Strong understanding of incident management, SLAs/SLOs/SLIs, and service reliability metrics
Excellent communication, stakeholder management, and cross-functional leadership skills
Ability to foster a culture of automation, reliability, and continuous improvement
Deep, hands-on experience with the Prometheus ecosystem, Grafana, Fluent Bit, Elastic Stack, and OpenTelemetry
Strong, practical expertise in AWS
Deep knowledge of Kubernetes
Proficiency with Terraform (we use Spacelift)
Expertise with GitHub (including GitHub Actions)
Solid grasp of DNS, load balancing, TLS, Ingress, Service Mesh, IAM, and security best practices
Proven ability to design resilient, fault-tolerant systems and debug complex distributed systems
Nice to have:
Familiarity with other M&O tools (e.g., Datadog)
Experience with other cloud platforms (e.g., GCP, Azure)
Knowledge of other CI/CD tools (e.g., Jenkins, GitLab CI, ArgoCD)
Software development experience (e.g., Go, Python)