Monitoring and Observability (M&O) Manager Job at OutSystems

Job Description

There are NO limits to your career: come shape the future and be part of a truly unique global culture at OutSystems!

Job Responsibilities

Define and execute the M&O strategic vision and roadmap as part of Platform Engineering
Lead and mentor a team of M&O engineers, fostering innovation and operational excellence
Treat the M&O platform as an internal product: actively engage with engineering 'customers' (R&D) to understand their needs, gather feedback, and define the platform's roadmap
Manage and optimize cloud infrastructure costs for M&O tools and services
Own the full lifecycle of the M&O platform itself, using Infrastructure as Code, CI/CD, and SRE principles to ensure the platform is reliable, scalable, and cost-effective
Act as the primary evangelist for observability, developing 'golden paths,' documentation, and training to help teams effectively monitor their own services
Partner with development teams throughout the product lifecycle to ensure resilient, performant systems
Drive the enablement of Service Level Objectives (SLOs) by providing the tools, templates, and training for teams to define and measure their own SLOs
Develop, manage, and promote a self-service, company-wide observability platform for use by all engineering teams
Oversee incident response, ensuring quick resolution, minimal downtime, and effective RCA/post-mortems
Analyze and report on company-wide reliability trends (such as aggregate MTTR and SLO compliance) to measure the effectiveness and adoption of the observability platform
Automate operational tasks, with a focus on fast incident detection & recovery
Foster continuous improvement and knowledge sharing
Communicate system reliability and performance updates to stakeholders

Requirements

STEM degree (BSc or MSc in Software Engineering, Computer Science, or a related field)
7+ years of experience in SRE, DevOps, or Software Engineering roles
Proven track record in building, scaling, and maintaining highly available, distributed systems
Strong understanding of incident management, SLAs/SLOs/SLIs, and service reliability metrics
Excellent communication, stakeholder management, and cross-functional leadership skills
Ability to foster a culture of automation, reliability, and continuous improvement
Deep, hands-on experience with the Prometheus ecosystem, Grafana, FluentBit, Elastic Stack, and OpenTelemetry
Strong, practical expertise in AWS
Deep knowledge of Kubernetes
Proficiency with Terraform (we use Spacelift)
Expertise with GitHub (including GitHub Actions)
Solid grasp of DNS, load balancing, TLS, Ingress, Service Mesh, IAM, and security best practices
Proven ability to design resilient, fault-tolerant systems and debug complex distributed systems

Nice to have

Familiarity with other M&O tools (e.g., Datadog)
Experience with other cloud platforms (e.g., GCP, Azure)
Knowledge of other CI/CD tools (e.g., Jenkins, GitLab CI, ArgoCD)
Software development experience (e.g., GoLang, Python)
Familiarity with relational and NoSQL databases
