Principal Site Reliability Engineer Job at OutSystems

Job Description

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goal of SRE is to create scalable and highly reliable systems. Our SREs ensure the reliability, performance, and scalability of our production systems while enabling rapid development and deployment of new features and services. SREs at OutSystems work closely with development teams, acting as an extension of each team, to adopt reliability tenets with the shared goal of meeting Service Level Objectives (SLOs) and thus delivering a smooth and frictionless Customer Experience.

Job Responsibility

Help define and execute the strategic vision and roadmap for the Site Reliability Engineering function
Provide leadership and mentorship to more junior SREs, fostering a culture of innovation, collaboration, and operational excellence
Collaborate with leadership and other stakeholders to ensure cross-functional alignment
Actively participate in and influence the design of highly reliable and scalable infrastructure, collaborating with development teams and leveraging cloud technologies and industry best practices
Collaborate with development teams at all stages of the product development lifecycle to ensure systems are resilient (observable, fault-tolerant, recoverable, scalable) and performant
Drive the adoption, definition, and improvement of Service Level Objectives (SLOs)
Implement monitoring, alerting, logging, and tracing solutions to detect and respond to incidents
Oversee incident response efforts, ensuring quick resolution, minimal downtime, and effective RCA/post-mortems
Automate every operational task, with a special focus on fast incident detection & recovery
Foster a culture of continuous improvement and knowledge sharing
Communicate effectively with stakeholders, providing updates on system reliability and performance
Champion reliability as a core product feature, not an afterthought.

Requirements

STEM degree (BSc or MSc in Software Engineering, Computer Science, or a related field)
8+ years of experience in Software Engineering or SRE, ideally within high-growth, cloud-native environments
Expertise in Observability: Proven ability to implement SLIs/SLOs and telemetry systems that provide actionable insights into complex distributed systems
Cloud Mastery: Deep architectural knowledge of AWS/GCP/Azure, specifically regarding networking, security, and cost-optimization
Strategic Impact: Demonstrated success in leading cross-functional initiatives that improved system reliability or developer velocity at an organizational scale
System Design & Architecture: Expertise in designing highly available, fault-tolerant distributed systems (Microservices, Event-driven architecture)
Development: Professional-level proficiency in Go, Python, or Rust, with the ability to contribute to core product codebases and build custom internal tooling
Cloud Ecosystems: Deep mastery of AWS, GCP, or Azure (specifically IAM, VPC networking, Transit Gateways, and cross-region redundancy)
Orchestration at Scale: Extensive experience managing Kubernetes (K8s) in production, including Custom Resource Definitions (CRDs), Service Mesh (Istio/Linkerd), and Admission Controllers
Infrastructure as Code (IaC): Advanced usage of Terraform, CloudFormation, or Spacelift, focusing on modularity, state management, and CI/CD integration for infrastructure.

What we offer

A company that is always growing, changing, and innovating
Real career opportunities
Colleagues who are as smart, hard-working, and driven as you
Disrupting the status quo is in our DNA
We ask “why” a lot
OutSystems nurtures an inclusive culture of diversity, where everyone feels empowered to be their authentic self and perform at their best.
