We are looking for an intelligent, resourceful, and highly skilled Senior Site Reliability Engineer (SRE) to join our Platform Site Reliability Engineering (PSRE) team. This team plays a critical role in ensuring the stability, reliability, and availability of mission-critical production applications on the Arcesium platform.
The PSRE team is responsible for:
Observability, monitoring, logging, and tracing to proactively detect and prevent issues
Building tools and infrastructure that enhance system stability and resilience
Troubleshooting live production issues with a deep focus on rapid incident resolution
Governing, declaring, managing, and recovering from platform-wide incidents to minimize downtime and business impact
As an SRE in this high-impact team, you will work under tight timelines in a high-pressure environment, where every second counts in resolving critical production incidents. This means you must be quick-thinking, highly analytical, and proactive in preventing and resolving disruptions.
Job Responsibilities:
Lead reliability engineering projects and drive them to closure
Write code and perform code reviews for best practices and code quality
Contribute to the design/architecture of the system
Automate processes and find opportunities to improve observability and availability of the Platform and reduce toil
Supervise a team of SREs, ensuring production applications are stable, reliable, and well documented
Own end-to-end availability and performance of mission-critical services
Analyze and debug complex issues across tiers from frontend to mid-tier to infrastructure
Practice sustainable incident response and blameless postmortems
Requirements:
5 to 8 years of experience handling systems in large-scale production environments
A self-starter, able to build, drive, and advocate for SRE solutions
Effective cross-functional collaboration skills to develop tools for secure, scalable, and reliable systems
Solid understanding of SRE concepts like SLAs, SLOs, SLIs, error budgets, MTTR, MTTD, etc.
Experience with a variety of tools that help manage, understand, and debug large, complex distributed systems
Good programming experience (Python/Go)
Hands-on experience with Kubernetes and Docker
Working knowledge of at least one cloud platform (AWS, Azure, GCP)
Experience with monitoring and logging tools (e.g. Datadog, ELK, Prometheus, Grafana)
Good knowledge of Unix systems, networking, web technologies, and databases
Expertise in troubleshooting issues and bugs
Incident Management experience coupled with effective communication skills
Nice to have:
Experience in the financial domain
Prior SRE/DevOps experience
What we offer:
Flexible work arrangements (hybrid model) and a casual dress code
Opportunity to work on challenging projects in a dynamic, global environment
Continuous learning and development opportunities
Collaborative and innovative work culture
Competitive compensation and benefits package
Modern and comfortable office located at Avenida da Liberdade (Lisbon)