Site Reliability Engineer

Portugal, Lisboa · Job Posted December 26, 2025

Job Description

We are recruiting a Junior SRE for a company that provides an advanced data, operations, and analytics platform tailored for sophisticated asset managers, hedge funds, and banks.

Requirements

  • Up to 2-3 years of experience in a Site Reliability Engineering (SRE), DevOps, or Production Engineering role, with a deep understanding of SRE principles and best practices
  • Incident management expertise, including triaging, escalation, and resolution of high-severity outages
  • Proficiency in at least one coding language (Python or Java) for automation and debugging
  • Hands-on experience in Kubernetes (K8s) for managing and orchestrating containerized applications
  • Cloud experience (AWS preferred) with exposure to key services like EC2, S3, Lambda, and CloudWatch
  • Excellent communication skills to articulate technical challenges and solutions effectively
  • Strong troubleshooting and problem-solving skills, with experience diagnosing complex production issues
  • Ability to stay calm under pressure, multitask, and prioritize effectively in fast-moving environments
  • Fluency in English (spoken and written) is required
  • Must have the legal right to work in the country

Nice to have

  • Experience with Terraform or CloudFormation for infrastructure-as-code
  • Experience with monitoring tools (e.g., Datadog, Prometheus, Grafana)
  • Familiarity with web application architectures and best practices
  • Exposure to CI/CD pipelines and DevOps workflows

Similar Jobs for Site Reliability Engineer

8 matching positions

Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to strengthen the re...
Location: United States, San Francisco
Salary: Not provided
Robert Half
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, infrastructure engineering, or a closely related production environment role
  • Strong background operating, supporting, and troubleshooting distributed systems at scale
  • Hands-on experience with observability platforms such as Datadog, Grafana, OpenTelemetry, CloudWatch, or similar tools
  • Proven involvement in on-call operations, incident management, and reliability-focused problem resolution
  • Practical experience defining and using SLIs, SLOs, and error budgets to guide service reliability decisions
  • Familiarity with AWS environments, including serverless and container-based architectures
  • Experience working with relational databases such as Postgres and performance analysis in production systems
  • Ability to write automation scripts or lightweight tooling in languages such as Python or Bash, with strong judgment around failure modes and system design
Job Responsibility
  • Establish measurable reliability standards for critical services by creating and maintaining service indicators, objectives, and error budget practices
  • Take ownership of production stability by monitoring uptime, latency, and availability, and driving improvements that reduce operational risk
  • Lead live incident response efforts, coordinate troubleshooting during outages, and ensure issues are resolved efficiently and thoroughly
  • Run blameless post-incident reviews, document findings clearly, and track corrective actions through completion
  • Design and enhance observability across logs, metrics, and distributed tracing using tools such as Datadog, CloudWatch, Grafana, OpenTelemetry, and Sentry
  • Improve alert quality and dashboard design so engineering teams can quickly identify meaningful system issues without unnecessary noise
  • Evaluate system behavior under load, uncover performance constraints, and recommend changes that improve scalability and resource efficiency
  • Build automation and internal tooling that streamline operational work, strengthen deployment safety, and support incident management, debugging, and capacity planning
  • Contribute to infrastructure and delivery workflows across AWS, Terraform, Ansible, Linux, and GitHub Actions with a focus on dependable releases and resilient systems
  • Partner with security and compliance stakeholders to support operational standards, audit readiness, and the integration of monitoring into broader engineering practices
What we offer
  • Medical, vision, dental, and life and disability insurance
  • Enrollment in company 401(k) plan

Site Reliability Engineer

We’re hiring an SRE to help improve the availability, performance, scalability, ...
Location: Israel, Netanya/Tel Aviv
Salary: Not provided
JFrog
Expiration Date: Until further notice
Requirements
  • 2-4 years of experience in SRE, Production Engineering, DevOps, or a similar role with hands-on production exposure
  • Strong troubleshooting and analytical skills, with the ability to investigate production issues in a structured and methodical way
  • Hands-on experience with Kubernetes-based containerized workloads
  • Experience with at least one public cloud provider: AWS, GCP, or Azure
  • Experience developing backend services, internal platforms, automation, or production engineering tools using Python, Go, or another programming language
  • Practical understanding of Linux fundamentals, networking concepts, HTTP, DNS, service connectivity, and production troubleshooting
  • Familiarity with CI/CD tools such as Jenkins, ArgoCD, GitHub Actions, or similar
  • Exposure to observability tools covering metrics, logs, and traces, such as Prometheus, Grafana, Coralogix, New Relic, or similar platforms
  • Understanding of incident management processes, alerting systems, and production support workflows
  • Ability to learn quickly, take ownership, communicate clearly, and work well in a collaborative production environment
Job Responsibility
  • Support the reliability, availability, performance, and scalability of JFrog’s large-scale, multi-cloud, Kubernetes-based SaaS environments
  • Investigate and troubleshoot production issues across distributed systems, infrastructure, Kubernetes, and cloud environments in close collaboration with Engineering teams
  • Design and develop backend services, internal platforms, and production engineering tools using Python, Go, or similar technologies
  • Improve reliability, observability, and operational readiness through SRE practices, monitoring and alerting, capacity awareness, postmortems, and safer CI/CD and production change processes
  • Evaluate and contribute to AI-assisted and agentic automation solutions that improve operational efficiency, troubleshooting, and production workflows
  • Support resilience initiatives, including disaster recovery validation, service readiness, health checks, and production readiness reviews
  • Participate in on-call rotations, lead incident response when needed, and drive follow-up actions to prevent recurrence
  • Continuously learn and evaluate new technologies that can improve reliability, automation, and operational excellence
  • Fulltime

Site Reliability Engineer

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location: United States, Redmond
Salary: 102100.00 - 202200.00 USD / Year
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • 4+ years technical experience in software engineering, network engineering, or systems administration
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Ability to obtain and maintain favorably adjudicated Tier 3 (T3) background investigation
  • Ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Own reliability and operational health for one or more Substrate components or services in highly regulated environments
  • Serve as an actively engaged on-call engineer (OCE), participating in an on-call rotation and independently responding to incidents for owned services
  • Respond to, diagnose, and resolve production incidents with minimal supervision
  • Design and implement automation to reduce operational toil and improve service stability
  • Develop and maintain monitoring, alerting, and telemetry to support SLOs and operational metrics
  • Lead post-incident reviews for owned incidents, focusing on root cause analysis and durable fixes
  • Collaborate with software engineering teams to embed reliability and operability into service design
  • Write and maintain production-quality code and automation that improves reliability, scalability, and operational efficiency
What we offer
  • May be eligible for benefits and other compensation
  • Additional benefits and pay information available at https://careers.microsoft.com/us/en/us-corporate-pay
  • Fulltime

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...
Location: Canada, Montreal
Salary: 200000.00 CAD / Year
Hunter Bond
Expiration Date: Until further notice
Requirements
  • Genuine passion for Linux & open source
  • Excellent knowledge of Python
  • Use of CI/CD, Docker, Ansible, Chef, Puppet
  • Knowledge of large-scale storage systems (on-prem)
Job Responsibility
  • Help architect resilient, multi-petabyte storage solutions & build new data centres
  • Automate anything and everything with Python & config tools
  • Innovate whilst bringing in new ideas
What we offer
  • Flexible hours/work options
  • Working in one of the world’s most elite teams
  • Invest heavily in cutting-edge and next-gen tech
  • Technologists only report to other technologists
  • Brand new skyline Manhattan office
  • Start-up style environment
  • Fulltime

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...
Location: Mexico, Guadalajara
Salary: Not provided
NTT DATA
Expiration Date: Until further notice
Requirements
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
  • Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
  • Familiar with PaaS and IaaS - VMs, Storage, EventHub, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), CosmosDB, SQL Server, IoT Hub, Databricks, KeyVault, Datalake
  • Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
  • Understand the concept of tools - Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
  • Understand the concept of container orchestration platforms (e.g. Kubernetes)
  • Understand the concept of scripts: PowerShell, Python
  • Understand the difference between NoSQL and SQL databases, and how to maintain them
Job Responsibility
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
  • Fulltime

Site Reliability Engineer

We are looking for a Lead Site Reliability Engineer (SRE) with strong experience...
Location: India, Bangalore
Salary: Not provided
Karix
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in SRE / DevOps / Production Engineering roles
  • Strong expertise in troubleshooting distributed systems and microservices architecture
  • Hands-on experience with Kafka, RabbitMQ, and Redis
  • Strong knowledge of Kubernetes and container orchestration
  • Experience with CI/CD pipelines and deployment automation
  • Solid understanding of Linux, networking, and cloud platforms (AWS / Azure / GCP)
  • Experience with Infrastructure as Code (Terraform, Ansible)
  • Strong scripting skills (Python, Bash, or similar)
  • Database experience: MySQL / Oracle / MongoDB
  • Strong problem-solving, ownership mindset, and ability to lead initiatives
Job Responsibility
  • Lead troubleshooting and resolution of complex production issues in distributed systems
  • Drive reliability engineering practices, ensuring high availability and performance of systems
  • Manage and optimize messaging systems like Apache Kafka, RabbitMQ, and Redis
  • Architect, manage, and optimize Kubernetes clusters for scalability and resilience
  • Manage CI/CD pipelines and drive deployment automation
  • Implement and maintain monitoring, alerting, and observability using Prometheus, Grafana, and ELK stack
  • Lead incident management, root cause analysis (RCA), and post-mortem reviews
  • Mentor junior engineers and collaborate with cross-functional teams to improve system design and reliability
What we offer
  • Impactful Work: Play a key role in ensuring reliability and scalability of platforms that handle large-scale, real-time communication systems
  • Tremendous Growth Opportunities: Accelerate your career by leading critical reliability initiatives and working on high-scale distributed systems
  • Innovative Environment: Work in a fast-paced ecosystem that embraces automation, cloud-native technologies, and continuous improvement
  • Fulltime

Site Reliability Engineer

Microsoft is a company where passionate innovators come to collaborate, envision...
Location: India, Bangalore
Salary: Not provided
Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Job Responsibility
  • Work with all aspects of a high throughput and multi-tenant service
  • Collaborate effectively within the team and with partner teams across Microsoft
  • Be part of the on-call rotation for maintaining service health
  • Design, implement, and refine chosen solutions in close partnership with Product Management and partner teams
  • Champion operational excellence via established metrics, process governance, and policy controls for regular assessment and improvement
  • Document and define existing data engineering processes, data and technology, while evaluating them for optimization
  • System Reliability & Uptime – Ensuring high availability of services
  • Incident Management – Detecting, responding to, and mitigating system failures
  • Performance Monitoring – Tracking system health and resolving bottlenecks
  • Automation & Tooling – Reducing manual work through scripts and automation
  • Fulltime

Site Reliability Engineer

Location: South Africa, Johannesburg
Salary: Not provided
Nintex
Expiration Date: Until further notice
Requirements
  • You provide guidance on infrastructure architecture and contribute to high-quality and successful product releases.
  • You contribute to your team and domain through successfully leading and consistently delivering on projects of ambiguous scope, high complexity, and critical business impact.
  • You contribute to relevant guilds, practice forums and other initiatives to improve Nintex’s DevOps and SRE discipline.
  • You have an in-depth understanding of distributed systems architecture, as well as monitoring and observability practices and tools.
  • You quickly resolve priority infrastructure issues and help other technical team members or Product Managers understand how to avoid them in the future.
  • You provide detailed estimates for work items you propose or are assigned.
  • You assist in decision-making around tooling, automation practices, and testing solutions.
  • You stay up-to-date with technology trends and use this knowledge to help your team and the broader Engineering practice.
  • You run Nintex infrastructure with IaC tools (such as Terraform) and GitHub Actions for automation, containerize our environments (Kubernetes) and leverage cloud technologies to meet our goals
  • You build monitoring that alerts on symptoms rather than outages using tools like Prometheus, Grafana, Alertmanager and PagerDuty
Job Responsibility
  • You are highly skilled and sufficiently experienced in Nintex DevOps tools and processes to own a long-term program or technology such as Kubernetes, etc.
  • You write scripts, tools and utilities that support and integrate with delivery pipelines and you integrate telemetry where appropriate.
  • You are called into incidents and bring trusted knowledge in your platform domain.
  • You debug and fix infrastructure issues on production environments quickly using the relevant tools and guidelines to prevent recurrence.
  • You build, promote and support infrastructure patterns and practices within Nintex.
  • You provide coaching/mentoring to other Engineers on the team
  • You lead or contribute to post-mortems for incidents, including root cause analysis and identification of preventative and remedial actions.
  • You continuously monitor our platform performance and take immediate action to improve it
  • You review and advise on appropriate design patterns to solve automation and infrastructure problems without creating technical debt.
  • You design and build complex infrastructure components for distributed systems such as Kubernetes.
What we offer
  • Global Gratitude and Recharge Days
  • Flexible, paid time off policy
  • Employee wellness programs and counseling resources
  • Meaningful peer recognition and awards
  • Paid parental leave
  • Invention/patenting assistance
  • Community impact, paid volunteer time, and opportunities
  • Intercultural learning and celebration
  • Multiple tools through which to learn and grow, and an incredible global community