Site Reliability Engineer

Spain, Valencia · Job Posted February 16, 2026

Job Description

Our agentic process automation platform helps enterprises automate complex, decision-heavy processes that traditional automation can’t handle and GenAI can’t be trusted with. We enable organizations to scale operations, resist hallucinations, and bring end-to-end visibility and control to their most complex processes. Powered by a new kind of computing platform, Maisa combines AI-driven problem solving with programmatic execution, so every action is reliable, auditable, and built for enterprise scale.

Job Responsibility

  • Build and maintain AWS cloud infrastructure using Terraform, Pulumi, and Helm charts
  • Manage and scale Kubernetes clusters and container orchestration
  • Design and implement infrastructure-as-code for repeatable, reliable deployments
  • Support both cloud-based and on-premise installation requirements
  • Optimize cloud costs while maintaining performance and reliability targets
  • Plan and execute infrastructure capacity and scaling strategies
  • Implement comprehensive monitoring and logging using Grafana, Prometheus, and (in the future) Elasticsearch/Kibana
  • Define and track SLIs, SLOs, and error budgets for critical services
  • Build alerting strategies that enable proactive incident response
  • Lead incident response, post-mortems, and continuous improvement initiatives
  • Create and maintain runbooks and operational documentation
  • Configure and maintain CI/CD pipelines in GitHub Actions
  • Automate deployment, scaling, and recovery processes
  • Implement infrastructure security best practices (encryption at rest/in transit, network policies, IAM)
  • Manage disaster recovery and business continuity procedures
  • Collaborate with development teams to optimize application performance and reliability
  • Work with enterprise infrastructure teams on deployment requirements and integration
  • Support technical discussions around architecture and deployment models
  • Respond to infrastructure and availability questions during vendor assessments

Requirements

  • Strong demonstrable experience in DevOps, SRE, or cloud infrastructure engineering roles
  • Strong hands-on experience with AWS cloud services and infrastructure
  • Proficiency with infrastructure-as-code tools (Terraform, Pulumi)
  • Experience managing Kubernetes in production environments
  • Knowledge of CI/CD pipelines and deployment automation
  • Experience with monitoring and observability tools (Grafana, Prometheus)
  • Understanding of networking, security best practices, and system hardening
  • Strong troubleshooting and problem-solving skills for complex distributed systems
  • Ability to balance reliability, performance, and velocity
  • Fluent Spanish (essential: you'll interact directly with client infrastructure teams)

Nice to have

  • Experience with Helm charts
  • Programming (Go, Python, Rust)
  • Container security
  • Compliance frameworks (SOC 2, ISO 27001)
  • Penetration testing

Similar Jobs for Site Reliability Engineer

8 matching positions

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA
Expiration Date: Until further notice
Requirements
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
  • Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
  • Familiar with PaaS and IaaS - VMs, Storage, Event Hubs, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), Cosmos DB, SQL Server, IoT Hub, Databricks, Key Vault, Data Lake
  • Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
  • Understand the concepts behind tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
  • Understand the concept of container orchestration platforms (e.g. Kubernetes)
  • Understand scripting concepts: PowerShell, Python
  • Understand the difference between NoSQL and SQL databases, and how to maintain them
Job Responsibility
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
Employment type: Full-time

Site Reliability Engineer

Location: South Africa, Johannesburg
Salary: Not provided
Company: Nintex
Expiration Date: Until further notice
Requirements
  • You provide guidance on infrastructure architecture and contribute to high-quality and successful product releases.
  • You contribute to your team and domain through successfully leading and consistently delivering on projects of ambiguous scope, high complexity, and critical business impact.
  • You contribute to relevant guilds, practice forums and other initiatives to improve Nintex’s DevOps and SRE discipline.
  • You have an in-depth understanding of distributed systems architecture, as well as monitoring and observability practices and tools.
  • You quickly resolve priority infrastructure issues and help other technical team members or Product Managers understand how to avoid them in the future.
  • You provide detailed estimates for work items you propose or are assigned.
  • You assist in decision-making around tooling, automation practices, and testing solutions.
  • You stay up-to-date with technology trends and use this knowledge to help your team and the broader Engineering practice.
  • You run Nintex infrastructure with IaC tools (such as Terraform) and GitHub Actions for automation, containerize our environments (Kubernetes) and leverage cloud technologies to meet our goals.
  • You build monitoring that alerts on symptoms rather than outages using tools like Prometheus, Grafana, Alertmanager and PagerDuty
Job Responsibility
  • You are highly skilled and sufficiently experienced in Nintex DevOps tools and processes to own a long-term program or technology such as Kubernetes.
  • You write scripts, tools and utilities that support and integrate with delivery pipelines and you integrate telemetry where appropriate.
  • You are called into incidents and bring trusted knowledge in your platform domain.
  • You debug and fix infrastructure issues on production environments quickly using the relevant tools and guidelines to prevent recurrence.
  • You build, promote and support infrastructure patterns and practices within Nintex.
  • You provide coaching/mentoring to other Engineers on the team
  • You lead or contribute to post-mortems for incidents, including root cause analysis and identification of preventative and remedial actions.
  • You continuously monitor our platform performance and take immediate action to improve it
  • You review and advise on appropriate design patterns to solve automation and infrastructure problems without creating technical debt.
  • You design and build complex infrastructure components for distributed systems such as Kubernetes.
What we offer
  • Global Gratitude and Recharge Days
  • Flexible, paid time off policy
  • Employee wellness programs and counseling resources
  • Meaningful peer recognition and awards
  • Paid parental leave
  • Invention/patenting assistance
  • Community impact, paid volunteer time, and opportunities
  • Intercultural learning and celebration
  • Multiple tools through which to learn and grow, and an incredible global community

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...
Location: Hong Kong, Hong Kong
Salary: 1200000.00 HKD / Year
Company: Hunter Bond
Expiration Date: Until further notice
Requirements
  • Genuine passion for Linux and open source
  • Excellent knowledge of Python
  • Use of CI/CD, Docker, Ansible, Chef, Puppet
  • Knowledge of large-scale storage systems (on-prem)
Job Responsibility
  • Help architect resilient, multi-petabyte storage solutions and build new data centres
  • Automate anything and everything with Python & config tools
  • Innovate whilst bringing in new ideas
What we offer
  • Flexible hours/work options
  • Working in one of the world’s most elite teams
  • Heavy investment in cutting-edge and next-gen tech
  • Technologists only report to other technologists
  • Brand new skyline Manhattan office
  • Start-up style environment
Employment type: Full-time

Site Reliability Engineer

As a Staff Software Engineer, you will play a key role in designing, building, a...
Location: United States, San Jose
Salary: 120500.00 - 243000.00 USD / Year
Company: Hewlett Packard Enterprise
Expiration Date: Until further notice
Requirements
  • Minimum of 5 years of hands-on experience in Infra Ops, Dev Ops, or Site Reliability Engineering (SRE)
  • Proficiency with Linux systems, especially Debian-based distributions
  • Strong experience with cloud platforms such as AWS and GCP
  • Expertise in Infrastructure as Code tools like Terraform, Packer, and Ansible
  • Solid programming skills in Python and/or Golang
  • Deep understanding of containerization (Docker, containerd) and orchestration tools (AWS EKS, GCP GKE)
  • Experience with GitOps workflows
  • Proven track record in implementing and maintaining CI/CD pipelines
  • Strong background in security and familiarity with security programs
  • Experience with monitoring and logging tools (Prometheus, Grafana, ELK)
Job Responsibility
  • Enhance Infrastructure as Code (IaC) and enforce best practices
  • Optimize cloud infrastructure for scalability, security, and cost-effectiveness
  • Develop internal tools to support and streamline cloud platform operations
  • Improve CI/CD pipelines and deployment workflows using FluxCD and Jenkins
  • Address container image vulnerabilities and standardize remediation processes
  • Build Amazon Machine Images (AMIs) aligned with CIS and STIG benchmarks
  • Strengthen monitoring, alerting, and observability using Prometheus, Grafana, and logging tools
  • Troubleshoot complex production issues to ensure system reliability and customer satisfaction
  • Fine-tune distributed systems such as Apache Kafka and Cassandra
  • Collaborate with development, security, and operations teams to align infrastructure with application needs
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment type: Full-time

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...
Location: United Kingdom, London
Salary: 150000.00 GBP / Year
Company: Hunter Bond
Expiration Date: Until further notice
Requirements
  • Genuine passion for Linux and open source
  • Excellent knowledge of Python
  • Use of CI/CD, Docker, Ansible, Chef, Puppet
  • Knowledge of large-scale storage systems (on-prem)
Job Responsibility
  • Help architect resilient, multi-petabyte storage solutions and build new data centres
  • Automate anything and everything with Python & config tools
  • Innovate whilst bringing in new ideas
What we offer
  • Flexible hours/work options
  • Working in one of the world’s most elite teams
  • Heavy investment in cutting-edge and next-gen tech
  • Technologists only report to other technologists
  • Brand new skyline Manhattan office
  • Start-up style environment
Employment type: Full-time

Site Reliability Engineer

Microsoft is a company where passionate innovators come to collaborate, envision...
Location: India, Bangalore
Salary: Not provided
Company: Microsoft Corporation
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 3+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Own the end-to-end readiness of Event Stream across Azure regions, including onboarding new regions, driving deployment automation, and ensuring consistent, secure, and compliant service rollout
  • Work closely with platform, infrastructure, and partner teams (e.g., Event Hubs, Kusto, Fabric platform) to deliver resilient, low-latency streaming experiences on a global scale
  • Play a key role in advancing our reliability posture, improving availability, monitoring, and incident response across regions
  • Build strong observability, telemetry, and automated recovery mechanisms to meet high availability and SLA targets
  • Region Build-out & Deployment: Onboard new regions, drive deployment automation, and ensure consistent service configuration
  • Reliability & SRE: Improve availability, resiliency, and incident response; own service health across regions
  • Observability & Operations: Enhance telemetry, monitoring, alerting, and troubleshooting capabilities
  • Cross-team Collaboration: Partner with platform and infra teams to unblock dependencies and ensure smooth rollout
  • Production Excellence: Drive root-cause analysis, repair items, and continuous improvement on service reliability
Employment type: Full-time

Site Reliability Engineer

Location: United Kingdom, Newcastle
Salary: Not provided
Company: Trimble Inc.
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Engineering or a related field
  • At least 5 years of technical experience with a proven ability to take full ownership of production infrastructure
  • Excellent collaboration skills, with experience leading cross-functional work
  • Demonstrated success in managing infrastructure in production environments
  • Expertise in capacity planning and cost optimisation for efficient operations
  • Extensive expertise managing cloud provider hosted infrastructure, specifically with Microsoft Azure or AWS
  • Proficient in high-level scripting languages like Python and Infrastructure as Code tools (Terraform), along with containerisation
  • Demonstrated success with Kubernetes or other containerization technologies
  • Familiarity with CI/CD pipelines and tools such as Azure DevOps, Jenkins, Argo CD, Helm, GitHub
  • Experience with incident management processes and monitoring tools such as Prometheus, Grafana, New Relic, Datadog, Splunk, CloudWatch, Sumo Logic, etc.
Job Responsibility
  • Develop and maintain scalable infrastructure as code (IaC) using Terraform to ensure reliable and scalable cloud environments
  • Implement and enhance observability solutions using tools like New Relic, Datadog, Sumo Logic and Splunk for monitoring, logging, and alerting
  • Perform code deployments and manage CI/CD pipelines using Jenkins, GitHub, and related tooling to ensure smooth and efficient delivery processes
  • Automate routine tasks and workflows to increase operational efficiency and reduce manual intervention
  • Evaluate system designs and architectures for reliability, performance, security, and efficiency, ensuring best practices are followed
  • Lead incident response efforts and conduct deep-dive root cause analysis to implement long-term, innovative technical solutions
  • Develop and maintain comprehensive runbooks and procedures for incident response and operational tasks
  • Collaborate with cross-functional teams to review and provide feedback on technical designs, ensuring alignment with SRE principles
  • Participate in on-call rotations and handle critical incidents with confidence and expertise
  • Continuously improve documentation for systems and services, contributing to a knowledge-sharing culture within the team

Site Reliability Engineer

Shape the Future of Intelligent Operations as a Site Reliability Engineer (AI Op...
Location: India, Chennai
Salary: Not provided
Company: Trimble Inc.
Expiration Date: Until further notice
Requirements
  • 1 to 2 years of professional experience in a DevOps, MLOps, or systems engineering environment
  • Bachelor's degree in Computer Science, Engineering, Information Technology, or a closely related technical field
  • Direct experience with the Microsoft Azure cloud platform and its specialized ecosystem services (such as Azure ML and Azure DevOps)
  • Proficiency with Python or other scripting languages (Shell / Bash / PowerShell) for rapid system integration and task automation
  • Foundational understanding of containerization (Docker), basic orchestration concepts (Kubernetes fundamentals), and version control system workflows (Git)
  • Solid baseline knowledge of fundamental DevOps principles (CI/CD, system administration) and a basic understanding of the end-to-end machine learning model lifecycle
Job Responsibility
  • Assist in the deployment and maintenance of machine learning models in production under direct supervision, building skills in containerization and orchestration architectures
  • Support the development of robust continuous integration and deployment pipelines for ML workflows, including model versioning, automated testing, and release processes
  • Monitor production ML model performance, detect data drift, and track system health by implementing foundational logging, alerting, and metrics solutions
  • Contribute to infrastructure automation and configuration management for machine learning workloads, learning to treat infrastructure as software
  • Partner closely with ML engineers and data scientists to operationalize complex models, ensuring reliability, scale, and strict adherence to established operational patterns
What we offer
  • Structured environment to accelerate technical skills
  • Direct guidance from experienced engineering professionals
  • Projects that improve productivity, quality, safety, transparency and sustainability
  • Collaborative and supportive team
  • Entrepreneurial spirit empowering proactive doers
  • Flexible work arrangements
Employment type: Full-time