Site Reliability Engineer (DevOps)

Netherlands, Amsterdam · Job Posted March 18, 2026

Job Description

Mist AI is the AI-native networking solution from HPE Juniper Networking. Our Software Engineering team is seeking a Site Reliability Engineer to join our talented team and build high-quality technology solutions that revolutionize networking, powered by Artificial Intelligence in the cloud. Mist AI provides services through SaaS applications to many Fortune 100 and Fortune 500 customers.

You will take ops projects from concept through to launch. You will be responsible for maintaining and improving the company's production environment for rapid scaling and outstanding performance, and for helping us keep stellar uptime and reliability. The improvements you implement will be felt by the entire organization.

To be successful, you need a hunger to learn and the ability to adapt to new technology quickly. We need people who are naturally curious, can self-start, and share learnings and outcomes effectively with a distributed team. You need to be a builder at heart.

Job Responsibility

  • Apply your passion for infrastructure as code and continuous deployment to build scalable and highly reliable systems
  • Define and own KPIs around system availability, quality and scale
  • Partner with our developers and quality engineering teams to automate the monitoring, alerting, availability and scalability of our applications and systems
  • Ensure system availability and business continuity by implementing redundant servers/services
  • Manage after-hours infrastructure updates and maintenance
  • Proactively research and propose the use of new concepts, processes, technologies, and tools
  • Partner with software developers to create Mist standards for Microservices (APIs, schemas, serialization, data stores and best practices)
  • Run secure and scalable applications for highly available, multi-region, AWS and GCP deployments
  • Ship code several times per week
  • Be a part of our On-Call rotation
  • Own disaster recovery and business continuity plans

Requirements

  • An extensive background in developing and operating large-scale cloud-based distributed applications
  • Direct experience developing/running applications on AWS or Google Cloud
  • Laser focus and the ability to design infrastructure solutions for scalability, reliability, high availability, performance, security, software maintainability, and operational excellence
  • The ability to 'fix the plane while in flight' (not just support greenfield solutions)
  • The ability to prioritize existing technical and infrastructure debt, and the experience to build and execute a plan to pay it off
  • Experience delivering web-scale infrastructure for a global market at high release velocity
  • A deep understanding of distributed system design and dependency management
  • Must have solid experience with at least two of the following languages: Go, Java, Python
  • 10+ years of industry experience in managing infrastructure
  • 5 years of Kubernetes administration in a large-scale SaaS environment
  • 5 years maintaining production systems on AWS or GCP
  • 3 years implementing, managing, and monitoring metrics specific to SaaS applications
  • 3 years using infrastructure as code software (e.g., Terraform, AWS and Google Cloud Deployment, CloudFormation)
  • 5 years’ experience in continuous integration practices and tools (Jenkins, Travis CI, CircleCI, etc.)
  • Previous experience of contributing to war rooms and blameless postmortems
  • Superb communication skills, written and verbal
  • Experience of working in a true DevOps environment with daily collaborations
  • Thrives in a fast-paced startup environment where there may be multiple competing priorities
  • Customer-service mindset
  • Passion for improvement

Nice to have

  • Experience with Kafka, Spark, Storm, Cassandra, Elasticsearch, PostgreSQL, Redis, ZooKeeper, Nginx, Airflow
  • Experience of working with or contributing directly to Open Source projects
  • Understanding and experience of leading/managing technology products
  • Understanding of machine learning techniques and tools, and the ability to translate business requirements into data models and implement them for scalable, production-ready systems
  • Experience of working with failure-based testing
  • Experience working in a test-driven development environment

What we offer

  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion

Similar Jobs for Site Reliability Engineer (DevOps)

8 matching positions

Site Reliability Engineer / DevOps

As Scale's product portfolio and customer base expand, we are seeking skilled Si...
Location: Mexico, Mexico City
Salary: Not provided
Company: Scale (scale.com)
Expiration Date: Until further notice
Requirements
  • 5+ years of experience as an IT Technician, Field Engineer, Facilities Technician, or similar hands-on technical role
  • Proven track record of installing and maintaining server hardware and workstations
  • Experience configuring and troubleshooting network infrastructure and physical installation work
  • Experience coordinating shared resources or facilities, managing schedules and access
  • Familiarity with Linux, Windows, and macOS
  • Knowledge of Python/C++, scripting, and command line (cmd/bash) operations
  • Basic understanding of electrical systems for robot stations and equipment
  • Comfort with rapidly changing, fast-paced environments and a passion for finding solutions to complex physical infrastructure problems
  • Basic understanding of safety protocols and an eagerness to integrate them into facility operations
  • A hunger for learning new technologies, particularly in the realm of robotics and automated systems
Job Responsibility
  • Install, configure, and maintain robot stations and related technical equipment on-site
  • Manage and maintain server infrastructure, workstations, and computing equipment in our facilities
  • Oversee network installations, including routers and cabling infrastructure
  • Coordinate access to technical facilities, maintaining schedules and ensuring orderly usage of shared resources
  • Establish and enforce safety protocols and usage guidelines for technical facilities
  • Provide hands-on support to remote engineers, performing physical changes and configurations as requested
  • Troubleshoot hardware, network, and connectivity issues, ensuring minimal disruption to operations
  • Document infrastructure configurations, changes, and maintenance procedures to ensure maintainability and knowledge transfer
  • Proactively identify maintenance needs and capacity constraints to prevent issues before they arise
  • Drive standardization and foster collaboration across different teams to achieve efficient facility usage

Senior DevOps / Site Reliability Engineer

Our client is a leader in sustainable packaging solutions, leveraging cutting-ed...
Salary: Not provided
Company: N-iX (n-ix.com)
Expiration Date: Until further notice
Requirements
  • Microsoft Azure (App Services, VM, Container Instances, AKS, SQL Server, Azure SQL)
  • Git, GitHub, GitHub Actions
  • SonarQube Cloud, Terraform, Docker
  • Datadog (extensive), SRE concepts (SLOs, SLIs, golden signals, instrumentation)
  • Incident management, dashboard development, business reporting
  • Shell scripting, YAML/JSON configs, Python
  • Ubuntu, RHEL, CentOS, Windows/Server (entry)
  • Atlassian Suite (Jira/Confluence)
  • ITSM / ITIL familiarity
  • AI tools (Claude, GitHub Copilot, etc.)
Job Responsibility
  • Cloud Infrastructure: Architect, implement, and manage Microsoft Azure resources including App Services, Virtual Machines, Container Instances, AKS, SQL Server/Instance, and Azure SQL
  • DevOps Automation: Design and maintain CI/CD workflows using Git, GitHub Actions, SonarQube Cloud, Terraform, and Docker
  • SRE Practices: Develop and monitor SLOs, SLIs, and golden signals; instrument applications and infrastructure; build Datadog dashboards for real-time business and incident reporting
  • Incident Management: Lead incident response, root cause analysis, and post-mortem documentation. Maintain high availability and rapid recovery for business-critical systems
  • Monitoring & Observability: Extensive use of Datadog for monitoring, logging, and performance analytics
  • Configuration Management: Work with Shell, YAML, JSON, and Python for scripting, automation, and configuration
  • System Administration: Administer Ubuntu, RHEL, CentOS, and (entry-level) Windows Server environments
  • Collaboration: Utilize Atlassian Suite (Jira, Confluence) for documentation, ticketing, and project tracking
What we offer
  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits

Middle Site Reliability Engineer (CDN & DevOps)

Provectus is looking for a Senior DevOps or SRE professional to join our team an...
Salary: Not provided
Company: Provectus (provectus.com)
Expiration Date: Until further notice
Requirements
  • 3+ years in a DevOps, SRE, or Web Operations role
  • Hands-on experience with a public CDN like Fastly, Varnish, Cloudflare, or Akamai, including writing cache and routing rules
  • Strong understanding of HTTP fundamentals such as cache headers, redirects, surrogate keys, and purge strategies
  • Experience with GitLab CI/CD pipelines and solid Linux administration skills
  • Scripting proficiency in Python or Bash and comfort using AI-assisted coding tools like Copilot or Claude
  • Upper-intermediate English with strong communication skills for a distributed team
Job Responsibility
  • Manage and improve CDN configuration including caching rules, redirects, and traffic routing across a globally distributed edge
  • Triage CDN-related production issues through log analysis and performance investigations for high-traffic events
  • Review merge requests from product and platform engineering teams and advise on cache behavior and edge performance
  • Build and maintain CI/CD pipelines in GitLab CI for safely delivering CDN configuration changes
  • Partner with development teams across US and EU time zones to onboard new services behind the CDN
  • Maintain documentation, runbooks, and operational procedures while contributing to monitoring and alerting on edge traffic
What we offer
  • Opportunity to work with cutting-edge AI and cloud solutions
  • Internal training programs (Leadership, Public Speaking, and more) with full support for AWS and other professional certifications
  • Career growth: a clear path toward SA or beyond; we actively develop our engineers
  • Access to the latest AI tools and premium subscriptions
  • Long-term B2B collaboration
  • Remote with flexible hours
  • Private medical insurance or a budget for your medical needs
  • Paid sick leave, vacation, and public holidays
  • Equipment and all the tech you need for comfortable, productive work

Middle Site Reliability Engineer (CDN & DevOps)

Provectus is looking for a Senior DevOps or SRE professional to join our team an...
Location: Serbia; Spain; Poland; Armenia; North Macedonia, Serbia; Spain; Poland; Yerevan, Armenia; Skopje
Salary: Not provided
Company: Provectus (provectus.com)
Expiration Date: Until further notice
Requirements
  • 3+ years in a DevOps, SRE, or Web Operations role
  • Hands-on experience with a public CDN like Fastly, Varnish, Cloudflare, or Akamai, including writing cache and routing rules
  • Strong understanding of HTTP fundamentals such as cache headers, redirects, surrogate keys, and purge strategies
  • Experience with GitLab CI/CD pipelines and solid Linux administration skills
  • Scripting proficiency in Python or Bash and comfort using AI-assisted coding tools like Copilot or Claude
  • Upper-intermediate English with strong communication skills for a distributed team
Job Responsibility
  • Manage and improve CDN configuration including caching rules, redirects, and traffic routing across a globally distributed edge
  • Triage CDN-related production issues through log analysis and performance investigations for high-traffic events
  • Review merge requests from product and platform engineering teams and advise on cache behavior and edge performance
  • Build and maintain CI/CD pipelines in GitLab CI for safely delivering CDN configuration changes
  • Partner with development teams across US and EU time zones to onboard new services behind the CDN
  • Maintain documentation, runbooks, and operational procedures while contributing to monitoring and alerting on edge traffic
What we offer
  • Opportunity to work with cutting-edge AI and cloud solutions
  • Internal training programs (Leadership, Public Speaking, and more) with full support for AWS and other professional certifications
  • Career growth: a clear path toward SA or beyond; we actively develop our engineers
  • Access to the latest AI tools and premium subscriptions
  • Long-term B2B collaboration
  • Remote with flexible hours
  • Private medical insurance or a budget for your medical needs
  • Paid sick leave, vacation, and public holidays
  • Equipment and all the tech you need for comfortable, productive work
Employment type: Full-time

DevOps and Site Reliability Engineer

We’re seeking a DevOps and Site Reliability Engineer with strong expertise in Mi...
Location: United Kingdom, Park Royal
Salary: Not provided
Company: 360 Resourcing Solutions (jobs.360resourcing.co.uk)
Expiration Date: Until further notice
Requirements
  • 6+ years in DevOps or SRE roles
  • Specific experience managing high-traffic Azure-hosted environments at scale
  • Mastery of Terraform — including module authoring, remote state management, and workspace strategies for multi-environment deployments
  • Expert-level KQL (Kusto Query Language) for Log Analytics
  • Comfortable building custom Azure Monitor Workbooks for operational reporting
  • Strong security automation experience: passwordless authentication via OIDC, Azure Key Vault integration, and secrets management best practices
  • In-depth knowledge of Azure Container Apps (ACA), VNet Integration, and Private Endpoint configuration for secure, network-isolated workloads
Job Responsibility
  • Observability Platform: Implement and own the Bestway Azure Observability Playbook — building comprehensive dashboards, alert rules, and runbooks using Application Insights, Log Analytics, and KQL
  • AIOps Automation: Develop intelligent alerting systems that leverage AI/ML to detect early-warning signals — including IP reputation degradation, database saturation trends, and anomalous traffic patterns — before they escalate to incidents
  • Release Assurance: Define and execute Operational Acceptance Testing (OAT) gates for all Production deployments, ensuring releases meet reliability, performance, and security thresholds before go-live
  • Infrastructure Hygiene: Conduct periodic audits of the Azure tenant to identify and decommission orphaned or unutilised resources ('Zombie Resources') — directly reducing operational burn rate
  • IaC & CI/CD: Build and maintain reusable Terraform modules; manage pipeline integrity across GitHub Actions workflows to ensure consistent, reproducible infrastructure deployments with multi-subscription Hub-and-Spoke Networking
What we offer
  • Competitive salary
  • Pension
  • 22 days annual leave, plus bank holidays
  • Onsite parking
  • Life assurance
Employment type: Full-time

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...
Location: Ireland, Dublin
Salary: Not provided
Company: General Motors (gm.com)
Expiration Date: Until further notice
Requirements
  • 8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
  • Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
  • Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
  • Proficiency in Python, Go, or Java, with strong code review and readability standards
  • Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
  • Ability to think and act under pressure
  • Strong communication skills
Job Responsibility
  • Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
  • Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
  • Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
  • Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
  • Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
  • Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
  • Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
  • Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling
Employment type: Full-time

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location: United States, San Francisco
Salary: 230000.00 - 345000.00 USD / Year
Company: Lambda (lambda.ai)
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
Employment type: Full-time

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...
Location: Egypt, Giza
Salary: Not provided
Company: Rackspace (rackspace.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering/computer science or equivalent
  • Senior-level experience with Site Reliability Engineering; DevOps; code-level application support and troubleshooting; AWS infrastructure design, implementation, and optimization; and automation for deployment, scaling, and reliability
  • Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
  • Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
  • Proactive approach to identifying problems and solutions
  • Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
  • Experience with Terraform or CloudFormation scripting
  • Experience with configuration management tools like Ansible, Chef or Puppet
  • Experience with standard software development best practices and tools such as code repositories (Git preferred)
  • Experience executing in an agile software development environment
Job Responsibility
  • Work with customers and implement Observability solutions
  • Build and maintain scalable systems and robust automation that supports engineering goals
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
  • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
  • Collaborate with team members to document and share solutions
  • Maintain a deep understanding of the customer’s business as well as their technical environment
  • Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues
Employment type: Full-time