DevOps and Site Reliability Engineer

United Kingdom, Park Royal · Job Posted April 20, 2026

Job Description

We’re seeking a DevOps and Site Reliability Engineer with strong expertise in Microsoft Azure to manage our observability platform and AIOps automation. The ideal candidate will have extensive hands-on experience with high-traffic environments and security automation, as well as in-depth platform knowledge.

Job Responsibility

  • Observability Platform: Implement and own the Bestway Azure Observability Playbook — building comprehensive dashboards, alert rules, and runbooks using Application Insights, Log Analytics, and KQL
  • AIOps Automation: Develop intelligent alerting systems that leverage AI/ML to detect early-warning signals — including IP reputation degradation, database saturation trends, and anomalous traffic patterns — before they escalate to incidents
  • Release Assurance: Define and execute Operational Acceptance Testing (OAT) gates for all Production deployments, ensuring releases meet reliability, performance, and security thresholds before go-live
  • Infrastructure Hygiene: Conduct periodic audits of the Azure tenant to identify and decommission orphaned or unutilised resources ('Zombie Resources') — directly reducing operational burn rate
  • IaC & CI/CD: Build and maintain reusable Terraform modules; manage pipeline integrity across GitHub Actions workflows to ensure consistent, reproducible infrastructure deployments with multi-subscription Hub-and-Spoke Networking

Requirements

  • 6+ years in DevOps or SRE roles
  • Specific experience managing high-traffic Azure-hosted environments at scale
  • Mastery of Terraform — including module authoring, remote state management, and workspace strategies for multi-environment deployments
  • Expert-level KQL (Kusto Query Language) for Log Analytics
  • Comfortable building custom Azure Monitor Workbooks for operational reporting
  • Strong security automation experience: passwordless authentication via OIDC, Azure Key Vault integration, and secrets management best practices
  • In-depth knowledge of Azure Container Apps (ACA), VNet Integration, and Private Endpoint configuration for secure, network-isolated workloads

What we offer

  • Competitive salary
  • Pension
  • 22 days annual leave, plus bank holidays
  • Onsite parking
  • Life assurance

Similar Jobs for DevOps and Site Reliability Engineer

8 matching positions

Site Reliability Engineer (DevOps)

Site Reliability Engineer (DevOps) - Netherlands Mist AI is the AI-native networ...
Location: Netherlands, Amsterdam
Salary: Not provided
Company: Hewlett Packard Enterprise (https://www.hpe.com/)
Expiration Date: Until further notice
Requirements
  • An extensive background in developing and operating large-scale cloud-based distributed applications
  • Direct experience developing/running applications on AWS or Google Cloud
  • Laser focus and the ability to design infrastructure solutions for scalability, reliability, high availability, performance, security, software maintainability, and operational excellence
  • The ability to 'fix the plane while in flight' (not just support greenfield solutions)
  • The ability to prioritize existing technical and infrastructure debt, and experience to build and execute a plan to pay it off
  • Delivering web-scale infrastructure for a global market at high release velocity
  • A deep understanding of distributed system design and dependency management
  • Solid experience with at least two of the following languages: Go, Java, Python
  • 10+ years industry experience in managing infrastructure
  • 5 years Kubernetes administration in a large-scale SaaS environment
Job Responsibility
  • Apply your passion for infrastructure as code and continuous deployment to build scalable and highly reliable systems
  • Define and own KPIs around system availability, quality and scale
  • Partner with our developers and quality engineering teams to automate the monitoring, alerting, availability and scalability of our applications and systems
  • Ensure system availability and business continuity by implementing redundant servers/services
  • Manage after-hours infrastructure updates and maintenance
  • Proactively research and propose the use of new concepts, processes, technologies, and tools
  • Partner with software developers to create Mist standards for Microservices (APIs, schemas, serialization, data stores and best practices)
  • Run secure and scalable applications for highly available, multi-region, AWS and GCP deployments
  • Ship code several times per week
  • Be a part of our On-Call rotation
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Site Reliability Engineer / DevOps

As Scale's product portfolio and customer base expand, we are seeking skilled Si...
Location: Mexico, Mexico City
Salary: Not provided
Company: Scale (scale.com)
Expiration Date: Until further notice
Requirements
  • 5+ years of experience as an IT Technician, Field Engineer, Facilities Technician, or similar hands-on technical role
  • Proven track record of installing and maintaining server hardware and workstations
  • Experience configuring and troubleshooting network infrastructure and physical installation work
  • Experience coordinating shared resources or facilities, managing schedules and access
  • Familiarity with Linux, Windows, and macOS
  • Knowledge of Python/C++, scripting, and command line (cmd/bash) operations
  • Basic understanding of electrical systems for robot stations and equipment
  • Comfort with rapidly changing, fast-paced environments and a passion for finding solutions to complex physical infrastructure problems
  • Basic understanding of safety protocols and an eagerness to integrate them into facility operations
  • A hunger for learning new technologies, particularly in the realm of robotics and automated systems
Job Responsibility
  • Install, configure, and maintain robot stations and related technical equipment on-site
  • Manage and maintain server infrastructure, workstations, and computing equipment in our facilities
  • Oversee network installations, including routers and cabling infrastructure
  • Coordinate access to technical facilities, maintaining schedules and ensuring orderly usage of shared resources
  • Establish and enforce safety protocols and usage guidelines for technical facilities
  • Provide hands-on support to remote engineers, performing physical changes and configurations as requested
  • Troubleshoot hardware, network, and connectivity issues, ensuring minimal disruption to operations
  • Document infrastructure configurations, changes, and maintenance procedures to ensure maintainability and knowledge transfer
  • Proactively identify maintenance needs and capacity constraints to prevent issues before they arise
  • Drive standardization and foster collaboration across different teams to achieve efficient facility usage

DevOps, Site Reliability Engineer, Vice President

The Vice President, Technology (DevOps/SRE) will lead the engineering and operat...
Location: United States, Jersey City
Salary: 142320.00 - 213480.00 USD / Year
Company: Citi (https://www.citi.com/)
Expiration Date: June 15, 2026
Requirements
  • 6-10 years of experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering, with demonstrated ownership of production platforms and delivery outcomes
  • Hands-on administration and troubleshooting skills across Linux and Windows, including strong command-line diagnostics and log analysis
  • Strong experience with Kubernetes and/or OpenShift, including Helm-based deployments and cluster troubleshooting
  • Experience with automation/configuration management (Ansible and Ansible Tower/Starfleet or equivalent) and a strong bias toward eliminating manual operational work
  • Demonstrated experience driving vulnerability remediation, patching, and platform hardening in partnership with security/compliance teams
  • Proven ability to plan and execute platform migrations and upgrades (OS, middleware, databases), including change management, runbooks, and production readiness
  • Strong communication and stakeholder management skills; able to influence engineering teams and senior leaders while remaining hands-on in critical technical work
Job Responsibility
  • CI/CD ownership: Architect, implement, and operate scalable CI/CD pipelines and release workflows; define standards for build, test, security scanning, and deployment automation
  • Tooling and platform engineering: Provide deep expertise across Jenkins, UDeploy, Tekton, Harness (or equivalent) including architecture, configuration, upgrades, and governance
  • Incident and pipeline triage: Diagnose and remediate failed pipelines (Jenkins/UDeploy) and deployment issues quickly; drive root-cause analysis and implement preventative controls
  • Hands-on systems administration: Perform command-line troubleshooting and administration across Linux and Windows; partner with infrastructure teams to resolve OS, network, and runtime issues impacting production
  • Platform migrations and upgrades: Lead and execute OS (e.g., RHEL) and platform upgrade initiatives across middleware and databases; plan cutovers, rollback strategies, and production readiness
  • Middleware lifecycle management: Coordinate upgrades for critical runtimes and middleware (Node.js, Python, JDK, Nginx, Tomcat)
What we offer
  • medical, dental & vision coverage
  • 401(k)
  • life, accident, and disability insurance
  • wellness programs
  • paid time off packages, including planned time off (vacation), unplanned time off (sick leave), and paid holidays
  • Fulltime

Senior DevOps / Site Reliability Engineer

Our client is a leader in sustainable packaging solutions, leveraging cutting-ed...
Salary: Not provided
Company: N-iX (n-ix.com)
Expiration Date: Until further notice
Requirements
  • Microsoft Azure (App Services, VM, Container Instances, AKS, SQL Server, Azure SQL)
  • Git, GitHub, GitHub Actions
  • SonarQube Cloud, Terraform, Docker
  • Datadog (extensive), SRE concepts (SLOs, SLIs, golden signals, instrumentation)
  • Incident management, dashboard development, business reporting
  • Shell scripting, YAML/JSON configs, Python
  • Ubuntu, RHEL, CentOS, Windows/Server (entry)
  • Atlassian Suite (Jira/Confluence)
  • ITSM / ITIL familiarity
  • AI tools (Claude, GitHub Copilot, etc.)
Job Responsibility
  • Cloud Infrastructure: Architect, implement, and manage Microsoft Azure resources including App Services, Virtual Machines, Container Instances, AKS, SQL Server/Instance, and Azure SQL
  • DevOps Automation: Design and maintain CI/CD workflows using Git, GitHub Actions, SonarQube Cloud, Terraform, and Docker
  • SRE Practices: Develop and monitor SLOs, SLIs, and golden signals; instrument applications and infrastructure; build Datadog dashboards for real-time business and incident reporting
  • Incident Management: Lead incident response, root cause analysis, and post-mortem documentation. Maintain high availability and rapid recovery for business-critical systems
  • Monitoring & Observability: Extensive use of Datadog for monitoring, logging, and performance analytics
  • Configuration Management: Work with Shell, YAML, JSON, and Python for scripting, automation, and configuration
  • System Administration: Administer Ubuntu, RHEL, CentOS, and (entry-level) Windows Server environments
  • Collaboration: Utilize Atlassian Suite (Jira, Confluence) for documentation, ticketing, and project tracking
What we offer
  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits

Middle Site Reliability Engineer (CDN & DevOps)

Provectus is looking for a Senior DevOps or SRE professional to join our team an...
Salary: Not provided
Company: Provectus (provectus.com)
Expiration Date: Until further notice
Requirements
  • 3+ years in a DevOps, SRE, or Web Operations role
  • Hands-on experience with a public CDN like Fastly, Varnish, Cloudflare, or Akamai, including writing cache and routing rules
  • Strong understanding of HTTP fundamentals such as cache headers, redirects, surrogate keys, and purge strategies
  • Experience with GitLab CI/CD pipelines and solid Linux administration skills
  • Scripting proficiency in Python or Bash and comfort using AI-assisted coding tools like Copilot or Claude
  • Upper Intermediate English with strong communication skills for a distributed team
Job Responsibility
  • Manage and improve CDN configuration including caching rules, redirects, and traffic routing across a globally distributed edge
  • Triage CDN-related production issues through log analysis and performance investigations for high-traffic events
  • Review merge requests from product and platform engineering teams and advise on cache behavior and edge performance
  • Build and maintain CI/CD pipelines in GitLab CI for safely delivering CDN configuration changes
  • Partner with development teams across US and EU time zones to onboard new services behind the CDN
  • Maintain documentation, runbooks, and operational procedures while contributing to monitoring and alerting on edge traffic
What we offer
  • Opportunity to work with cutting-edge AI and cloud solutions
  • Internal training programs (Leadership, Public Speaking, and more) with full support for AWS and other professional certifications
  • Career growth: a clear path toward SA or beyond; we actively develop our engineers
  • Access to the latest AI tools and premium subscriptions
  • Long-term B2B collaboration
  • Remote with flexible hours
  • Private medical insurance or a budget for your medical needs
  • Paid sick leave, vacation, and public holidays
  • Equipment and all the tech you need for comfortable, productive work

Middle Site Reliability Engineer (CDN & DevOps)

Provectus is looking for a Senior DevOps or SRE professional to join our team an...
Location: Serbia; Spain; Poland; Yerevan, Armenia; Skopje, North Macedonia
Salary: Not provided
Company: Provectus (provectus.com)
Expiration Date: Until further notice
Requirements
  • 3+ years in a DevOps, SRE, or Web Operations role
  • Hands-on experience with a public CDN like Fastly, Varnish, Cloudflare, or Akamai, including writing cache and routing rules
  • Strong understanding of HTTP fundamentals such as cache headers, redirects, surrogate keys, and purge strategies
  • Experience with GitLab CI/CD pipelines and solid Linux administration skills
  • Scripting proficiency in Python or Bash and comfort using AI-assisted coding tools like Copilot or Claude
  • Upper Intermediate English with strong communication skills for a distributed team
Job Responsibility
  • Manage and improve CDN configuration including caching rules, redirects, and traffic routing across a globally distributed edge
  • Triage CDN-related production issues through log analysis and performance investigations for high-traffic events
  • Review merge requests from product and platform engineering teams and advise on cache behavior and edge performance
  • Build and maintain CI/CD pipelines in GitLab CI for safely delivering CDN configuration changes
  • Partner with development teams across US and EU time zones to onboard new services behind the CDN
  • Maintain documentation, runbooks, and operational procedures while contributing to monitoring and alerting on edge traffic
What we offer
  • Opportunity to work with cutting-edge AI and cloud solutions
  • Internal training programs (Leadership, Public Speaking, and more) with full support for AWS and other professional certifications
  • Career growth: a clear path toward SA or beyond; we actively develop our engineers
  • Access to the latest AI tools and premium subscriptions
  • Long-term B2B collaboration
  • Remote with flexible hours
  • Private medical insurance or a budget for your medical needs
  • Paid sick leave, vacation, and public holidays
  • Equipment and all the tech you need for comfortable, productive work
  • Fulltime

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location: United States, San Francisco
Salary: 230000.00 - 345000.00 USD / Year
Company: Lambda (lambda.ai)
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
  • Fulltime

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...
Location: Egypt, Giza
Salary: Not provided
Company: Rackspace (rackspace.com)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering/computer science or equivalent
  • Senior-level experience with Site Reliability Engineering, DevOps, code-level application support and troubleshooting, AWS infrastructure design, implementation and optimization, and automation for deployment, scaling and reliability
  • Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
  • Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
  • Proactive approach to identifying problems and solutions
  • Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
  • Experience with Terraform or CloudFormation scripting
  • Experience with configuration management tools like Ansible, Chef or Puppet
  • Experience with standard software development best practices and tools such as code repositories (Git preferred)
  • Experience executing in an agile software development environment
Job Responsibility
  • Work with customers and implement Observability solutions
  • Build and maintain scalable systems and robust automation that supports engineering goals
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
  • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
  • Collaborate with team members to document and share solutions
  • Maintain a deep understanding of the customer’s business as well as their technical environment
  • Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues
  • Fulltime