CrawlJobs Logo

Azure Site Reliability Engineer

Canada, Markham · Job Posted April 15, 2026
Apply Position
Job Link Share

Job Description

We are seeking a Senior Azure SRE to support a client’s Cloud & Infrastructure team. This role focuses on hands-on engineering, automation, monitoring, and operational reliability within an Azure environment.

Job Responsibility

  • Manage day-to-day operations of the Azure cloud platform
  • Support and execute disaster recovery (DR) testing activities
  • Enhance and automate Azure infrastructure deployment pipelines
  • Monitor system performance and availability using Azure Monitor
  • Participate in incident response, troubleshooting, and root cause analysis

Requirements

  • Strong hands-on experience in Azure cloud engineering and SRE practices
  • Experience with infrastructure automation and CI/CD pipelines
  • Proficiency with Azure Monitor and observability tools
  • Solid understanding of incident management and reliability engineering principles
  • Experience supporting production cloud environments
  • Strong troubleshooting and problem-solving skills
  • Ability to work in hybrid environments with cross-functional teams
  • Excellent communication skills

Looking for more opportunities?

Search for other job offers that match your skills and interests.

Similar Jobs for

Azure Site Reliability Engineer

8 matching positions

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...
Location
Location
India , Bangalore
Salary
Salary:
Not provided
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Minimum 2 years of experience managing or leading cloud operations teams
  • Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
  • Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
  • Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
  • Familiarity with modern CI/CD automation and tools
  • Excellent communication, stakeholder management, and team-building skills
  • Experience scaling SRE practices in high-growth or large-scale environments
  • Ability to balance long-term reliability initiatives with short-term delivery needs.
Job Responsibility
Job Responsibility
  • Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
  • Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
  • Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
  • Guide the team in building automation, improving observability, and improve operational efficiency of our cloud infrastructure
  • Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
  • Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
  • Define and track key reliability metrics, and report on team performance and system health to leadership
  • Contribute to hiring, onboarding, and career development for SREs.
What we offer
What we offer
  • Health & Wellbeing benefits for physical, financial, and emotional wellbeing
  • Personal & Professional Development programs
  • Unconditional inclusion in the workplace.
  • Fulltime
Read More
Arrow Right

Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team responsible for Private and Public...
Location
Location
Singapore , Singapore
Salary
Salary:
Not provided
https://www.citi.com/ Logo
Citi
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree or equivalent work experience
  • 6+ years of relevant work experience
  • Highly motivated self-starter with excellent interpersonal and communication skills
  • Certification or formal training in site reliability engineering concepts and practices
  • Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
  • 4+ years experience in Python (preferable) or Java, on large scale systems alongside Linux based scripting languages
  • Experience working on observability, logging and metrics toolsets
  • Experience of k8s and container technologies such as Docker, Openshift and EKS
  • Experience with public cloud technologies such as AWS, GCP or Azure
  • Experience with Secrets products such as HashiCorp Vault or CyberArk
Job Responsibility
Job Responsibility
  • Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
  • Architecting and building tools and platforms that provide capabilities for SRE
  • Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organisation
  • Actively owning production level incidents till resolution.
What we offer
What we offer
  • Equal opportunity employer
  • Accessibility support for persons with disabilities.
  • Fulltime
Read More
Arrow Right

Staff Site Reliability Engineer

We are looking for a Site Reliability Engineer to own our internal systems infra...
Location
Location
United States , Sunnyvale
Salary
Salary:
175000.00 - 250000.00 USD / Year
figure.ai Logo
Figure
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Strong experience with Linux/Unix systems administration
  • Proficiency in programming/scripting
  • Extensive experience with cloud platforms (Azure, AWS, GCP) and on-prem hardware architectures
  • Experience designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
  • Mastery of infrastructure as code (Terraform, CloudFormation, Ansible…)
  • Familiarity with monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
  • Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
  • Experience defining Service Level Objectives (SLO), developing runbooks/incident response plans, facilitating post-mortems and managing systems assets
  • Ability to work in cross-functional teams with developers, infra, and product teams
  • Excellent verbal and written communication skills
Job Responsibility
Job Responsibility
  • Be the go to person for mission critical infrastructure enabling critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing and more
  • Migrate SaaS to self-hosted solutions to enhance security and reliability
  • Implement monitoring and alerting systems, and define incident response plans and runbooks
  • Reduce human workload through automation to automate deployment and scaling
  • Establish strong relationships with stakeholders to identify infrastructure needs and establish Service Level Objectives
  • Use a data driven approach to demonstrate service robustness and track optimization work
  • Partner with the security team to ensure that security remediations and updates are applied in a timely manner
  • Fulltime
Read More
Arrow Right

Site Reliability Engineer

As a highly skilled Site Reliability Engineer (SRE), you will contribute to buil...
Location
Location
United States , New York City; San Francisco
Salary
Salary:
160000.00 - 300000.00 USD / Year
hebbia.ai Logo
Hebbia
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience
  • 5+ years software development experience at a venture-backed startup or top technology firm
  • Proven experience as a Site Reliability Engineer, DevOps Engineer, or similar role
  • Strong expertise in managing CI/CD pipelines and deployment automation
  • Proficiency in cloud platforms such as AWS, Azure, or Google Cloud (we are an AWS shop)
  • Solid understanding of containerization and orchestration technologies such as Docker and Kubernetes
  • Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, or similar
  • Knowledge of infrastructure-as-code (IaC) tools such as Terraform or CloudFormation
  • Familiarity with security best practices and tools for infrastructure and application security
  • Excellent problem-solving skills and the ability to troubleshoot complex issues
Job Responsibility
Job Responsibility
  • Assist in managing deployment pipelines to facilitate smooth and efficient software releases
  • Help implement and maintain observability solutions for monitoring system performance and reliability
  • Support local development environments to optimize developer workflows
  • Work with development teams to ensure infrastructure aligns with project requirements
  • Contribute to improving the security of our infrastructure by assisting with proactive measures and audits
  • Assist in developing and maintaining automation scripts and tools to enhance operational efficiency
  • Help troubleshoot and resolve infrastructure and application issues to minimize downtime and maintain smooth operations
  • Participate in evaluating and integrating new technologies to enhance the scalability, reliability, and security of our infrastructure
What we offer
What we offer
  • PTO: Unlimited
  • Insurance: Medical + Dental + Vision + 401K
  • Eats: Catered lunch daily + doordash dinner credit if you ever need to stay late
  • Parental leave policy: 3 months non-birthing parent, 4 months for birthing parent
  • Fertility benefits: $15k lifetime benefit
  • New hire equity grant: competitive equity package with unmatched upside potential
  • Fulltime
Read More
Arrow Right

Site Reliability Engineer

You develop cloud platform according to modern principles. You advise our custom...
Location
Location
Spain , Valencia
Salary
Salary:
Not provided
maibornwolff.de Logo
MaibornWolff GmbH
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Ideally, a degree in computer science or comparable training
  • Sound technical understanding
  • Idea of how to build and run a secure application in the cloud
  • Experience with container orchestration, ideally with Kubernetes
  • Experience with Infrastructure-as-Code tools such as Terraform, Helm, Ansible, or CDK
  • Experience in setting up the release management process using modern CI/CD systems
  • Knowledge of a cloud provider (AWS, Azure, Google Cloud) certified in the best case
  • Development skills in at least one object-oriented, functional or scripting language
  • Very good English and good German Skills
Job Responsibility
Job Responsibility
  • Develop cloud platform according to modern principles
  • Advise customers on the sensible use of services in the cloud with regard to effort, costs and maintenance
  • Live a vibrant DevOps culture internally and carry it to customers
  • Help the customer to introduce the correct release processes and implement them based on the modern CI/CD tools (Azure DevOps, Gitlab, Github)
  • Develop and integrate monitoring and logging infrastructure to improve application maintainability
  • Design and develop scalable and fail-safe IT architectures
What we offer
What we offer
  • Home Office & Office
  • Flexible Working Hours
  • Part-Time Models
  • Working Time Account
  • Sabbatical
  • 30 days of paid vacation
  • An annual training budget of 1.5 gross monthly salaries for training, certifications, conferences, and more
  • Corporate seminars
  • Christmas parties
  • Private health and dental insurance
Read More
Arrow Right

Senior Site Reliability Engineer

As a Senior Site Reliability Engineer on the Platform team, you will identify is...
Location
Location
United States , Denver; San Francisco
Salary
Salary:
138000.00 - 191000.00 USD / Year
https://checkr.com Logo
Checkr
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Degree in Computer Science (or related field)
  • 6+ years of experience in building tools with Python (preferred), GoLang, or Ruby
  • 6+ years of experience in maintaining and observing production customer-facing environments in AWS or Azure
  • 6+ years of experience as a member of an incident response team
  • Deep understanding of the fundamental infrastructure and platform concepts behind a micro-service architecture, REST APIs, and asynchronous queueing models
  • Experience with observability platforms and frameworks like Datadog, Splunk, Grafana, Prometheus, or OpenTelemetry
  • Strong collaboration, documentation, communication, and project management skills
  • Experience with container orchestration using Kubernetes/Docker/Terraform
  • Experience driving platform adoption across engineering teams, guided by a self-service and product-first approach
  • A passion for customer-centricity and building relationships with other teams
Job Responsibility
Job Responsibility
  • Collaborate, drive, and execute architectural discussions with cross-functional teams
  • Lead cross-team projects and SREs' technical roadmap to enable engineering and help Checkr customers
  • Design, build, ship, and maintain the core observability libraries, tools, and patterns used by all of Checkr’s engineering teams
  • Proactively engage across teams to foster service reliability, efficiency, and scalability
  • Troubleshoot complex production issues across the stack, with respect to performance, availability, and data quality
  • Present detailed technical information and benefits of the Checkr platform to a wide array of customers, including operations, developers, technical architects, and executives
What we offer
What we offer
  • A fast-paced and collaborative environment
  • Learning and development allowance
  • Competitive cash and equity compensation and opportunities for advancement
  • 100% medical, dental, and vision coverage
  • Up to $25K reimbursement for fertility, adoption, and parental planning services
  • Flexible PTO policy
  • Monthly wellness stipend, home office stipend
  • In-office perks such as lunch four times a week, commuter stipend, and an abundance of snacks and beverages
  • Fulltime
Read More
Arrow Right

Junior Site Reliability Engineer

As a Jr. Site Reliability Engineer, you will 'make things scale' which includes ...
Location
Location
United Kingdom
Salary
Salary:
Not provided
accesso.com Logo
accesso
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Some practical exposure to cloud platforms (AWS/Azure/GCP)—coursework, internships, or self-led projects
  • Ability to self-learn with assistance from Senior Engineers
  • Basic scripting ability using Python or Bash
  • Familiarity with basic Linux systems and general command–line
  • Understanding of Git and basic CI/CD concepts
  • Good written and verbal communication
  • customer-focused approach
  • Ability to work with minimal direction
  • Willingness to learn, take direction and work within a team
Job Responsibility
Job Responsibility
  • Assisting with provisioning and deploying accesso Horizon components to customer cloud accounts using Infrastructure as Code (Terraform)
  • Help maintain CI/CD pipelines (GitHub Actions) for application and infrastructure deployments
  • Support monitoring, logging and alerting (Prometheus, Grafana & Coralogix) and respond to basic alerts with supervision
  • Implement and improve basic automation and scripting
  • Participate in incident triage, root cause investigation and follow-up tasks
  • Follow security and compliance requirements for customer cloud environments (identity, secrets, network controls)
  • Produce and maintain operational runbooks, deployment guides and change notes
  • Participate in on-call rotation as a L1 responder
  • Normal workday may require time outside the normal working day
  • Learn and apply accesso Horizon product architecture and configuration
What we offer
What we offer
  • Competitive compensation package including an annual bonus opportunity
  • 8-days of paid bank holiday leave and 26-days of paid annual leave (paid leave increases with tenure)
  • 8 hours of paid Volunteer Time Off
  • Inclusive Family Benefits, including a $7,500 benefit for surrogacy, adoption, and fertility
  • Robust health insurance scheme with the opportunity to participate in private medical scheme after satisfactory performance
  • Matching pension scheme (up to 8%)
  • Unlimited access to Udemy for Business
  • Flexible work schedule
  • Fulltime
Read More
Arrow Right

Site Reliability Engineer

Corporate Tools is looking for a Site Reliability Engineer. You will be a tradit...
Location
Location
United States
Salary
Salary:
175000.00 USD / Year
corporatetools.com Logo
Corporate Tools
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree in Computer Science, Software Engineering, or equivalent practical experience
  • 5+ years of experience in software engineering
  • 2+ years of experience in site reliability engineering, DevOps, or infrastructure engineering roles
  • Deep experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code tools such as Terraform, CloudFormation, or Pulumi
  • Strong proficiency with Kubernetes, Docker, and container orchestration in production environments
  • Hands-on experience with observability and monitoring tools like Prometheus, Grafana, OpenTelemetry, Sentry, or New Relic
  • Proven ability to design and implement highly available, fault-tolerant systems and lead proactive incident response efforts
  • Experience with performance tuning, database optimization, and caching strategies (e.g., PostgreSQL, Redis, Memcached)
  • Demonstrated ability to drive reliability improvements, reduce operational toil, and foster a culture of resilience and continuous improvement
  • Experience leading reliability-focused initiatives such as post-incident reviews, capacity planning, and root cause analysis
Job Responsibility
Job Responsibility
  • Stop problems before they start
  • Fix issues quickly and learn from them
  • Help keep systems steady, secure, and running
  • Work closely with DevOps engineers to build out tools and automation
  • Take ownership
What we offer
What we offer
  • 100% employer-paid medical, dental and vision for employees
  • Annual review with raise option
  • 22 days Paid Time Off accrued annually, and 4 holidays
  • After 3 years, PTO increases to 29 days
  • Employees transition to flexible time off after 5 years with the company—not accrued, not capped, take time off when you want
  • Paid Parental Leave
  • Up to 6% company matching 401(k) with no vesting period
  • Quarterly allowance
  • Open concept office with friendly coworkers
  • Creative environment where you can make a difference
  • Fulltime
Read More
Arrow Right