Site Reliability Engineering Lead

Canada, Mississauga · 120800.00 - 170800.00 USD / Year · Job Posted March 22, 2026

Job Description

We are seeking an experienced and motivated team member to support our AI and DevOps Platform Support team in North America. This role is responsible for contributing to the stability, reliability, and performance of our critical AI and DevOps platforms. The team supports a wide range of services, including multiple AI applications, developer tools, and CI/CD pipeline technologies used across the organization. The ideal candidate will help lead a team of SRE and Support engineers, facilitate incident and problem resolution, and collaborate with engineering and development teams to enhance platform services and supportability. The role includes short‑term planning and coordination of actions and resources within the team.

Job Responsibility

  • Demonstrate a strong understanding of how application support contributes to the overall technology function and organizational objectives
  • Assist with vendor relationship management, including coordination with offshore managed services
  • Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
  • Partner with development teams to guide improvements in application stability and supportability
  • Contribute to frameworks for managing capacity, throughput, and latency
  • Assist in defining and implementing application onboarding guidelines and standards
  • Support team members by fostering a collaborative environment and encouraging skill development
  • Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
  • Participate in business review meetings to help align technology tools and strategies with business requirements
  • Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program
  • Perform other duties and functions as assigned
  • Support platform leadership in defining the platform roadmap and partnering with engineering teams and business stakeholders
  • Assist in executing resilience activities such as wargaming scenarios, chaos engineering tests, and disaster recovery drills
  • Contribute to automation initiatives aimed at reducing manual toil and improving platform efficiency
  • Support the enterprise‑wide observability strategy, including monitoring, logging, tracing, and alerting
  • Maintain hands‑on familiarity with platform architecture and services as needed for operational support
  • Assist in overseeing the operational health of production platforms (including OpenShift, ECS, CI/CD), ensuring SLAs are supported and incident processes are followed
  • Help implement and operate effective monitoring and observability strategies to support proactive issue detection and system health assessments

Requirements

  • 6+ years of relevant experience in a hands‑on technical or support leadership role
  • Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
  • Experience working with senior stakeholders or technology partners
  • Demonstrated experience supporting IT service improvements or platform stability initiatives
  • Strong communication and presentation skills, with the ability to convey technical concepts clearly
  • Experience supporting or contributing to technical roadmaps or operational workstreams
  • Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
  • Ability to collaborate with cross‑functional support teams and technology groups
  • Strong organizational and workload‑planning skills
  • Consistently demonstrates clear and concise written and verbal communication skills
  • Ability to communicate appropriately with relevant stakeholders
  • Bachelor’s/University degree required
  • Master’s degree preferred

Nice to have

  • Working knowledge of Generative AI concepts
  • Experience with CI/CD and configuration management tools
  • Experience with Red Hat OpenShift or similar Kubernetes technologies
  • Experience working with databases such as Postgres, Oracle, MongoDB, or Redis
  • Experience writing or maintaining code in Java, Python, Go, or similar languages
  • Hands‑on experience with modern observability and monitoring tools (e.g., Prometheus, Grafana, Splunk, ELK)

Similar Jobs for Site Reliability Engineering Lead

8 matching positions

Site Reliability Engineering Lead

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...
Location: Canada, Mississauga
Salary: 120800.00 - 170800.00 USD / Year
Company: Citi (https://www.citi.com/)
Expiration Date: Until further notice
Requirements
  • 6–10 years of relevant experience in a hands‑on technical role
  • Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
  • Experience working with senior stakeholders or technology partners
  • Demonstrated experience supporting IT service improvements or platform stability initiatives
  • Strong communication and presentation skills, with the ability to convey technical concepts clearly
  • Experience supporting or contributing to technical roadmaps or operational workstreams
  • Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
  • Ability to collaborate with cross‑functional support teams and technology groups
  • Strong organizational and workload‑planning skills
  • Consistently demonstrates clear and concise written and verbal communication skills
Job Responsibility
  • Demonstrate a strong understanding of how application support contributes to the overall technology function and organizational objectives
  • Assist with vendor relationship management, including coordination with offshore managed services
  • Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
  • Partner with development teams to guide improvements in application stability and supportability
  • Contribute to frameworks for managing capacity, throughput, and latency
  • Assist in defining and implementing application onboarding guidelines and standards
  • Support team members by fostering a collaborative environment and encouraging skill development
  • Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
  • Participate in business review meetings to help align technology tools and strategies with business requirements
  • Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program
Employment type: Full-time

Site Reliability Engineering (SRE) / Lead Engineer

We are currently seeking a Site Reliability Engineering (SRE) / Lead Engineer to...
Location: Mexico, Guadalajara
Salary: Not provided
Company: NTT DATA (nttdata.com)
Expiration Date: Until further notice
Requirements
  • 8-10+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
  • Hands-on experience with OpenTelemetry for distributed tracing and observability instrumentation
  • Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
  • Strong proficiency in Infrastructure as Code (IaC) using Terraform
  • Solid understanding of cloud platforms including AWS, GCP, or Azure
  • Experience with automation/configuration management tools like Ansible, Chef, or Puppet
  • Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
  • Experience managing Kubernetes and containerized environments (Docker, Helm)
  • Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
  • Excellent leadership, communication, and collaboration skills
Job Responsibility
  • Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
  • Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
  • Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
  • Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
  • Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
  • Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
  • Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
  • Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
  • Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence
Employment type: Full-time

Site Reliability Engineering (SRE) Team Lead

We are looking for a highly skilled and experienced Site Reliability Engineering...
Location: United States, Irving
Salary: Not provided
Company: OneMain Financial (onemainfinancial.com)
Expiration Date: Until further notice
Requirements
  • BA/BS in Computer Science, Engineering, related field, or equivalent experience
  • 7+ years of experience in site reliability engineering, systems engineering, or related roles, with at least 2 years in a leadership position
  • Proven experience leading and scaling high-performing engineering teams
  • Deep expertise in cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes, Docker)
  • Strong skills in infrastructure as code tools (Terraform, Ansible, CloudFormation) and CI/CD pipelines
  • Proficiency with monitoring and alerting systems (Prometheus, Grafana, ELK, Datadog)
  • Solid programming and scripting skills (Python, Go, Bash, or similar)
  • Strong understanding of distributed systems, networking, security, and databases
  • Excellent leadership, communication, and collaboration skills
  • Experience managing incident response and on-call rotations
Job Responsibility
  • Lead, mentor, and grow a team of site reliability engineers, promoting a culture of reliability, automation, and continuous improvement
  • Drive the design, implementation, and maintenance of scalable and fault-tolerant infrastructure to support high-availability services
  • Oversee incident management processes, including triage, root cause analysis, and postmortems to improve system reliability and prevent recurrence
  • Collaborate cross-functionally with software engineering, product, and operations teams to integrate reliability best practices into the software development lifecycle
  • Define and implement operational metrics, SLIs/SLOs, and dashboards to monitor system health and drive proactive improvements
  • Manage and assess the observability of critical environments, proactively addressing gaps that may arise
  • Oversee the release management processes, artifacts and tools that drive a repeatable software delivery lifecycle
  • Champion automation efforts to reduce manual intervention, improve deployment pipelines, and optimize infrastructure management
  • Lead capacity planning, disaster recovery, and performance tuning efforts
  • Ensure security and compliance standards are upheld across infrastructure and operations
What we offer
  • Health and wellbeing options including medical, prescription, dental, vision, hearing, accident, hospital indemnity, and life insurances
  • Up to 4% matching 401(k)
  • Employee Stock Purchase Plan (10% share discount)
  • Tuition reimbursement
  • Paid time off (15 days’ vacation per year, plus 2 personal days, prorated based on start date)
  • Paid sick leave as determined by state or local ordinance, prorated based on start date
  • Paid holidays (7 days per year, based on start date)
  • Paid volunteer time (3 days per year, prorated based on start date)
  • Access to Talkspace and Hinge for on-demand physical therapy via an app
  • Family back-up care
Employment type: Full-time

Site Reliability Engineering (SRE) / Observability Technical Lead

Join a dynamic team as a Site Reliability Engineer, leading observability and re...
Location: United Kingdom, London
Salary: Not provided
Company: NTT DATA (nttdata.com)
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in SRE, Observability, or DevOps roles, with leadership responsibilities
  • Proven expertise with Application Performance Monitoring (APM) tools such as New Relic, Datadog, AppDynamics, or Dynatrace
  • Hands-on experience with OpenTelemetry (OTel) for distributed tracing and observability instrumentation
  • Strong proficiency in Infrastructure as Code (IaC) using Terraform
  • Solid understanding of cloud platforms including AWS, GCP, or Azure
  • Experience with automation/configuration management tools like Ansible, Chef, or Puppet
  • Deep knowledge of CI/CD pipelines and tools such as GitHub Actions, Jenkins, or Azure DevOps
  • Experience managing Kubernetes and containerized environments (Docker, Helm)
  • Familiarity with log aggregation and analysis platforms like ELK Stack or Splunk
  • Excellent leadership, communication, and collaboration skills
Job Responsibility
  • Lead the strategic development and management of observability and reliability frameworks across the organization, ensuring alignment with business goals and technical requirements
  • Design and implement monitoring and observability solutions, collaborating with engineering teams to define standards and best practices
  • Manage Infrastructure as Code (IaC) initiatives using Terraform, coordinating with cloud and infrastructure teams to ensure scalable and secure deployments
  • Drive automation strategies for monitoring, alerting, and logging pipelines, focusing on process improvements and operational efficiency
  • Develop and maintain comprehensive observability roadmaps, including distributed tracing, logging, and metrics collection strategies
  • Collaborate with product management, sales, and pre-sales teams to provide technical expertise and support during solution design and customer engagements
  • Lead cross-functional teams to enhance CI/CD pipelines and deployment reliability, ensuring smooth integration of observability tools and practices
  • Engage with vendors and strategic partners to evaluate, select, and integrate observability and monitoring solutions, ensuring alignment with organizational needs and fostering strong collaborative relationships
  • Mentor and develop junior engineers and analysts, fostering a culture of reliability, observability, and operational excellence
What we offer
  • Tailored benefits that support your physical, emotional, and financial wellbeing
  • Continuous growth and development opportunities
  • Flexible work options
Employment type: Full-time

Lead Site Reliability Engineer

Trimble is looking for a Site Reliability Engineering Lead to join Business Syst...
Location: India, Chennai
Salary: Not provided
Company: Trimble Inc. (trimble.com)
Expiration Date: Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Engineering, Computer Science, or a related field
  • 7+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles, with at least 2 years in a leadership or mentoring capacity
  • Deep AWS expertise (EC2, S3, RDS, IAM, VPC, Lambda, CloudFormation/Terraform, etc.)
  • Strong knowledge of Infrastructure-as-Code (IaC) using Terraform, AWS CDK, or CloudFormation
  • Proven experience with CI/CD tools (Jenkins, GitHub Actions, GitLab CI, or similar)
  • Proficiency in containerization and orchestration (Docker, Kubernetes, ECS, or EKS)
  • Expertise in monitoring and observability tools (Datadog, New Relic, Prometheus, Grafana, ELK, CloudWatch, etc.)
  • Strong scripting or programming background (Python, Bash, or Go)
  • Sound understanding of networking, security, and identity/access management in the cloud
  • Experience designing high-availability and disaster recovery strategies for critical workloads
Job Responsibility
  • Become well-versed in the opportunities and challenges of the business and Trimble's customers
  • Become an expert in Business Systems services, especially the interfaces—APIs, protocols (e.g. OAuth), and user interfaces
  • Establish, then utilize tight working relationships with stakeholders across the company, especially Trimble's engineering community
  • Prototype and create proofs of concept as required
  • Scope and deploy new integrations
  • Investigate, diagnose, and solve customer integration issues
  • Effectively communicate technical issues with stakeholders in non-technical language
  • Contribute to utilities and SDKs to help integration and migration efforts
Employment type: Full-time

Lead Site Reliability Engineer/ Expert

Responsible for ensuring highly reliable, scalable, and resilient production sys...
Location: Egypt, Cairo; India, Delhi
Salary: Not provided
Company: SITA (sita.aero)
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related field. Master’s degree preferred for senior roles
  • Relevant certifications such as ITIL, CCNP/CCIE, Palo Alto Security, SASE, SDWAN, Juniper Mist/Aruba, CompTIA Security+, or Certified Kubernetes Administrator (CKA)
  • Certifications in cloud platforms (AWS, Azure, Google Cloud) or DevOps methodologies
  • Certifications in automation and IaC tools (Ansible, Terraform)
  • Certifications in observability and monitoring platforms (Dynatrace, Prometheus, Grafana, ELK)
  • Certifications in ServiceNow, Jira, or other operational tooling
  • 8+ years in IT operations, service management, or infrastructure reliability, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Engineer
  • Strong experience with high availability systems, resilience engineering, and DR readiness
  • Deep expertise in RCA, incident management, PMIR, and implementing permanent fixes for recurring issues
  • Hands-on experience with CI/CD, automation, IaC, and self-healing/auto-remediation workflows
Job Responsibility
  • Design & maintain resilient systems ensuring high availability, scalability, and fault tolerance
  • Ensure effective Disaster Recovery (DR), failover strategies, and resilience engineering across environments
  • Improve platform reliability, observability, and performance across cloud and on‑premises systems
  • Establish and maintain SLIs, SLOs, and error budgets to measure and govern service reliability
  • Take ownership of production availability, capacity planning, performance tuning, and long‑term reliability initiatives
  • Drive automation for infrastructure provisioning, deployment, monitoring, and operational workflows
  • Develop and implement auto‑remediation and self‑healing solutions to reduce manual intervention
  • Manage CI/CD pipelines and Infrastructure as Code (IaC) frameworks for secure, repeatable deployments
  • Implement and manage zero‑downtime deployment strategies (blue‑green, canary, rolling)
  • Support containerized and cloud‑native platforms including Kubernetes, Docker, and distributed systems
What we offer
  • Work from home up to 2 days/week (depending on your team's needs)
  • Make your workday suit your life and plans
  • Take up to 30 days a year to work from any location in the world
  • Employee Assistance Program (EAP), for you and your dependents 24/7, 365 days/year
  • Champion Health - a personalized platform that supports a range of wellbeing needs
  • Access to world-class learning platforms and programs (LinkedIn Learning, Microsoft's Enterprise Skills Initiative, Airport Council International, Pluralsight, Harvard Business Publishing, Stanford)
  • Competitive benefits that make sense with both your local market and employment status
Employment type: Full-time

Director, Site Reliability Engineering

As our Director of Infrastructure platform, you will be a key driver of Doctolib...
Location: France, Paris
Salary: Not provided
Company: Doctolib (doctolib.fr)
Expiration Date: Until further notice
Requirements
  • 12+ years in software engineering, including 6+ years leading large (30+) distributed, international platform or infrastructure teams
  • Proven experience driving platform-as-a-product transformations and modularizing large monolithic architectures at scale
  • Demonstrated ability to architect, deliver, and operate secure, reliable, and scalable developer platforms in SaaS, multi-product, or regulated environments
  • Strong process orientation: experience implementing OKRs, robust monitoring/observability, and best-in-class incident management
  • Measurable impact on developer productivity, platform adoption, reliability, and cost-efficiency
  • Effective communicator and influencer, with the ability to align and inspire cross-functional stakeholders
  • Experience leading change and building high-performing, people-first engineering cultures
  • Fluent in English and comfortable in fast-paced, international environments
Job Responsibility
  • Lead and scale a high-performing infrastructure organization of 30+ engineers across Infrastructure, Automation, SRE, and Database teams, while maintaining strong engagement and fostering a culture of excellence and ownership
  • Own the infrastructure platform strategy and roadmap that enables Doctolib's modularization journey, delivers on company OKRs, and ensures predictable execution across all infrastructure and automation initiatives
  • Champion platform-as-a-product by building self-service capabilities (infrastructure provisioning, CI/CD, observability, database management) that transform developer experience and unlock team autonomy across the engineering organization
  • Be the guardian of quality and reliability by establishing world-class incident management, driving measurable improvements in availability and performance, and ensuring infrastructure components operate at the highest standards of security and resilience
  • Accelerate engineering velocity by reducing platform friction, enabling faster modularization, and leveraging AI-augmented development tools to multiply productivity across feature teams
  • Drive the infrastructure transformation from monolith-supporting infrastructure to a modular, multi-service platform architecture - enabling international expansion, product velocity, and operational excellence at scale
  • Act as a senior technical leader within the Platform organization and broader Tech leadership team, bringing strong technical opinions and challenging architectural decisions while clearly articulating how infrastructure investments contribute to company strategy and business outcomes
What we offer
  • Free comprehensive health insurance for you and your children
  • Parent Care Program: receive additional leave on top of the legal parental leave
  • Free mental health and coaching services through our partner Moka.care
  • For caregivers and workers with disabilities, a package including an adaptation of the remote policy, extra days off for medical reasons, and psychological support
  • Work from abroad for up to 10 days per year thanks to our flexibility days policy
  • Work Council subsidy to refund part of sport club membership or creative class
  • Up to 14 days of RTT
  • Lunch voucher with Swile card
Employment type: Full-time

Manager of Site Reliability Engineering (SRE)

The Manager of Site Reliability Engineering leads and develops a team of SRE pra...
Location: United States, Birmingham
Salary: Not provided
Company: Genuine Parts Company (genpt.com)
Expiration Date: Until further notice
Requirements
  • Typically requires a bachelor's degree and 7 years of experience in a technology and/or software engineering role or an equivalent combination
  • Proven experience working in large, complex enterprise environments (Fortune 500 or equivalent)
  • Strong understanding and demonstrated implementation of Site Reliability Engineering (SRE) principles at scale
  • Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform and ArgoCD
  • In-depth knowledge and practical experience with CI/CD pipelines and automation of software delivery
  • Championing DevOps practices and embedding reliability early in the SDLC
  • Significant hands-on experience in Site Reliability Engineering or related roles focused on cloud infrastructure reliability
  • Strong software engineering background with proficiency in infrastructure-as-code tools (e.g., Terraform, ArgoCD) and CI/CD automation
  • Deep knowledge of cloud platforms, specifically Google Cloud Platform (GCP), Kubernetes, container orchestration, and cloud-native architecture
  • Familiarity with monitoring and observability tools such as Dynatrace, Datadog, or equivalents
Job Responsibility
  • Lead, mentor, and grow a high-performing team of Site Reliability Engineers, fostering a culture of ownership, continuous improvement, and operational excellence
  • Implement and champion Site Reliability Engineering principles and DevOps best practices within the team to ensure service reliability, availability, and performance
  • Define and track key SRE metrics such as service uptime, incident response and resolution times
  • Drive automation efforts including CI/CD pipeline enhancements, infrastructure-as-code practices, and self-service infrastructure provisioning to increase deployment velocity while reducing manual toil
  • Own and continuously improve observability practices including system monitoring, logging, alerting, and diagnostics to ensure rapid issue detection and resolution
  • Participate in incident response processes including incident management, root cause analysis, post-mortems, and continuous improvement to enhance system resilience
  • Partner closely with software engineering, product management, architecture, and security teams to embed reliability and security early in the software development lifecycle (SDLC)
  • Oversee the management and scalability of cloud infrastructure environments, primarily on Google Cloud Platform (GCP), with a focus on Kubernetes, container orchestration, and hybrid cloud integrations
  • Advocate for and apply best practices in performance tuning, capacity planning, and system design for high availability
  • Develop and execute a long-term roadmap for our hybrid cloud platform, aligning with evolving business objectives and technology trends
What we offer
  • Comprehensive benefit plans and programs designed to support your health and wellness, provide income protection, and build financial security for your retirement
Employment type: Full-time