Site Reliability Engineer Job at MaintainX (Montreal & Toronto)

Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to strengthen the re...

Location

United States, San Francisco

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, infrastructure engineering, or a closely related production environment role
Strong background operating, supporting, and troubleshooting distributed systems at scale
Hands-on experience with observability platforms such as Datadog, Grafana, OpenTelemetry, CloudWatch, or similar tools
Proven involvement in on-call operations, incident management, and reliability-focused problem resolution
Practical experience defining and using SLIs, SLOs, and error budgets to guide service reliability decisions
Familiarity with AWS environments, including serverless and container-based architectures
Experience working with relational databases such as Postgres and performance analysis in production systems
Ability to write automation scripts or lightweight tooling in languages such as Python or Bash, with strong judgment around failure modes and system design

Job Responsibility

Establish measurable reliability standards for critical services by creating and maintaining service indicators, objectives, and error budget practices
Take ownership of production stability by monitoring uptime, latency, and availability, and driving improvements that reduce operational risk
Lead live incident response efforts, coordinate troubleshooting during outages, and ensure issues are resolved efficiently and thoroughly
Run blameless post-incident reviews, document findings clearly, and track corrective actions through completion
Design and enhance observability across logs, metrics, and distributed tracing using tools such as Datadog, CloudWatch, Grafana, OpenTelemetry, and Sentry
Improve alert quality and dashboard design so engineering teams can quickly identify meaningful system issues without unnecessary noise
Evaluate system behavior under load, uncover performance constraints, and recommend changes that improve scalability and resource efficiency
Build automation and internal tooling that streamline operational work, strengthen deployment safety, and support incident management, debugging, and capacity planning
Contribute to infrastructure and delivery workflows across AWS, Terraform, Ansible, Linux, and GitHub Actions with a focus on dependable releases and resilient systems
Partner with security and compliance stakeholders to support operational standards, audit readiness, and the integration of monitoring into broader engineering practices

What we offer

Medical, vision, dental, and life and disability insurance
Enrollment in company 401(k) plan

Site Reliability Engineer

Join us as a Site Reliability Engineer at Barclays, where you will play a pivota...

Location

United Kingdom, Glasgow

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

Experience with Linux/Unix, Java applications, Oracle, SQL (PL/SQL, MySQL), and RDBMS, with exposure to writing queries, tuning concepts, data analysis, etc.
Experience with at least two of the following middleware technologies: JBoss, Apache Tomcat, GlassFish.
Experience with cloud technologies: AWS, OpenShift/APIs.
Experience with monitoring tools such as AppDynamics and ITRS, and scheduling tools such as TWS and/or AutoSys.

Job Responsibility

Apply software engineering techniques, automation, and incident-response best practices to ensure the reliability, availability, and scalability of systems, platforms, and technology.
Ensure the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
Respond to, analyse, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring.
Develop tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
Monitor and optimise system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.
Collaboration with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations.
Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth.

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime

Site Reliability Engineer

Location

Malaysia, Kuala Lumpur

Salary:

15000.00 MYR / Month

Randstad

Expiration Date

August 21, 2026

Requirements

Cloud Infrastructure: AWS
Containerization: Kubernetes and Docker
Operating Systems: Linux and Unix Systems
Database Systems: Oracle Database
Programming/Scripting: Python, Java, or Go for automation scripting
Automation & Infrastructure Tools: Ansible and Terraform
Monitoring & Observability: Prometheus, Grafana, and Nagios
Integration: API and Networking Integration

Job Responsibility

Maintain continuous system monitoring and configure active alerts to prevent failures
Automate manual operational tasks, system monitoring, and infrastructure provisioning
Participate in deep-dive troubleshooting and rigorous post-mortem analysis to minimize downtime
Manage the technical resumption of high-priority, Service at Risk (S@R), and medium/high severity incidents within SLAs
Direct second- and third-level support teams and perform Root Cause Analysis (RCA)
Review system dependencies and manage changes, releases, and rollouts for minimal stability impact
Lead the team in upholding the organization's strict conduct, compliance, and market principles
Take end-to-end accountability for incident, problem, change, and risk management related to the production platform
Surface operational/security risks and provide monthly governance dashboards outlining trends and Service Improvement Plans (SIP)

Fulltime

Site Reliability Engineer

We’re hiring an SRE to help improve the availability, performance, scalability, ...

Location

Israel, Netanya/Tel Aviv

Salary:

Not provided

JFrog

Expiration Date

Until further notice

Requirements

2-4 years of experience in SRE, Production Engineering, DevOps, or a similar role with hands-on production exposure
Strong troubleshooting and analytical skills, with the ability to investigate production issues in a structured and methodical way
Hands-on experience with Kubernetes-based containerized workloads
Experience with at least one public cloud provider: AWS, GCP, or Azure
Experience developing backend services, internal platforms, automation, or production engineering tools using Python, Go, or another programming language
Practical understanding of Linux fundamentals, networking concepts, HTTP, DNS, service connectivity, and production troubleshooting
Familiarity with CI/CD tools such as Jenkins, ArgoCD, GitHub Actions, or similar
Exposure to observability tools covering metrics, logs, and traces, such as Prometheus, Grafana, Coralogix, New Relic, or similar platforms
Understanding of incident management processes, alerting systems, and production support workflows
Ability to learn quickly, take ownership, communicate clearly, and work well in a collaborative production environment

Job Responsibility

Support the reliability, availability, performance, and scalability of JFrog’s large-scale, multi-cloud, Kubernetes-based SaaS environments
Investigate and troubleshoot production issues across distributed systems, infrastructure, Kubernetes, and cloud environments in close collaboration with Engineering teams
Design and develop backend services, internal platforms, and production engineering tools using Python, Go, or similar technologies
Improve reliability, observability, and operational readiness through SRE practices, monitoring and alerting, capacity awareness, postmortems, and safer CI/CD and production change processes
Evaluate and contribute to AI-assisted and agentic automation solutions that improve operational efficiency, troubleshooting, and production workflows
Support resilience initiatives, including disaster recovery validation, service readiness, health checks, and production readiness reviews
Participate in on-call rotations, lead incident response when needed, and drive follow-up actions to prevent recurrence
Continuously learn and evaluate new technologies that can improve reliability, automation, and operational excellence

Fulltime

Site Reliability Engineer

Microsoft Substrate is the foundational cloud platform that powers many of Micro...

Location

United States, Redmond

Salary:

102100.00 - 202200.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
4+ years technical experience in software engineering, network engineering, or systems administration
Ability to meet Microsoft, customer, and/or government security screening requirements
Ability to obtain and maintain a favorably adjudicated Tier 3 (T3) background investigation
Ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
Must pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Job Responsibility

Own reliability and operational health for one or more Substrate components or services in highly regulated environments
Serve as an actively engaged on-call engineer (OCE), participating in an on-call rotation and independently responding to incidents for owned services
Respond to, diagnose, and resolve production incidents with minimal supervision
Design and implement automation to reduce operational toil and improve service stability
Develop and maintain monitoring, alerting, and telemetry to support SLOs and operational metrics
Lead post-incident reviews for owned incidents, focusing on root cause analysis and durable fixes
Collaborate with software engineering teams to embed reliability and operability into service design
Write and maintain production-quality code and automation that improves reliability, scalability, and operational efficiency

What we offer

May be eligible for benefits and other compensation
Additional benefits and pay information available at https://careers.microsoft.com/us/en/us-corporate-pay

Fulltime

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...

Location

Canada, Montreal

Salary:

200000.00 CAD / Year

Hunter Bond

Expiration Date

Until further notice

Requirements

Genuine passion for Linux and open source
Excellent knowledge of Python
Use of CI/CD, Docker, Ansible, Chef, Puppet
Knowledge of large-scale storage systems (on-prem)

Job Responsibility

Help architect resilient, multi-petabyte storage solutions and build new data centres
Automate anything and everything with Python & config tools
Innovate whilst bringing in new ideas

What we offer

Flexible hours/work options
Working in one of the world’s most elite teams
Invest heavily in cutting-edge and next-gen tech
Technologists only report to other technologists
Brand new skyline Manhattan office
Start-up style environment

Fulltime

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...

Location

Mexico, Guadalajara

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
Familiar with PaaS and IaaS - VMs, Storage, EventHub, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), CosmosDB, SQL Server, IoT Hub, Databricks, KeyVault, Datalake
Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
Understand the concepts behind tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, and Ansible
Understand the concept of container orchestration platforms (e.g. Kubernetes)
Understand the basics of scripting: PowerShell, Python
Understand the difference between NoSQL and SQL databases, and how to maintain them

Job Responsibility

Perform L1.5 activities such as monitoring, deployment, rollback
Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)

Fulltime

Site Reliability Engineer

We are looking for a Lead Site Reliability Engineer (SRE) with strong experience...

Location

India, Bangalore

Salary:

Not provided

Karix

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE / DevOps / Production Engineering roles
Strong expertise in troubleshooting distributed systems and microservices architecture
Hands-on experience with Kafka, RabbitMQ, and Redis
Strong knowledge of Kubernetes and container orchestration
Experience with CI/CD pipelines and deployment automation
Solid understanding of Linux, networking, and cloud platforms (AWS / Azure / GCP)
Experience with Infrastructure as Code (Terraform, Ansible)
Strong scripting skills (Python, Bash, or similar)
Database experience: MySQL / Oracle / MongoDB
Strong problem-solving, ownership mindset, and ability to lead initiatives

Job Responsibility

Lead troubleshooting and resolution of complex production issues in distributed systems
Drive reliability engineering practices, ensuring high availability and performance of systems
Manage and optimize messaging systems like Apache Kafka, RabbitMQ, and Redis
Architect, manage, and optimize Kubernetes clusters for scalability and resilience
Manage CI/CD pipelines and drive deployment automation
Implement and maintain monitoring, alerting, and observability using Prometheus, Grafana, and ELK stack
Lead incident management, root cause analysis (RCA), and post-mortem reviews
Mentor junior engineers and collaborate with cross-functional teams to improve system design and reliability

What we offer

Impactful Work: Play a key role in ensuring reliability and scalability of platforms that handle large-scale, real-time communication systems
Tremendous Growth Opportunities: Accelerate your career by leading critical reliability initiatives and working on high-scale distributed systems
Innovative Environment: Work in a fast-paced ecosystem that embraces automation, cloud-native technologies, and continuous improvement

Fulltime
