
SRE / Observability Engineer

Canada, Toronto · 140000.00 USD / Year · Job Posted March 19, 2026

Job Description

We are looking for a Mid-Level Observability Engineer to help implement, operate, and improve observability capabilities across our applications and platforms. This role focuses on hands-on onboarding, instrumentation, dashboarding, and alerting, working under established standards and guidance from senior engineers. You will collaborate with application, SRE, and operations teams to ensure systems are observable, supportable, and production-ready.

Job Responsibility

Observability Implementation
  • Implement and maintain metrics, logs, and traces for applications and infrastructure
  • Assist with onboarding applications into observability platforms (e.g., Dynatrace, ELK, Datadog)
  • Configure dashboards, alerts, and basic anomaly detection

Application Support and Instrumentation
  • Work with development teams to enable structured logging, basic distributed tracing, and core metrics
  • Validate observability requirements during Production Readiness Reviews (PRR)
  • Troubleshoot missing or low-quality telemetry

Monitoring and Alerting
  • Configure alerts based on golden signals (latency, errors, traffic, saturation)
  • Help reduce alert noise by tuning thresholds and alert logic
  • Support incident response by gathering logs, metrics, and traces

Operations and Reliability
  • Support root cause analysis using observability tools
  • Maintain dashboards and documentation used by on-call and support teams
  • Participate in on-call rotations (as applicable)

Automation and Continuous Improvement
  • Assist in automating observability onboarding and validation tasks
  • Create and maintain reusable dashboards and alert templates
  • Follow established observability standards and best practices

Requirements

  • 2–4 years of experience in observability or SRE
  • Working knowledge of metrics, logs, and basic tracing concepts
  • Hands-on experience with at least one observability platform (Dynatrace, Elastic/ELK, Datadog, New Relic, etc.)
  • Basic understanding of SLIs, SLOs, and service health indicators
  • Experience with cloud platforms or hybrid environments
  • Ability to write scripts (Python, Bash, PowerShell) for automation and troubleshooting

Nice to have

  • Experience with OpenTelemetry or APM agents
  • Familiarity with Kubernetes or containerized workloads
  • Experience working with incident management tools (PagerDuty, ServiceNow)
  • Exposure to Dynatrace, Kibana/ELK, or similar cloud-native monitoring tools
  • Experience in regulated or enterprise environments

Similar Jobs for SRE / Observability Engineer

8 matching positions

Site Reliability Engineer / Observability Engineer

Rackspace is building up its Professional Services Center of Excellence on Appli...
Location: Egypt, Giza
Salary: Not provided
Company: Rackspace
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in engineering/computer science or equivalent
  • Senior-level experience with Site Reliability Engineering, DevOps, code-level application support and troubleshooting, AWS infrastructure design, implementation and optimization, and automation for deployment, scaling and reliability
  • Experience with observability tools such as Splunk, Datadog, SignalFx, etc.
  • Experience deploying, maintaining and supporting software applications/services in the AWS ecosystem
  • Proactive approach to identifying problems and solutions
  • Experience writing code with one or more interpreted languages such as Python, PHP, Perl, Ruby, Linux Shell
  • Experience with Terraform or CloudFormation scripting
  • Experience with configuration management tools like Ansible, Chef or Puppet
  • Experience with standard software development best practices and tools such as code repositories (Git preferred)
  • Experience executing in an agile software development environment
Job Responsibility
  • Work with customers and implement Observability solutions
  • Build and maintain scalable systems and robust automation that supports engineering goals
  • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
  • Proactively gather and analyze both metric and log data from systems and applications to perform anomaly detection, performance tuning, capacity planning and fault isolation
  • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability, security and performance standards
  • Collaborate with team members to document and share solutions
  • Maintain a deep understanding of the customer’s business as well as their technical environment
  • Identify performance bottlenecks and anomalous system behavior, and resolve the root cause of service issues
Employment Type: Full-time

Senior Systems Operations Engineer - SRE and AIOps

Wells Fargo is seeking a Senior Systems Operations Engineer within the Enterpris...
Location: India, Hyderabad
Salary: Not provided
Company: Wells Fargo
Expiration Date: June 22, 2026
Requirements
  • 4+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • Strong Java / backend service development experience
  • Distributed systems and API-based service design
  • CI/CD pipelines and Git-based workflows
  • 3+ years of experience with scripting and infrastructure automation using Terraform
  • 3+ years of hands-on experience with OpenShift, GCP or Azure platform enablement and application migrations, build out of complex infrastructure programmable patterns using Infrastructure as Code (IaC)
  • 2+ years of knowledge and understanding of cloud service offerings such as data, analytics, AI/ML on GCP or Azure
  • 2+ years of experience with key services provided by Azure and/or GCP such as BigQuery, Vertex AI, Dataproc, Functions, AKS, Service Fabric
  • 2+ years working in a globally distributed team to provide innovative and robust cloud centric solutions
  • 2+ years gathering and analyzing data to diagnose the root cause of cloud workload issues, recommending and implementing solutions to resolve issues in a timely manner
Job Responsibility
  • Lead or participate in managing all installed systems and infrastructure within the Systems Operations functional area
  • Contribute to increasing system efficiencies and lowering the human intervention time on related tasks
  • Review and analyze moderately complex operational support systems, application software, and system management tools to ensure the highest levels of systems and infrastructure availability
  • Work with vendors and other technical personnel for problem resolution
  • Lead team to meet technical deliverables while leveraging solid understanding of technical process controls or standards
  • Collaborate with vendors and other technical personnel to resolve technical issues and achieve highest levels of systems and infrastructure availability
Employment Type: Full-time

Senior Site Reliability Engineer (SRE)

The Senior SRE is responsible for deployment, updates, and operational support f...
Location: India, Chennai
Salary: Not provided
Company: Dalet
Expiration Date: Until further notice
Requirements
  • Cloud platforms: AWS, Azure
  • Containerisation & Orchestration: Kubernetes
  • Infrastructure as Code: Terraform
  • Configuration Management: Ansible
  • Packaging & Deployment: Helm
  • Databases: MariaDB, MongoDB
  • Monitoring, observability, networking, and cloud security.
Job Responsibility
  • Act as a senior technical authority for APAC Site Reliability Engineering activities
  • Drive best practices in reliability, operations, and engineering standards
  • Promote technical excellence, collaboration, and accountability across stakeholders
  • Make infrastructure complexity transparent to both internal teams and customers, ensuring a consistently excellent client experience
  • Implement, track, and evolve service performance measures such as SLAs, SLOs, and SLIs
  • Anticipate risks related to service availability, capacity, performance regressions, and security vulnerabilities
  • Drive continuous improvement, including leading and facilitating Root Cause Analysis (RCA) activities
  • Ensure timely execution of deployments, upgrades, maintenance activities, and change requests
  • Anticipate workload, plan deliverables, and ensure qualification/validation of upcoming tasks
  • Collaborate closely with engineering to improve platform components, automation, and operational processes
What we offer
  • Great career opportunities around the world
  • Truly collaborative environment with supportive leadership
  • Cutting edge technologies (AI, Cloud, Cybersecurity...)
  • Talented and passionate team members
  • Fun working environment
Employment Type: Full-time

Senior Software Engineer - SRE

Hybrid: This role is categorized as hybrid and is expected to report to Austin ...
Location: United States, Austin; Warren
Salary: Not provided
Company: General Motors
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in computer science or a related field, or equivalent work experience
  • 7-10 years of software experience with strong proficiency in PostgreSQL and at least one other database technology (Oracle, SQL Server)
  • Proficiency in at least one programming language (e.g., Python, Go, Java) and familiarity with multiple language ecosystems
  • Solid understanding of operating systems, networking, distributed systems, databases, and storage architectures
  • Deep understanding of how code runs on underlying hardware, including operating systems, algorithms, and data structures
  • Ability to optimize or troubleshoot code by understanding its execution and the impact on system resources
  • Experience handling production incidents, including root cause analysis, mitigation, and working through complex system failures
  • Strong communication skills, with an ability to explain technical concepts to both engineering and business stakeholders
  • Commitment to collaborative problem-solving and shared ownership of services
  • Proven experience in automating manual processes, building deployment pipelines, or managing configuration systems
Job Responsibility
  • Develop tools and software to automate operational processes, improve system reliability, and reduce manual intervention
  • Lead, implement, and improve monitoring and observability frameworks, enabling proactive detection and resolution of incidents
  • Participate in an on-call rotation to diagnose, troubleshoot, and mitigate production incidents, ensuring minimal downtime and swift resolution
  • Work alongside developers to ensure the quality, scalability, and reliability of our database services
  • Practice shared ownership of services in production, fostering a "You build it, you run it" culture
  • Manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) to set reliability expectations effectively
  • Conduct deep-dive analyses of incidents and collaborate on post-incident reviews to derive learnings and prevent recurrence
  • Champion a culture of continuous improvement
  • Evaluate system performance and advocate for optimizations that reduce infrastructure costs while maintaining service reliability
Employment Type: Full-time

Network Automation Observability Engineer

Piper Companies is seeking a Network Automation Observability Engineer for a wor...
Location: United States, Raleigh-Durham
Salary: 140000.00 - 180000.00 USD / Year
Company: Piper Companies
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in network engineering, network automation, or related roles
  • Strong hands-on experience with Python for automation and scripting
  • Proven experience using Ansible or similar automation and configuration management tools
  • Deep understanding of core network protocols and enterprise network architectures
  • Experience with network observability platforms and concepts such as telemetry, monitoring, alerting, and logging
  • Familiarity with APIs, data models (YANG), and modern network operating systems is a plus
  • Strong problem-solving skills with the ability to collaborate in a fast-paced environment
Job Responsibility
  • Design, develop, and maintain network automation solutions using Python, Ansible, and related frameworks
  • Build automated workflows for network provisioning, configuration management, validation, and remediation
  • Apply strong expertise in network protocols including TCP/IP, BGP, OSPF, routing, switching, and VLANs
  • Implement and enhance network observability solutions using telemetry, SNMP, streaming data, logs, and metrics
  • Integrate network automation and observability tooling with CI/CD pipelines and source control systems
  • Partner with network, systems, and SRE teams to improve network reliability, performance, and scalability
  • Troubleshoot complex network, automation, and observability issues in production environments
What we offer
  • medical
  • dental
  • vision
  • 401(k)
  • PTO
  • Sick Leave as required by law
Employment Type: Full-time

DevOps Engineer / SRE

As a DevOps Engineer / SRE, you will be a generalist with a broad impact on our ...
Location: Serbia
Salary: Not provided
Company: Fundraise Up
Expiration Date: Until further notice
Requirements
  • 4+ years of experience as a DevOps Engineer, SRE, or Linux Systems Administrator
  • A strong foundation in Linux (we use Ubuntu), including core CLI troubleshooting tools
  • Solid experience with configuration management tools, particularly Ansible
  • Experience working with servers (VMs and/or bare metal), including setup and troubleshooting at the OS level
  • Proficiency in building and maintaining complex CI/CD pipelines (Jenkins experience is a major plus)
  • A good understanding of networking fundamentals, including TCP/IP and firewall configuration (iptables)
  • Experience with monitoring and observability principles (Prometheus/VictoriaMetrics stack preferred)
  • Experience working with Git
  • Scripting ability in Bash or Python
  • A high sense of ownership, responsibility, and attention to detail. We value professionals who are proactive and reliable
Job Responsibility
  • Work with servers (VMs and bare metal) at the OS level and below: configuration, maintenance, and troubleshooting
  • Automate infrastructure and routine operational tasks using Ansible and custom scripting (Bash / Python)
  • Build, maintain, and support complex CI/CD pipelines. We use scripted pipelines in Jenkins
  • Develop and support our monitoring and observability stack (Prometheus-style metrics, VictoriaMetrics, Grafana, Graylog)
  • Work with databases and data systems, including ClickHouse and MongoDB, with a focus on monitoring and operational stability
  • Investigate and resolve issues across Linux OS, networking, and application layers
  • Collaborate with engineers across teams to improve system reliability and automation
  • Take ownership of production systems and ensure stability and predictability in day-to-day operations
What we offer
  • 31 days off
  • 100% paid telemedicine plan
  • Home Office Setup Assistance: the company offers assistance with purchasing furniture (office chair, office desk, monitor) and other items to create a comfortable workspace
  • English learning courses
  • Relevant professional education
  • Gym or swimming pool
  • Co-working
  • Remote working
  • Stock options
Employment Type: Full-time

Lead Software Engineer - SRE

Wells Fargo is seeking a Lead Site Reliability Engineer (SRE) to join the WIMT P...
Location: United States, Charlotte; Saint Louis
Salary: 119000.00 - 187000.00 USD / Year
Company: Wells Fargo
Expiration Date: Until further notice
Requirements
  • 5+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
  • 5+ years of experience leading observability and monitoring tooling: Splunk, AppDynamics, Splunk Observability, Grafana, OpenTelemetry
  • 5+ years in infrastructure (Windows and Linux) support
  • 5+ years proven success in toil reduction initiatives
  • 5+ years in cloud application management especially OpenShift Container Platform
Job Responsibility
  • Design and implement scalability, reliability, and observability strategies for cloud and on-premise environments
  • Define SLIs (Service Level Indicators), SLOs (Service Level Objectives), and Error Budgets to improve system reliability
  • Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
  • Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
  • Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
  • Review and analyze complex, large-scale technology solutions for tactical and strategic business objectives, enterprise technological environment, and technical challenges that require in-depth evaluation of multiple factors, including intangibles or unprecedented technical factors
  • Drive adoption of NFRs, best practices-quality and compliance across observability and performance engineering
  • Ensure high availability and performance of production systems through proactive monitoring and incident response
  • Collaborate and consult with key technical experts, senior technology team, and external industry groups to resolve complex technical issues and achieve goals
  • Lead projects, teams, or serve as a peer mentor
What we offer
  • Health benefits
  • 401(k) Plan
  • Paid time off
  • Disability benefits
  • Life insurance, critical illness insurance, and accident insurance
  • Parental leave
  • Critical caregiving leave
  • Discounts and savings
  • Commuter benefits
  • Tuition reimbursement
Employment Type: Full-time

DevOps SRE Engineer

We are looking for a mid-senior SRE/DevOps Engineer (5–8 years) to build and sca...
Location: India, Bengaluru
Salary: Not provided
Company: Acuver Consulting
Expiration Date: Until further notice
Requirements
  • 5–8 years of experience in DevOps / SRE roles
  • Strong hands-on experience with AWS (preferred) and/or GCP
  • Expertise in Kubernetes and Docker, Terraform (Infrastructure as Code), and CI/CD tools (GitLab, Jenkins, or similar)
  • Experience with event-driven / asynchronous architectures (Kafka, Pub/Sub, etc.), monitoring and logging tools (Prometheus, Grafana, ELK, etc.), and microservices and distributed systems
  • Solid understanding of networking, load balancing, scaling strategies, and high-availability, fault-tolerant systems
Job Responsibility
  • Design and implement robust CI/CD pipelines (GitLab CI, Jenkins, or similar)
  • Enable automated build, test, and deployment workflows
  • Implement blue-green / canary deployments for zero-downtime releases
  • Ensure release traceability, rollback mechanisms, and deployment governance
  • Design, provision, and manage infrastructure on AWS (primary) and/or GCP
  • Build infrastructure using Infrastructure as Code (Terraform preferred)
  • Create reusable modules for scalable, secure, and standardized environments
  • Optimize cost, performance, and scalability of cloud resources
  • Deploy and manage applications using Docker & Kubernetes
  • Manage Kubernetes workloads using Helm charts
Employment Type: Full-time