Python Developer - Site Reliability Engineering (SRE)

Canada, Montreal · Job Posted March 24, 2026

Job Description

We are seeking a skilled Python Developer with experience in the Site Reliability Engineering (SRE) domain to build automation tools, improve system reliability, and support scalable infrastructure.

Job Responsibility

  • Develop quality software that works with public cloud service provider (CSP) infrastructure across different public cloud areas
  • Develop, enhance, and integrate automation workflows for Public Cloud Service Providers (CSP), initially focused on Azure, and integrate with in-house tooling
  • Integrate automation workflows into CI/CD pipelines using GitHub Actions and Jenkins
  • Build proof-of-concept solutions in new areas of cloud and automation development
  • Provide technical support and debugging for application failures in both on-premises and cloud environments
  • Participate in all phases of the Software Development Life Cycle (SDLC), including analysis, design, coding, testing, and deployment
  • Evaluate, onboard, and implement emerging DevOps and automation tools to improve efficiency
  • Build and integrate observability into cloud platforms and solutions using open-source tools (Prometheus, Grafana, OpenTelemetry)
  • Identify, highlight, and reduce operational toil through automation, architectural improvements, and process optimization
  • Collaborate with global teams to understand requirements, develop high‑quality code, and deliver cloud-focused projects

Requirements

  • 3+ years of experience with Python development
  • 6 years of experience working with Infrastructure as Code (Terraform and Ansible)
  • Experience with CI/CD pipelines, preferably GitHub Actions and Jenkins
  • Strong understanding of object-oriented design and development principles
  • Proficiency in Linux/Unix environments
  • Experience working with database technologies (preferably NoSQL), including data modeling, testing, and performance tuning
  • Ability to write reusable, optimized, maintainable, and well‑documented code following industry best practices
  • Experience implementing open-source monitoring and observability tools such as Prometheus, Grafana, Splunk, or OpenTelemetry
  • Strong problem‑solving skills and ability to take ownership of tasks and drive them independently to closure
  • Understanding of networking concepts (TCP/IP, DNS, Load Balancing)

Nice to have

  • Experience building cloud automation specifically for Azure
  • Experience evaluating new DevOps tools or contributing to internal automation frameworks
  • Exposure to multi-cloud environments or additional CSPs (AWS, GCP)
  • Familiarity with containerization or orchestration (Docker, Kubernetes)
  • Experience with high-scale systems or fintech platforms
  • Exposure to security and compliance practices
  • Knowledge of performance optimization and capacity planning

Similar Jobs for Python Developer - Site Reliability Engineering (SRE)

8 matching positions

Site Reliability Engineering (SRE) Team Lead

We are looking for a highly skilled and experienced Site Reliability Engineering...
Location
United States, Irving
Salary:
Not provided
OneMain Financial
Expiration Date
Until further notice
Requirements
  • BA/BS in Computer Science, Engineering, related field, or equivalent experience
  • 7+ years of experience in site reliability engineering, systems engineering, or related roles, with at least 2 years in a leadership position
  • Proven experience leading and scaling high-performing engineering teams
  • Deep expertise in cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes, Docker)
  • Strong skills in infrastructure as code tools (Terraform, Ansible, CloudFormation) and CI/CD pipelines
  • Proficiency with monitoring and alerting systems (Prometheus, Grafana, ELK, Datadog)
  • Solid programming and scripting skills (Python, Go, Bash, or similar)
  • Strong understanding of distributed systems, networking, security, and databases
  • Excellent leadership, communication, and collaboration skills
  • Experience managing incident response and on-call rotations
Job Responsibility
  • Lead, mentor, and grow a team of site reliability engineers, promoting a culture of reliability, automation, and continuous improvement
  • Drive the design, implementation, and maintenance of scalable and fault-tolerant infrastructure to support high-availability services
  • Oversee incident management processes, including triage, root cause analysis, and postmortems to improve system reliability and prevent recurrence
  • Collaborate cross-functionally with software engineering, product, and operations teams to integrate reliability best practices into the software development lifecycle
  • Define and implement operational metrics, SLIs/SLOs, and dashboards to monitor system health and drive proactive improvements
  • Manage and assess the observability of critical environments, proactively addressing gaps that may arise
  • Oversee the release management processes, artifacts and tools that drive a repeatable software delivery lifecycle
  • Champion automation efforts to reduce manual intervention, improve deployment pipelines, and optimize infrastructure management
  • Lead capacity planning, disaster recovery, and performance tuning efforts
  • Ensure security and compliance standards are upheld across infrastructure and operations
What we offer
  • Health and wellbeing options including medical, prescription, dental, vision, hearing, accident, hospital indemnity, and life insurances
  • Up to 4% matching 401(k)
  • Employee Stock Purchase Plan (10% share discount)
  • Tuition reimbursement
  • Paid time off (15 days’ vacation per year, plus 2 personal days, prorated based on start date)
  • Paid sick leave as determined by state or local ordinance, prorated based on start date
  • Paid holidays (7 days per year, based on start date)
  • Paid volunteer time (3 days per year, prorated based on start date)
  • Access to Talkspace and Hinge for on-demand physical therapy via an app
  • Family back-up care
  • Fulltime

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location
Canada, Mississauga
Salary:
115000.00 - 128000.00 CAD / Year
PointClickCare
Expiration Date
Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Integrate APIs to streamline data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1!
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more!
  • Fulltime

Site Reliability Engineering Lead

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...
Location
Canada, Mississauga
Salary:
120800.00 - 170800.00 USD / Year
Citi
Expiration Date
Until further notice
Requirements
  • 6–10 years of relevant experience in a hands‑on technical role
  • Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
  • Experience working with senior stakeholders or technology partners
  • Demonstrated experience supporting IT service improvements or platform stability initiatives
  • Strong communication and presentation skills, with the ability to convey technical concepts clearly
  • Experience supporting or contributing to technical roadmaps or operational workstreams
  • Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
  • Ability to collaborate with cross‑functional support teams and technology groups
  • Strong organizational and workload‑planning skills
  • Consistently demonstrates clear and concise written and verbal communication skills
Job Responsibility
  • Demonstrates a strong understanding of how application support contributes to the overall technology function and organizational objectives
  • Assist with vendor relationship management, including coordination with offshore managed services
  • Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
  • Partner with development teams to guide improvements in application stability and supportability
  • Contribute to frameworks for managing capacity, throughput, and latency
  • Assist in defining and implementing application onboarding guidelines and standards
  • Support team members by fostering a collaborative environment and encouraging skill development
  • Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
  • Participate in business review meetings to help align technology tools and strategies with business requirements
  • Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program
  • Fulltime

Site Reliability Engineering Analyst - Assistant Vice President

The Engineer Sr Analyst is an intermediate level position responsible for a vari...
Location
India, Pune
Salary:
Not provided
Citi
Expiration Date
Until further notice
Requirements
  • 5-8 years of relevant experience in an Engineering role
  • Experience working in Financial Services or a large complex and/or global environment
  • Project Management experience
  • Consistently demonstrates clear and concise written and verbal communication
  • Comprehensive knowledge of design metrics, analytics tools, benchmarking activities and related reporting to identify best practices
  • Demonstrated analytic/diagnostic skills
  • Ability to work in a matrix environment and partner with virtual teams
  • Ability to work independently, prioritize, and take ownership of various parts of a project or initiative
  • Ability to work under pressure and manage to tight deadlines or unexpected changes in expectations or requirements
  • Proven track record of operational process change and improvement
Job Responsibility
  • Contribute to the budgetary requirement definition for assigned product area, develop functional specifications, and create project plans and software release schedules
  • Partner with business and development teams to identify engineering requirements and assist in defining application and system requirements and processes and maintain engineering relationships with the end user/client
  • Ensure requirements/tasks from technology departments and/or end users are communicated to stakeholders
  • Provide solutions and processes in accordance with audit initiatives and requirements and consult with Business Information Security officers (BISOs) and TISOs
  • Exhibit in-depth understanding of engineering concepts and principles
  • Assist with training activities and mentor junior team members
  • Appropriately assess risk when business decisions are made, demonstrating particular consideration for the firm's reputation and safeguarding Citigroup, its clients and assets, by driving compliance with applicable laws, rules and regulations, adhering to Policy, applying sound ethical judgment regarding personal behavior, conduct and business practices, and escalating, managing and reporting control issues with transparency
  • Automate Core Processes: Design, develop, and implement automation solutions to replace manual activities and repetitive processes, and to support migrations to new infrastructure
  • Continuous Improvement: Proactively identify opportunities for process improvements and efficiency gains across the service lifecycle
  • Support AI Integration: Collaborate with development and data science teams to support the seamless integration of services with AI solutions
  • Fulltime

Director, Site Reliability Engineering

The Director of Site Reliability Engineering (SRE) will provide strategic leader...
Location
United States, Mountain View
Salary:
315000.00 - 385000.00 USD / Year
EarnIn
Expiration Date
Until further notice
Requirements
  • BS, MS, or PhD degree in Computer Science, Engineering, or related field, or related experience
  • 7+ years of experience in the field, including 3+ years leading SRE teams or a team in a similar role
  • Strong experience with container orchestration (Kubernetes), infrastructure as code (Terraform), and CI/CD pipelines
  • Hands-on experience with observability platforms (e.g., Datadog, Prometheus, Grafana) and incident management tools (e.g., incident.io, PagerDuty)
  • Proficiency in at least one programming language (Python, Go, or Java) with the ability to review code and guide system design decisions
  • Proven experience in architecting and managing highly available, scalable, and fault-tolerant systems
  • Ability to define a clear reliability vision and inspire teams and stakeholders toward long‑term reliability goals
  • Demonstrated sound judgment and calm decision‑making under pressure, particularly during high‑severity incidents
  • Strong people leadership skills, with experience coaching and mentoring engineering talent, developing future leaders, and aligning peer engineering managers and leaders on reliability best practices
  • Strategic planning skills with a track record of aligning technical direction with organizational objectives
Job Responsibility
  • Drive organizational transformation toward SRE principles and own the strategic direction for reliability maturity, cultivating a culture centered on reliability, efficiency, and continuous improvement
  • Develop and oversee automation strategies, tools, and frameworks that improve system reliability, reduce operational toil, and enhance team productivity
  • Architect and evolve robust observability, monitoring, and alerting systems
  • Champion chaos engineering and resilience testing practices to proactively validate system behavior under failure conditions
  • Partner with engineering, product, and operations teams to embed SRE practices throughout the development lifecycle and influence architectural decisions for reliability
  • Build, mentor, and develop a high‑performing global SRE organization, fostering technical excellence, career growth, and a strong culture of knowledge sharing
  • Oversee capacity planning, scalability assessments, and future‑state demand forecasting across critical systems
  • Lead and govern high‑severity incident response practices—ensuring rapid triage, thorough root cause analysis, and follow‑through on corrective and preventative actions
What we offer
  • equity and benefits
  • Fulltime

AI Platform Site Reliability Engineering Specialist

The AI Platform Site Reliability Engineering Specialist will operate and maintai...
Location
India, Bengaluru
Salary:
Not provided
NTT DATA
Expiration Date
Until further notice
Requirements
  • Bachelor's or Master's degree in Computer Science or related field, or equivalent job experience
  • 5 years of production experience in SRE / Infrastructure / ops for large-scale systems
  • Strong programming/scripting skills (Python, Go, Java, or equivalent)
  • Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
  • Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
  • Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
  • Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
  • Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
  • Solid experience in capacity planning, performance tuning, scaling, and incident response
  • Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
Job Responsibility
  • Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
  • Design and build automation for core platform capabilities, reducing manual toil
  • Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
  • Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
  • Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
  • Perform capacity planning, scaling strategies, workload scheduling and resource forecasting
  • Optimize cost vs. performance trade-offs in large-scale compute environments
  • Harden systems for security, compliance, auditability, and data governance
  • Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
  • Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...
Location
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • 7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Minimum 2 years of experience managing or leading cloud operations teams
  • Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
  • Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
  • Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
  • Familiarity with modern CI/CD automation and tools
  • Excellent communication, stakeholder management, and team-building skills
  • Experience scaling SRE practices in high-growth or large-scale environments
  • Ability to balance long-term reliability initiatives with short-term delivery needs.
Job Responsibility
  • Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
  • Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
  • Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
  • Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
  • Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
  • Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
  • Define and track key reliability metrics, and report on team performance and system health to leadership
  • Contribute to hiring, onboarding, and career development for SREs.
What we offer
  • Health & Wellbeing benefits for physical, financial, and emotional wellbeing
  • Personal & Professional Development programs
  • Unconditional inclusion in the workplace.
  • Fulltime

Site Reliability Engineer (SRE) - Identity Access Management IAM

Join us as a Site Reliability Engineer (SRE) - Identity Access Management. You w...
Location
India, Pune
Salary:
Not provided
Barclays
Expiration Date
Until further notice
Requirements
  • Experience in designing, implementing, deploying, and running highly available, fault-tolerant, auto-scaling and auto-healing systems
  • Strong expertise in AWS (essential; Azure and GCP (Google Cloud Platform) are a plus), including Kubernetes (ECS is essential; Fargate and GCE are a plus) and serverless architectures
  • Strong experience in running disaster recovery, zero downtime solutions and in designing and implementing continuous delivery across large-scale, distributed, cloud-based micro service and API service solutions with 99.9%+ uptime
  • Hands-on experience coding in Python, Bash, and JSON/YAML (Configuration as Code)
  • The ability to drive reliability best practices across engineering teams, embed SRE principles into the DevSecOps lifecycle and partner with engineering, security and product teams, to balance reliability and feature velocity
  • Experience in hands-on configuration, deployment, and operation of ForgeRock COTS based IAM (Identity Access Management) solutions (PingGateway, PingAM, PingIDM, PingDS) with embedded security gates, HTTP header signing, access token and data at rest encryption, PKI based self-sovereign identity, or open source
Job Responsibility
  • Applying software engineering techniques, automation, and best practices in incident response, to ensure the reliability, availability, and scalability of the systems, platforms, and technology through them
  • Availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning
  • Resolution, analysis, and response to system outages and disruptions, and implementation of measures to prevent similar incidents from recurring
  • Development of tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience
  • Monitoring and optimisation of system performance and resource usage, identification and resolution of bottlenecks, and implementation of best practices for performance tuning
  • Collaboration with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations
  • Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth
What we offer
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
  • Fulltime