Python Developer - Site Reliability Engineering (SRE) Job at NTT DATA (Montreal)

Site Reliability Engineering (SRE) Team Lead

We are looking for a highly skilled and experienced Site Reliability Engineering...

Location

United States, Irving

Salary:

Not provided

OneMain Financial

Expiration Date

Until further notice

Requirements

BA/BS in Computer Science, Engineering, related field, or equivalent experience
7+ years of experience in site reliability engineering, systems engineering, or related roles, with at least 2 years in a leadership position
Proven experience leading and scaling high-performing engineering teams
Deep expertise in cloud platforms (AWS, GCP, Azure) and container orchestration (Kubernetes, Docker)
Strong skills in infrastructure as code tools (Terraform, Ansible, CloudFormation) and CI/CD pipelines
Proficiency with monitoring and alerting systems (Prometheus, Grafana, ELK, Datadog)
Solid programming and scripting skills (Python, Go, Bash, or similar)
Strong understanding of distributed systems, networking, security, and databases
Excellent leadership, communication, and collaboration skills
Experience managing incident response and on-call rotations

Job Responsibility

Lead, mentor, and grow a team of site reliability engineers, promoting a culture of reliability, automation, and continuous improvement
Drive the design, implementation, and maintenance of scalable and fault-tolerant infrastructure to support high-availability services
Oversee incident management processes, including triage, root cause analysis, and postmortems to improve system reliability and prevent recurrence
Collaborate cross-functionally with software engineering, product, and operations teams to integrate reliability best practices into the software development lifecycle
Define and implement operational metrics, SLIs/SLOs, and dashboards to monitor system health and drive proactive improvements
Manage and assess the observability of critical environments, proactively addressing gaps that may arise
Oversee the release management processes, artifacts and tools that drive a repeatable software delivery lifecycle
Champion automation efforts to reduce manual intervention, improve deployment pipelines, and optimize infrastructure management
Lead capacity planning, disaster recovery, and performance tuning efforts
Ensure security and compliance standards are upheld across infrastructure and operations

What we offer

Health and wellbeing options including medical, prescription, dental, vision, hearing, accident, hospital indemnity, and life insurances
Up to 4% matching 401(k)
Employee Stock Purchase Plan (10% share discount)
Tuition reimbursement
Paid time off (15 days’ vacation per year, plus 2 personal days, prorated based on start date)
Paid sick leave as determined by state or local ordinance, prorated based on start date
Paid holidays (7 days per year, based on start date)
Paid volunteer time (3 days per year, prorated based on start date)
Access to Talkspace and Hinge for on-demand physical therapy via an app
Family back-up care

Fulltime

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Skilled in API integration for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Fulltime

Site Reliability Engineering Lead

Engineer the future of global finance. At Citi, our Tech team doesn’t just suppo...

Location

Canada, Mississauga

Salary:

120800.00 - 170800.00 USD / Year

Citi

Expiration Date

Until further notice

Requirements

6–10 years of relevant experience in a hands‑on technical role
Experience contributing to architecture discussions and ensuring solutions align with enterprise standards and long‑term maintainability
Experience working with senior stakeholders or technology partners
Demonstrated experience supporting IT service improvements or platform stability initiatives
Strong communication and presentation skills, with the ability to convey technical concepts clearly
Experience supporting or contributing to technical roadmaps or operational workstreams
Experience participating in resilience‑related activities such as incident simulations, disaster recovery exercises, or stability testing
Ability to collaborate with cross‑functional support teams and technology groups
Strong organizational and workload‑planning skills
Consistently demonstrates clear and concise written and verbal communication skills

Job Responsibility

Demonstrates a strong understanding of how application support contributes to the overall technology function and organizational objectives
Assist with vendor relationship management, including coordination with offshore managed services
Support efforts to improve service levels for end users by enhancing operational efficiencies and strengthening incident management, problem management, and knowledge‑sharing practices
Partner with development teams to guide improvements in application stability and supportability
Contribute to frameworks for managing capacity, throughput, and latency
Assist in defining and implementing application onboarding guidelines and standards
Support team members by fostering a collaborative environment and encouraging skill development
Participate in cost‑reduction efforts through Root Cause Analysis reviews, knowledge management, performance tuning, and user training
Participate in business review meetings to help align technology tools and strategies with business requirements
Ensure adherence to support processes and tool standards, and assist in enhancing processes to promote consistency and quality across the support program

Fulltime

Site Reliability Engineering Analyst - Assistant Vice President

The Engineer Sr Analyst is an intermediate level position responsible for a vari...

Location

India, Pune

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

5-8 years of relevant experience in an Engineering role
Experience working in Financial Services or a large complex and/or global environment
Project Management experience
Consistently demonstrates clear and concise written and verbal communication
Comprehensive knowledge of design metrics, analytics tools, benchmarking activities and related reporting to identify best practices
Demonstrated analytic/diagnostic skills
Ability to work in a matrix environment and partner with virtual teams
Ability to work independently, prioritize, and take ownership of various parts of a project or initiative
Ability to work under pressure and manage to tight deadlines or unexpected changes in expectations or requirements
Proven track record of operational process change and improvement

Job Responsibility

Contribute to the budgetary requirement definition for assigned product area, develop functional specifications, and create project plans and software release schedules
Partner with business and development teams to identify engineering requirements and assist in defining application and system requirements and processes and maintain engineering relationships with the end user/client
Ensure requirements/tasks from technology departments and/or end users are communicated to stakeholders
Provide solutions and processes in accordance with audit initiatives and requirements and consult with Business Information Security officers (BISOs) and TISOs
Exhibit in-depth understanding of engineering concepts and principles
Assist with training activities and mentor junior team members
Appropriately assess risk when business decisions are made, demonstrating particular consideration for the firm's reputation and safeguarding Citigroup, its clients and assets, by driving compliance with applicable laws, rules and regulations, adhering to Policy, applying sound ethical judgment regarding personal behavior, conduct and business practices, and escalating, managing and reporting control issues with transparency
Automate Core Processes: Design, develop, and implement automation solutions to replace manual activities and repetitive processes, and to support migrations to new infrastructure
Continuous Improvement: Proactively identify opportunities for process improvements and efficiency gains across the service lifecycle
Support AI Integration: Collaborate with development and data science teams to support the seamless integration of services with AI solutions

Fulltime

Director, Site Reliability Engineering

The Director of Site Reliability Engineering (SRE) will provide strategic leader...

Location

United States, Mountain View

Salary:

315000.00 - 385000.00 USD / Year

EarnIn

Expiration Date

Until further notice

Requirements

BS, MS, or PhD degree in Computer Science, Engineering, or related field, or related experience
7+ years of experience in the field, including 3+ years leading SRE teams or a team in a similar role
Strong experience with container orchestration (Kubernetes), infrastructure as code (Terraform), and CI/CD pipelines
Hands-on experience with observability platforms (e.g., Datadog, Prometheus, Grafana) and incident management tools (e.g., incident.io, PagerDuty)
Proficiency in at least one programming language (Python, Go, or Java) with the ability to review code and guide system design decisions
Proven experience in architecting and managing highly available, scalable, and fault-tolerant systems
Ability to define a clear reliability vision and inspire teams and stakeholders toward long‑term reliability goals
Demonstrated sound judgment and calm decision‑making under pressure, particularly during high‑severity incidents
Strong people leadership skills, with experience coaching and mentoring engineering talent, developing future leaders, and aligning peer engineering managers and leaders on reliability best practices
Strategic planning skills with a track record of aligning technical direction with organizational objectives

Job Responsibility

Drive organizational transformation toward SRE principles and own the strategic direction for reliability maturity, cultivating a culture centered on reliability, efficiency, and continuous improvement
Develop and oversee automation strategies, tools, and frameworks that improve system reliability, reduce operational toil, and enhance team productivity
Architect and evolve robust observability, monitoring, and alerting systems
Champion chaos engineering and resilience testing practices to proactively validate system behavior under failure conditions
Partner with engineering, product, and operations teams to embed SRE practices throughout the development lifecycle and influence architectural decisions for reliability
Build, mentor, and develop a high‑performing global SRE organization, fostering technical excellence, career growth, and a strong culture of knowledge sharing
Oversee capacity planning, scalability assessments, and future‑state demand forecasting across critical systems
Lead and govern high‑severity incident response practices—ensuring rapid triage, thorough root cause analysis, and follow‑through on corrective and preventative actions

What we offer

Equity and benefits

Fulltime

AI Platform Site Reliability Engineering Specialist

The AI Platform Site Reliability Engineering Specialist will operate and maintai...

Location

India, Bengaluru

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

Bachelor's or Master's degree in Computer Science or related field, or equivalent job experience
5 years of production experience in SRE / Infrastructure / ops for large-scale systems
Strong programming/scripting skills (Python, Go, Java, or equivalent)
Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)
Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
Networking and systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
Solid experience in capacity planning, performance tuning, scaling, and incident response
Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

Job Responsibility

Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)
Design and build automation for core platform capabilities, reducing manual toil
Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.
Establish, monitor and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards
Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation
Perform capacity planning, scaling strategies, workload scheduling and resource forecasting
Optimize cost vs. performance trade-offs in large-scale compute environments
Harden systems for security, compliance, auditability, and data governance
Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems
Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
Minimum 2 years of experience managing or leading cloud operations teams
Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
Familiarity with modern CI/CD automation and tools
Excellent communication, stakeholder management, and team-building skills
Experience scaling SRE practices in high-growth or large-scale environments
Ability to balance long-term reliability initiatives with short-term delivery needs

Job Responsibility

Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
Define and track key reliability metrics, and report on team performance and system health to leadership
Contribute to hiring, onboarding, and career development for SREs

What we offer

Health & Wellbeing benefits for physical, financial, and emotional wellbeing
Personal & Professional Development programs
Unconditional inclusion in the workplace

Fulltime

Site Reliability Engineer (SRE) - Identity Access Management IAM

Join us as a Site Reliability Engineer (SRE) - Identity Access Management. You w...

Location

India, Pune

Salary:

Not provided

Barclays

Expiration Date

Until further notice

Requirements

Experience in designing, implementing, deploying, and running highly available, fault-tolerant, auto-scaling and auto-healing systems
Strong expertise in AWS (essential; Azure and GCP (Google Cloud Platform) are a plus), including Kubernetes (ECS is essential; Fargate and GCE are a plus) and serverless architectures
Strong experience in running disaster recovery, zero downtime solutions and in designing and implementing continuous delivery across large-scale, distributed, cloud-based microservice and API service solutions with 99.9%+ uptime
Hands-on experience coding in Python, Bash and JSON/YAML (Configuration as Code)
The ability to drive reliability best practices across engineering teams, embed SRE principles into the DevSecOps lifecycle and partner with engineering, security and product teams, to balance reliability and feature velocity
Experience in hands-on configuration, deployment and operation of ForgeRock COTS-based IAM (Identity Access Management) solutions (PingGateway, PingAM, PingIDM, PingDS) with embedded security gates, HTTP header signing, access token and data-at-rest encryption, PKI-based self-sovereign identity, or open source

Job Responsibility

Applying software engineering techniques, automation, and best practices in incident response, to ensure the reliability, availability, and scalability of the systems, platforms, and technology through them
Availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning
Resolution, analysis and response to system outages and disruptions, and implementation of measures to prevent similar incidents from recurring
Development of tools and scripts to automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience
Monitoring and optimisation of system performance and resource usage, identification and resolution of bottlenecks, and implementation of best practices for performance tuning
Collaboration with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations
Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth

What we offer

Competitive holiday allowance
Life assurance
Private medical care
Pension contribution

Fulltime
