Python Developer - Site Reliability Engineering (SRE)

NTT DATA

Location:
Canada, Montreal

Contract Type:
Not provided

Salary:

Not provided

Job Description:

We are seeking a skilled Python Developer with experience in the Site Reliability Engineering (SRE) domain to build automation tools, improve system reliability, and support scalable infrastructure.

Job Responsibility:

  • Develop quality software that works with public cloud service provider (CSP) infrastructure across different public cloud areas
  • Develop, enhance, and integrate automation workflows for Public Cloud Service Providers (CSP), initially focused on Azure, and integrate with in-house tooling
  • Integrate automation workflows into CI/CD pipelines using GitHub Actions and Jenkins
  • Build proof-of-concept solutions in new areas of cloud and automation development
  • Provide technical support and debugging for application failures in both on-premises and cloud environments
  • Participate in all phases of the Software Development Life Cycle (SDLC), including analysis, design, coding, testing, and deployment
  • Evaluate, onboard, and implement emerging DevOps and automation tools to improve efficiency
  • Build and integrate observability into cloud platforms and solutions using open-source tools (Prometheus, Grafana, OpenTelemetry)
  • Identify, highlight, and reduce operational toil through automation, architectural improvements, and process optimization
  • Collaborate with global teams to understand requirements, develop high‑quality code, and deliver cloud-focused projects

Requirements:

  • 3+ years of experience with Python development
  • 6 years of experience working with Infrastructure as Code (Terraform and Ansible)
  • Experience with CI/CD pipelines, preferably GitHub Actions and Jenkins
  • Strong understanding of object-oriented design and development principles
  • Proficiency in Linux/Unix environments
  • Experience working with database technologies (preferably NoSQL), including data modeling, testing, and performance tuning
  • Ability to write reusable, optimized, maintainable, and well‑documented code following industry best practices
  • Experience implementing open-source monitoring and observability tools such as Prometheus, Grafana, Splunk, or OpenTelemetry
  • Strong problem‑solving skills and ability to take ownership of tasks and drive them independently to closure
  • Understanding of networking concepts (TCP/IP, DNS, Load Balancing)

Nice to have:

  • Experience building cloud automation specifically for Azure
  • Experience evaluating new DevOps tools or contributing to internal automation frameworks
  • Exposure to multi-cloud environments or additional CSPs (AWS, GCP)
  • Familiarity with containerization or orchestration (Docker, Kubernetes)
  • Experience with high-scale systems or fintech platforms
  • Exposure to security and compliance practices
  • Knowledge of performance optimization and capacity planning

Additional Information:

Job Posted:
March 24, 2026

Similar Jobs for Python Developer - Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE)

Fyld is a Portuguese consulting company specializing in IT services. We bring hi...
Location
Portugal, Lisboa; Porto
Salary:
Not provided
Fyld
Expiration Date
Until further notice
Requirements
  • Degree in Computer Science, Information Technology, Engineering, or a related
  • Previous experience working as an SRE or in a similar role within DevOps, system administration, or software engineering
  • Familiarity with industry-specific applications and regulatory requirements (e.g., HIPAA, GDPR)
  • Proficiency in system administration for Linux/Unix and Windows systems
  • Strong understanding of networking concepts, including TCP/IP, DNS, load balancing, and firewalls
  • Proficiency in programming languages such as Python, Go, Java, or C++
  • Strong skills in scripting languages like Bash, Perl, or Ruby
  • Experience with automation tools such as Ansible, Puppet, Chef, or Terraform
  • Knowledge of Infrastructure as Code (IaC) principles and practices
  • Experience with monitoring and logging tools such as Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana), or Splunk
  • Fulltime

Site Reliability Engineer

As a Site Reliability Engineer (SRE), you will actively work to improve the perf...
Location
United States
Salary:
116700.00 - 187400.00 USD / Year
Atlassian
Expiration Date
Until further notice
Requirements
  • Strong coding/scripting experience
  • Serious troubleshooting skills across different levels of the stack
  • Maintaining a high standard of code quality
  • Understanding of Linux systems
  • Experience configuring and managing enterprise monitoring/metrics/logging solutions
  • Building, automating, and maintaining infrastructure in Amazon Web Services with infrastructure as code
  • Maintaining and troubleshooting continuous integration / continuous delivery pipelines in support of development teams
Job Responsibility
  • Improve the performance and reliability of Atlassian Analytics and our Analytics Visualization Platform
  • Expand our system to handle new system capabilities
  • Scale to support growing usage by customers and adoption in new Atlassian products
  • Address root causes of incidents and reduce incident rates
  • Serve in an on-call weekly rotation to make sure our products meet established SLO targets
What we offer
  • Health coverage
  • Paid volunteer days
  • Wellness resources
  • Fulltime

Staff Site Reliability Engineer

At Ledger, we are looking for an experienced Reliability Engineer to join our SR...
Location
France, Paris
Salary:
Not provided
Ledger
Expiration Date
Until further notice
Requirements
  • 8+ years in cloud engineering at scale, in organizations operating SaaS solutions
  • Proficiency in working in Unix/Linux environments, Git, Python, Terraform, Kubernetes, AWS cloud solutions and architectures, CI/CD tools, Argo CD, Ansible, configuration management, etc.
  • Strong knowledge of observability practices, with experience implementing and managing logging, monitoring, and alerting frameworks with solutions such as Datadog or Prometheus/Grafana/Loki
  • Experience of cross-functional work and a demonstrated collaborative approach to building key relationships across the organization and defining project scope, goals, plans, and deliverables
  • Customer focused, with the ability to identify and understand the needs of both internal and external customers
  • Creative problem-solving and analysis skills with an ability to identify, develop, and implement solutions to meet the needs of the business
  • Excellent presentation and written communication
  • Ability to deal with ambiguity, high level of pressure and rapidly changing environments
  • Engineering degree.
Job Responsibility
  • Participate in building a DevOps / SRE culture and enable the transition to modern infrastructure management and deployment practices
  • Participate in building the SRE team roadmap (vision and delivery accountability). Anticipate stakeholder needs and the emergence of game-changing technologies, and challenge scope and deadlines
  • Perform integration of platform software components
  • Participate in designing and delivering solutions to improve the availability, scalability, latency, and efficiency of systems
  • Influence and create standards & best practices in support of service level objectives
  • Automate key SRE metrics including SLOs/SLAs and error budgets
  • Provide expert support to our level-2/application support team, to troubleshoot priority incidents, and conduct post-mortems
  • Apply analytics on past incidents and usage patterns to predict issues and take proactive actions
  • Ensure control of technical debt and promote quality practices
  • Follow SRE and chaos engineering approaches across all strategic systems, in coordination with Service Design, to predict and prevent outages and improve solution availability
What we offer
  • Equity: Employees are the foundation of our success, and we award stock options so you can share in that success as we grow
  • Flexibility: A hybrid work policy
  • Social: Annual company outing for Ledgerdary Days, plus frequent social events, snacks and drinks
  • Medical: Comprehensive health insurance policy offering extensive medical, dental and vision care coverage
  • Well-being: Personal development, coaching & fitness with our dedicated partners
  • Vacation: Five weeks of paid leave per year, in addition to national holidays and rest & relaxation (RTT) days
  • High tech: Access to high performance office equipment and gadgets, including Apple products
  • Transport: Ledger reimburses part of your preferred means of transportation
  • Discounts: Employee discount on all our products.
  • Fulltime

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...
Location
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • 7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Minimum 2 years of experience managing or leading cloud operations teams
  • Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
  • Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
  • Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
  • Familiarity with modern CI/CD automation and tools
  • Excellent communication, stakeholder management, and team-building skills
  • Experience scaling SRE practices in high-growth or large-scale environments
  • Ability to balance long-term reliability initiatives with short-term delivery needs.
Job Responsibility
  • Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
  • Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
  • Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
  • Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
  • Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
  • Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
  • Define and track key reliability metrics, and report on team performance and system health to leadership
  • Contribute to hiring, onboarding, and career development for SREs.
What we offer
  • Health & Wellbeing benefits for physical, financial, and emotional wellbeing
  • Personal & Professional Development programs
  • Unconditional inclusion in the workplace.
  • Fulltime

Site Reliability Engineer

This role involves enabling SRE support and monitoring for HPE Networking SASE p...
Location
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • Bachelor's or Master’s degree in Computer Science, Information Systems, or equivalent
  • 4-7 years of overall experience in DevOps or SRE
  • 5+ years programming experience in Python is a must
  • 5+ years of experience in developing Cloud native applications using Kubernetes, Helm, or Docker container environments is a must
  • Expertise in automation and CI-CD pipeline tools like Terraform, Ansible, Jenkins, and/or Git is a must
  • Expertise in monitoring tools like Grafana, Datadog, or Prometheus is a must
  • Experience in developing, deploying, and maintaining applications for Public Cloud environments (AWS, Azure, GCP, etc)
  • Knowledge of networking protocols and concepts such as routing, TCP/IP, BGP, OSPF/ISIS, NetFlow, SNMP, and Internet Traffic Engineering techniques
  • Good communication skills, written and verbal, along with ability to communicate complex procedures
  • A desire to constantly grow and learn new skills
Job Responsibility
  • Enable SRE support and monitoring for HPE Networking SASE products to ensure that applications are running as per their requirements
  • Create strategies to detect issues, address those issues, and design systems to troubleshoot automatically using tools like Prometheus, Grafana, or Datadog
  • Ensure high availability and performance of cloud-based applications and services
  • Design, implement, and maintain scalable infrastructure using Infrastructure as Code (IaC) tools such as Terraform or CloudFormation
  • Collaborate with development teams to improve application performance and reliability from design through production
  • Gain insights from the data fetched from monitoring tools to enhance the product's performance
  • Drive automation for deployment, monitoring, scaling, and incident response
  • Manage and optimize Kubernetes clusters and containerized applications
  • Define and implement SLOs/SLIs and continuously improve observability and monitoring practices
  • Lead and participate in incident management and root cause analysis to prevent recurrence
What we offer
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Specific programs catered to helping you reach career goals
  • Flexibility to manage work and personal needs
  • Inclusive environment that celebrates individual uniqueness
  • Fulltime

Senior Site Reliability Engineer

AutoRABIT is the leader in DevSecOps for SaaS platforms such as Salesforce. Its ...
Location
India, Hyderabad
Salary:
25.00 - 30.00 INR / Year
AutoRABIT
Expiration Date
Until further notice
Requirements
  • 6+ years of experience in SRE, DevOps, or related roles
  • Solid hands-on experience with AWS services (EKS, ECS, EC2, RDS, S3, Redis, etc.)
  • Proficient in writing Terraform infrastructure scripts
  • Strong scripting skills in Python using Boto3
  • Deep understanding of monitoring/logging tools (ELK, CloudWatch, TrendMicro)
  • Experience building and managing CI/CD pipelines (CodeBuild, CodePipeline)
  • Knowledge of infrastructure security and incident response practices
  • Willing to work in rotational shifts and rotational week-offs
  • Bachelor’s degree in computing or a related field
  • AWS certification is preferred
Job Responsibility
  • Provision and manage AWS infrastructure using Terraform
  • Write AWS Lambda functions (Python3 + Boto3) to automate operational tasks
  • Set up monitoring, logging, and alerting with ELK, TrendMicro, and AWS CloudWatch
  • Configure alerts for performance and security anomalies
  • Develop and maintain CI/CD pipelines using AWS CodeBuild and CodePipeline
  • Troubleshoot production issues and contribute to blameless postmortems
  • Contribute to system hardening and security compliance efforts
  • Responsibility to adhere to set internal controls
  • Fulltime

Software Engineer, Site Reliability

As a Site Reliability Engineer (SRE) at Fireworks AI, you will play a critical r...
Location
United States, San Mateo
Salary:
Not provided
Fireworks AI
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
  • 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems
  • Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems
  • Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services
  • Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
  • Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, ELK stack, and distributed tracing
  • Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development
  • In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging
  • Proven ability to troubleshoot complex issues across the entire stack
  • Excellent communication, collaboration, and problem-solving skills
Job Responsibility
  • Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure
  • Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability
  • Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance
  • Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management
  • Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization
  • Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence
  • On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts
  • Fulltime

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...
Location
India, Bangalore
Salary:
Not provided
Groupon
Expiration Date
Until further notice
Requirements
  • 10+ years in systems engineering
  • 5+ years in SRE or DevOps roles
  • expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
  • proficiency in programming and scripting languages like Python, Go, and Bash
  • advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
  • deep understanding of networking, DNS, load balancing, and security principles
  • proven track record of managing high-availability systems in demanding environments
  • exceptional analytical and problem-solving skills
Job Responsibility
  • Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
  • drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
  • create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
  • build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
  • collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
  • lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
  • design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
  • proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
  • mentor junior engineers, fostering a collaborative and growth-oriented team environment
  • guide architectural decisions that drive innovation and enhance system reliability
What we offer
  • The opportunity to work with cutting-edge technologies in a transformative environment
  • a collaborative and innovative work environment that values your expertise and contributions
  • professional growth and leadership development pathways tailored to your aspirations
  • a chance to leave a lasting impact by shaping the future of reliable and scalable systems