SRE Ansible developer Job at Realign (Toronto)

SRE Developer

We are looking for a proactive SRE Developer with 3–5 years of experience to man...

Location

India, Bangalore South

Salary:

Not provided

Wissen

Expiration Date

Until further notice

Requirements

Strong hands-on experience in SRE or DevOps operations
Expertise in CI/CD tools such as GitHub Actions, GitLab CI, Jenkins, or Azure DevOps
Experience with monitoring and observability tools (Grafana, Prometheus, ELK, Splunk, Datadog, New Relic, etc.)
Good understanding of cloud platforms (AWS, Azure, or GCP)
Practical experience using AI tools in daily engineering workflows (CursorAI, ChatGPT, GenAI tools, automation assistants)
Ability to identify repetitive operational tasks and automate using AI or scripts
Familiarity with AI-driven troubleshooting and documentation
Proficiency in Python, Bash, PowerShell, or similar scripting languages
Exposure to Infrastructure as Code (IaC) tools such as Terraform, CloudFormation, ARM, or Ansible

Job Responsibility

Handle SRE business-as-usual (BAU) operations, including incident management, root cause analysis, problem resolution, and service restoration
Manage and maintain CI/CD pipelines and deployment automation across environments
Improve system reliability, scalability, and performance through automation and proactive monitoring
Implement and manage observability solutions including logging, metrics, alerting, and dashboards
Utilize AI tools (CursorAI, Generative AI, automation copilots) for faster troubleshooting, documentation, code generation, and incident analysis
Collaborate with engineering, product, and security teams to ensure smooth releases and secure infrastructure
Reduce manual operational effort through AI-assisted automation and scripting
Drive DevOps best practices and continuous improvement initiatives

Full-time

Python Developer - Site Reliability Engineering (SRE)

We are seeking a skilled Python Developer with experience in the Site Reliabilit...

Location

Canada, Montreal

Salary:

Not provided

NTT DATA

Expiration Date

Until further notice

Requirements

3+ years of experience with Python development
6 years of experience working with Infrastructure as Code (Terraform and Ansible)
Experience with CI/CD pipelines, preferably GitHub Actions and Jenkins
Strong understanding of object-oriented design and development principles
Proficiency in Linux/Unix environments
Experience working with database technologies (preferably NoSQL), including data modeling, testing, and performance tuning
Ability to write reusable, optimized, maintainable, and well‑documented code following industry best practices
Experience implementing open-source monitoring and observability tools such as Prometheus, Grafana, Splunk, or OpenTelemetry
Strong problem‑solving skills and ability to take ownership of tasks and drive them independently to closure
Understanding of networking concepts (TCP/IP, DNS, Load Balancing)

Job Responsibility

Develop high-quality software that works with public cloud service provider (CSP) infrastructure across different public cloud areas
Develop, enhance, and integrate automation workflows for public cloud service providers (CSPs), initially focused on Azure, and integrate them with in-house tooling
Integrate automation workflows into CI/CD pipelines using GitHub Actions and Jenkins
Build proof-of-concept solutions in new areas of cloud and automation development
Provide technical support and debugging for application failures in both on-premises and cloud environments
Participate in all phases of the Software Development Life Cycle (SDLC), including analysis, design, coding, testing, and deployment
Evaluate, onboard, and implement emerging DevOps and automation tools to improve efficiency
Build and integrate observability into cloud platforms and solutions using open-source tools (Prometheus, Grafana, OpenTelemetry)
Identify, highlight, and reduce operational toil through automation, architectural improvements, and process optimization
Collaborate with global teams to understand requirements, develop high‑quality code, and deliver cloud-focused projects

Technical Architect

Lead the design, modernization, and implementation of scalable, secure, and resi...

Location

United States, Armonk

Salary:

247319.00 - 250000.00 USD / Year

The New York Times

Expiration Date

Until further notice

Requirements

Bachelor's degree or equivalent in Computer Science, Information Technology, Engineering or related and five (5) years of experience as a Consultant Architect, Virtualization Architect, Senior Cloud Architect or related
Five (5) years of experience must include utilizing Hybrid Cloud, AWS, Azure, Red Hat Linux, Terraform, Ansible, Python, VMware Cloud Foundation (VCF) Stack

Job Responsibility

Lead the design, modernization, and implementation of scalable, secure, and resilient hybrid cloud and containerized infrastructure platforms
Define and lead the technical architecture strategy for hybrid cloud, container orchestration (Kubernetes, Red Hat OpenShift, VMware Tanzu), and virtualized environments (VMware, Nutanix, Red Hat)
Architect secure and scalable infrastructure across private, public, and hybrid cloud ecosystems
Evaluate, design, and implement solutions for computing, storage, networking, identity, and availability zones across global regions
Design and implement Kubernetes and Red Hat OpenShift clusters across multi-cloud and on-prem environments, including CI/CD integration, policy enforcement, and workload orchestration
Define governance, observability, and security patterns for containerized workloads
Lead Infrastructure-as-Code (IaC) initiatives using Terraform, Ansible, GitOps, GitHub, PowerShell, and Python
Enable self-service infrastructure capabilities through automation frameworks and developer platforms
Partner with DevSecOps, SRE, Infrastructure Operations, Security, and Datacenter Operations teams to scope, define, size, and execute application onboarding, modernization, and consolidation initiatives
Mentor engineering teams and influence enterprise architecture (EA) roadmaps

Full-time


DevOps & Infrastructure Support Engineer

Your opportunity: At Schwab, you’re empowered to make an impact on your career. ...

Location

United States, Austin

Salary:

57.21 - 67.79 USD / Hour

Charles Schwab

Expiration Date

June 20, 2026

Requirements

5+ years in production support, reliability engineering, or platform operations within an enterprise environment
Hands-on experience supporting business-critical systems with high uptime requirements
Experience with Java and/or .NET application stacks
Experience with WebSphere, IIS, and enterprise middleware
Strong Linux and Windows operational experience
Solid administration skills (RHEL, CentOS, or Ubuntu) with a strong grasp of file systems and permissions
Scripting: Strong hands-on experience writing Bash scripts
Development experience with PowerShell, Python, Bash, Java
Familiarity with SQL and NoSQL databases and messaging platforms (RabbitMQ, IBM MQ)
Log Analysis: Proven ability to troubleshoot complex issues using system and application logs

Job Responsibility

Systems Administration: Maintain enterprise Linux environments, focusing specifically on file systems, permissions, and system configurations
Troubleshooting: Thoroughly analyze system and application logs to diagnose and resolve complex issues across multiple environments
Own production stability, availability, and performance for a portfolio of Java, .NET, batch jobs, and web-based applications running on Linux, Windows, on-prem, and PCF
Automation & CI/CD: Write and maintain Bash scripts to automate routine operational tasks, and support continuous integration and deployment (CI/CD) pipelines
Configuration Management: Utilize configuration management tools to streamline, automate, and standardize environments
Application Support: Support Java applications, WebSphere Application Server, and manage workloads in Cloud Foundry environments
Observability: Use Splunk and Grafana for log aggregation, creating dashboards, proactive monitoring, and alerting
Networking Integration: Configure and troubleshoot core networking components necessary for application delivery, including DNS, firewall rules, and load balancer routing
Automation & Toil Reduction: Design and build automation (scripts, tooling, frameworks) to eliminate repetitive operational tasks
Improve self-service diagnostics, alert hygiene, and recovery automation

What we offer

401(k) with company match and employee stock purchase plan
Paid time off for vacation and volunteering, and a 28-day sabbatical after every 5 years of service for eligible positions
Paid parental leave and family building benefits
Tuition reimbursement
Health, dental, and vision insurance
Bonus or incentive opportunities

Full-time


Senior Staff Engineer Software (Cloud Platform, Production & Reliability – Machine Identity Security)

The Production Engineering team is responsible for building, scaling, and operat...

Location

United States, Santa Clara

Salary:

126000.00 - 203500.00 USD / Year

Palo Alto Networks

Expiration Date

Until further notice

Requirements

5+ years of experience in DevOps, Platform Engineering, or Site Reliability Engineering (SRE)
Strong experience designing and operating cloud infrastructure on AWS, Azure, or GCP
Deep expertise managing and scaling Kubernetes environments (EKS, AKS, or GKE)
Strong experience with Infrastructure as Code tools (Terraform, Ansible, or Pulumi)
Proven experience designing and maintaining complex CI/CD systems (Jenkins, GitLab CI, ArgoCD, GitHub Actions)
Strong programming/scripting skills (Python, Go, or similar) for automation and tooling
Experience operating in high-scale, 24/7 production environments with ownership of incident response and reliability
Solid understanding of Linux systems and networking fundamentals (DNS, TCP/IP, load balancing, VPC, mTLS)
Strong problem-solving skills and ability to work across teams

Job Responsibility

Design, build, and evolve highly available cloud infrastructure platforms with a focus on scalability, resilience, and reliability
Lead improvements across production systems, including performance, availability, and incident response
Drive and standardize Infrastructure as Code (IaC) practices to improve consistency and reduce operational overhead
Design and optimize CI/CD pipelines to support fast, secure, and reliable software delivery at scale
Partner with development teams to improve system reliability, observability, and cloud-native design patterns
Define and implement monitoring, alerting, and observability strategies across distributed systems
Lead incident response efforts, including root cause analysis and long-term remediation strategies
Identify and eliminate operational toil through automation and system improvements
Mentor engineers and contribute to raising the bar for production engineering practices

What we offer

Restricted stock units
Bonus

Full-time


Sr Principal Site Reliability Engineer (Sovereign Cloud)

Palo Alto Networks runs a large infrastructure and is one of the largest GCP cus...

Location

Bulgaria, Sofia

Salary:

Not provided

Palo Alto Networks

Expiration Date

Until further notice

Requirements

10+ years as an engineer in Infrastructure, Operations, DevOps, or System Engineering
7+ years building high availability, scalable cloud-native applications on AWS and GCP
BS or MS in Computer Science, a related field, or equivalent professional experience required
Expertise in configuration management with a framework such as Ansible, Terraform, or Helm
Passion for infrastructure and monitoring as code
Solid experience in container workloads and Kubernetes
Familiarity with PKI and networking concepts
In-depth knowledge of different security controls (App-ID, User-ID, security profiles, URL categories, content, SSL decryption, firewall MFA, etc.)
Linux administration, internals, and network troubleshooting
Proficiency with programming languages like Golang or Python along with shell scripting to automate tasks

Job Responsibility

Contribute to the success of SRE and DevOps
Develop expertise in new technologies
Work with developers, researchers, data scientists, and security experts
Design, build and operate reliable, secure Cloud infrastructure
Ensure that applications are production-ready, scalable, and reliable
Develop tools and automation frameworks
Automate robust deployment of services
Orchestrate end-to-end monitoring and alerting
Participate in on-call rotations to support critical business and production systems
Lead root cause analysis of critical business and production issues

Full-time


Senior Ansible Automation & Platform Engineer

The Senior Ansible Automation & Platform Engineer is a strategic member of the o...

Location

United States, Austin; Mountain View; Warren

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

7–12+ years in Architecture, DevOps, SRE, Platform Engineering, or Infrastructure Engineering
Expert-level proficiency with Ansible (playbooks, roles, collections, Jinja2, modules)
Hands-on experience designing and operating Ansible Automation Platform (AAP)
Strong experience with Terraform, Chef, or other IaC tools
Deep Linux engineering background and configuration management expertise
Expertise in integrating automation with ServiceNow (CMDB, ITSM, workflows)
Exceptional scripting skills (Python, Bash, PowerShell)
Experience with AWS/Azure/GCP automation
Experience with Kubernetes, containerization, and orchestration
Experience with CI/CD pipelines (GitHub Actions, GitLab, Jenkins, Azure DevOps)

Job Responsibility

Architect, design, and operate the Ansible Automation Platform (AAP) including controller, execution environments, mesh architecture, and collections strategy
Define and maintain the Ansible Platform roadmap, including feature evolution, lifecycle management, scalability planning, and enterprise adoption milestones
Establish platform governance: coding standards, role/playbook patterns, collections, testing frameworks, and security guardrails
Build and maintain Execution Environments (EEs) optimized for performance, security, and dependency management
Lead platform upgrades, migrations, and cross-environment standardization
Design enterprise-grade Ansible automation frameworks with reusable roles, collections, and modular playbooks
Build automation for provisioning, configuration management, patching, compliance, and cloud infrastructure
Integrate Ansible with Terraform, CI/CD pipelines, GitOps workflows, and event-driven automation systems
Implement self-service automation capabilities for developers, operations, and business teams
Integrate agentic AI systems to enhance automation workflows, including AI-driven playbook generation and validation, automated remediation recommendations, intelligent change-impact analysis, and AI-assisted troubleshooting and root-cause analysis

What we offer

Relocation benefits (may be eligible)

Full-time

Senior Site Reliability Engineer, Wikimedia Enterprise

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to jo...

Location

United States

Salary:

116633.00 - 181243.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

Automation & Configuration Management: Experience with Infrastructure as Code and automation tools (e.g., Terraform, Ansible) and proficiency in at least one programming language (e.g., Python, Go, or similar)
Cloud Infrastructure: Experience designing, operating, and optimizing cloud-based systems across platforms such as AWS, Azure, or GCP, including scalability, reliability, and cost efficiency
CI/CD & Deployment Practices: Experience building and maintaining CI/CD pipelines and GitOps workflows (e.g., GitLab or similar, ArgoCD), with familiarity with progressive delivery approaches such as canary and blue-green deployments
Incident Management & Reliability Operations: Experience with incident response, on-call practices, and leading postmortems, with a focus on continuous improvement and operational excellence
SRE Principles & Observability: Strong understanding of SRE best practices, including SLOs, SLIs, and error budgets, along with experience in observability (metrics, logging, and distributed tracing, e.g., Prometheus, OpenTelemetry)
Collaboration & Communication: Ability to work effectively in a distributed, cross-functional environment, with strong documentation and communication skills
Proven experience operating highly available, large-scale distributed systems, with a deep understanding of reliability, scalability, and failure modes
Ownership mindset: Takes end-to-end responsibility for system reliability, proactively identifying and addressing risks before they impact users
Bias for automation: Continuously seeks to reduce operational toil through automation and scalable solutions
Continuous improvement mindset: Actively learns from incidents and drives improvements through blameless postmortems and iterative enhancements

Job Responsibility

Define, track, and improve Service Level Objectives (SLOs), SLIs, and error budgets to ensure reliability targets are met
Build and enhance observability systems (metrics, logs, and distributed tracing) to enable proactive detection and faster troubleshooting
Drive reliability engineering practices, including capacity planning, load testing, and resilience validation (e.g., chaos testing)
Improve developer experience (DevEx) by enabling self-service infrastructure and streamlining deployment workflows
Partner with engineering team members to embed reliability best practices early in the development lifecycle
Design, implement, and optimize CI/CD and GitOps workflows using tools such as GitLab (or similar) and ArgoCD (or similar), enabling automated, reliable deployments with support for progressive delivery strategies like canary and blue-green releases
Implement secure-by-default infrastructure and enforce best practices (e.g., IAM, secrets management, encryption)
Continuously optimize infrastructure cost and efficiency using FinOps principles while maintaining performance and availability
Establish and track operational metrics such as MTTR, MTTD, and incident frequency to drive continuous improvement
Reduce operational toil by identifying repetitive work and implementing automation-first solutions

Full-time
