SRE Ansible developer

Realign

Location:
Canada, Toronto

Contract Type:
Not provided

Salary:

155000.00 USD / Year

Requirements:

  • Design and implement automation scripts using Ansible for infrastructure provisioning and configuration management
  • Develop and maintain monitoring solutions leveraging Dynatrace for application and system performance
  • Configure and optimize ITRS monitoring tools to ensure proactive alerting and incident management
  • Collaborate with development and operations teams to improve system reliability and scalability
  • Automate deployment pipelines and integrate with CI/CD processes for faster releases
  • Troubleshoot performance issues and implement solutions to enhance system resilience
  • Ensure compliance with security and operational standards across environments
  • Document automation workflows, monitoring configurations, and best practices for knowledge sharing
  • Total Experience: 6-8 years

Additional Information:

Job Posted:
March 21, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for SRE Ansible developer

Software Engineering Professional

Working in this role you will play a critical part in the operation of the BT Bu...
Location:
United Kingdom, Belfast
Salary:
Not provided
Company:
Plusnet
Expiration Date:
Until further notice
Requirements
  • Programming / scripting experience
  • Understanding of SRE principles and a willingness to grow and develop these new principles with BT Business SRE
  • Understanding of CI/CD pipelines
  • Experience of identifying and automating manual processes using technologies like Ansible
  • Experience of using Application Performance Monitoring tools such as Dynatrace
  • You are organised and like to get things done. The ability to adapt, take risks and embrace change will be a necessity
  • Empathetic and good with people
  • You like working with people and finding solutions together
  • Have an understanding of agile methodologies/frameworks
  • Good communication skills, comfortable with presenting to team members and other wider teams
Job Responsibility
  • Work with colleagues across the various Business SRE teams to design and develop SRE software solutions
  • Be part of a team responsible for the implementation of APM and service monitoring and reporting, with the aim of auto-remediating problems
  • Be part of a team responsible for the development of SRE Tooling and BT Business infrastructure automation using SRE software approaches
  • Be part of a team responsible for the design, build and deployment of AI/ML solutions
  • Support and contribute to BT Business Service Assurance SRE goals
  • Ensure good software engineering practices
  • Produce clear documentation for the Observability and Tooling solutions we develop
  • Support our agile methods and ambition to grow our SRE culture throughout with enthusiasm
  • Contributing to our culture and team’s wellbeing
What we offer
  • Competitive salary
  • 25 days annual leave (plus bank holidays)
  • 10% on target bonus
  • Life Assurance
  • Pension scheme
  • Direct share scheme
  • Option to join the Healthcare Cash Plan or other benefits such as dental insurance, gym memberships etc.
  • 50% off EE mobile pay monthly or SIM only plans
  • Exclusive colleague discounts on our latest and greatest BT broadband packages
  • BT TV with TNT Sports and NOW Entertainment
  • 50% discount for friends and family on EE SIM Only plans and the airtime element of a Flex Pay plan
Employment Type:
Full-time

Sr SRE

Location:
India, Putlibowli
Salary:
Not provided
Company:
Randstad
Expiration Date:
March 30, 2026
Requirements
  • Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, Dynatrace
  • Build and manage CI/CD pipelines
  • Improve infrastructure provisioning and configuration through automation
  • Monitor the health, performance, and reliability of production systems and applications
  • Design, implement, and maintain automated monitoring solutions, using tools such as Datadog
  • Define and monitor service level objectives (SLOs), service level indicators (SLIs), and error budgets
  • Implement effective alerting systems
  • Lead root cause analysis (RCA) and post-mortem investigations
  • Respond to production incidents, diagnose root causes, and implement corrective actions
  • Create and maintain playbooks and documentation for incident response
Job Responsibility
  • Develop and maintain Infrastructure as Code (IaC) using tools like Terraform, Ansible, Dynatrace to automate deployment and management of infrastructure
  • Build and manage CI/CD pipelines to ensure efficient and reliable application deployments
  • Improve infrastructure provisioning and configuration through automation, minimizing manual interventions and reducing human error
  • Monitor the health, performance, and reliability of production systems and applications
  • Design, implement, and maintain automated monitoring solutions, using tools such as Datadog
  • Define and monitor service level objectives (SLOs), service level indicators (SLIs), and error budgets to ensure system reliability and availability meet customer expectations
  • Implement effective alerting systems to identify and address potential issues before they impact users
  • Lead root cause analysis (RCA) and post-mortem investigations after incidents to identify improvements and avoid recurrence
  • Respond to production incidents, diagnose root causes, and implement corrective actions
  • Create and maintain playbooks and documentation for incident response, troubleshooting, and recovery processes
Employment Type:
Full-time

Browser Infrastructure Engineer

Infrastructure Engineer for Browser Development builds reliable, automated, and ...
Location:
Serbia, Belgrade
Salary:
Not provided
Company:
Perplexity
Expiration Date:
Until further notice
Requirements
  • 3+ years in software development infrastructure, preferably Chromium browsers
  • Hands-on DevOps and SRE experience, including monitoring and incident management
  • Proficiency in k8s, Terraform, Datadog, Sentry, AWS, Unix, TeamCity
  • Strong CI/CD implementation skills
  • Ability to thrive in Agile teams with excellent communication
Job Responsibility
  • Set up and maintain CI/CD pipelines for builds and testing (TeamCity, Jenkins, etc.)
  • Support and evolve Chromium browser development infrastructure (k8s, Terraform, Ansible)
  • Configure monitoring and alerting systems (Sentry, Datadog)
  • Manage cloud infrastructure (AWS), Linux servers, and virtual environments
  • Develop automation scripts in Bash, Python, and Go
  • Ensure high availability, resilience, and security of development infrastructure
  • Collaborate with developers to optimize workflows and resolve incidents
What we offer
  • Dynamic team with growth and learning opportunities
Employment Type:
Full-time

Senior+ Site Reliability Engineer

Crusoe is building the most reliable, energy-efficient, AI-optimized cloud platf...
Location:
United States, San Francisco
Salary:
172000.00 - 209000.00 USD / Year
Company:
Crusoe
Expiration Date:
Until further notice
Requirements
  • 5+ years of experience in cloud operations, SRE, or related roles
  • Background working with GPU workloads, high-performance computing, or latency/throughput-sensitive systems
  • Strong knowledge of Unix/Linux systems (kernel/user space) and networking including debugging complex issues in live systems
  • Understanding of cloud platforms and infrastructure fundamentals (Kubernetes, AWS/GCP, virtualization, distributed systems)
  • Familiarity with incident management practices and operational frameworks (SRE/ITIL/etc.)
  • Experience with monitoring and alerting tools (Prometheus, Grafana) or a strong willingness to learn
  • Familiarity with infrastructure-as-code and configuration management tools such as Terraform and Ansible
  • Basic scripting and automation experience (Go, Python, C, C++, or similar)
  • Strong communication skills, with the ability to clearly articulate technical issues to diverse stakeholders
  • Ability to stay calm, focused, and effective in fast-moving or high-pressure situations
Job Responsibility
  • Collaborate with cross-functional teams to define and refine availability metrics for Crusoe’s cloud infrastructure, including establishing, tracking, and improving SLIs and SLOs
  • Assist in incident response by identifying, diagnosing, and resolving service disruptions, and support post-incident processes through RCA documentation and participation in post-incident reviews
  • Build, operate, and monitor infrastructure health using Crusoe’s observability stack (Prometheus, Grafana, Alertmanager, OpenTelemetry)
  • Identify and communicate reliability risks, performance bottlenecks, and early indicators of potential incidents that could impact service availability
  • Develop automation and tooling to reduce operational toil, minimize manual intervention, and enhance service recovery and self-healing capabilities
  • Partner with compute, network, storage, and platform teams to improve service resilience and strengthen disaster recovery readiness
  • Contribute to knowledge sharing, process improvements, and the development of operational best practices across the organization
  • Participate in ongoing training, mentorship, and professional development to grow into advanced SRE responsibilities
What we offer
  • Industry competitive pay
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
Employment Type:
Full-time

Lead Service Reliability Engineer

As Service Reliability Engineer (SRE) in DAMO service line, you will take a mult...
Location:
Singapore, Singapore
Salary:
Not provided
Company:
Thoughtworks
Expiration Date:
Until further notice
Requirements
  • You can program with one or more high-level languages such as Python, Golang, Shell scripting, Ruby or Java
  • You are familiar with DevOps and GitOps practices, driving the integration of observability automation into CI/CD pipelines, e.g., GitLab, Jenkins, CircleCI or equivalent
  • You have in-depth knowledge of configuration management and Infrastructure as Code (IaC) tools such as Terraform, Ansible, ARM and CloudFormation for provisioning and managing infrastructure
  • You have expertise in observability, logs, tracing and monitoring tools such as Grafana (Loki and Tempo), Prometheus, Graylog, Jaeger, Zipkin, ELK stack or equivalent
  • You have a strong understanding of container-based architecture and hands-on experience with orchestration tools such as Kubernetes, AWS EKS, Docker Swarm, Nomad, etc.
  • You have in-depth experience in application and infrastructure performance tuning and scaling to handle heavy loads under different scenarios, e.g., periodic traffic load and tsunami patterns
  • You have a good understanding of essential concepts such as quality gates encompassing SLI/SLO/SLA, chaos engineering, golden signals, blameless postmortem methodologies, synthetic monitoring, distributed tracing, end-user monitoring and performance testing
  • You have experience with network load balancing, security tech stacks, Transport Layer Security (TLS) and certificate management, and an understanding of standard networking protocols and configurations
  • You have strong communication and articulation skills, and are proficient in English
  • You are able to convey resolutions to audiences with varying degrees of technical/business proficiency and bring them to consensus
Job Responsibility
  • You will be responsible for understanding requirements or SRE goals in depth from both tech and business perspectives
  • You will provide solutions to improve reliability, including identifying and implementing mechanisms and architectures that enable fault tolerance and faster median time to respond and median time to detect
  • You will be responsible for enhancing the incident management process, including the development of an incident prioritization matrix, triage, communication, mitigation, post-mortem analysis and implementation of corrective actions
  • You will manage client stakeholder expectations and queries during production incidents, providing detailed technical analysis of issues and remediation plans for mitigation and prevention in future, and act as the interface for C-level executives, if or when needed
  • You will be a liaison with client engineering teams, build trust and productive relationships with senior client stakeholders and team leads to influence them in making better decisions
  • You will be responsible for identifying opportunities for enhancing system performance and reliability in alignment with business SLAs, SLOs, KPIs and objectives, and provide guidance and assistance to SRE teams in implementing the identified improvements
  • As an SRE expert, you will collaborate with Thoughtworks application development leads and solution architects, recommending changes in system design and adopting best practices for improved reliability from day one
  • You will oversee and mentor other SREs on the team, contributing to their growth and development
What we offer
  • There is no one-size-fits-all career path
  • Your career is supported by interactive tools, numerous development programs and teammates who want to help you grow
Employment Type:
Full-time

Systems Engineer (SRE III) - Email Infrastructure & Automation

Groupon is looking for a Systems Engineer (SRE III) to join the Marketing Engine...
Location:
India, Bangalore
Salary:
Not provided
Company:
Groupon
Expiration Date:
Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
  • 3+ years of hands-on experience in systems engineering, infrastructure, or DevOps roles
  • Strong experience configuring and operating MTAs and SMTP-based systems
  • Solid understanding of SMTP, DNS, TLS certificates, and email deliverability best practices
  • Practical experience with Ansible and Terraform for infrastructure automation
  • Strong experience with Linux systems and shell scripting
  • Hands-on experience with CI/CD pipelines, especially Jenkins, for build and deployment workflows
  • Experience in troubleshooting email delivery, bouncing, blocking, and security-related issues
  • Experience working with cloud platforms such as AWS, GCP, or Azure, including end-to-end migrations
  • Good understanding of cloud networking, including VPCs, subnets, routing, BYOIP processes, ARIN IP management, and IP propagation to ISPs
Job Responsibility
  • Design, configure, and operate Mail Transfer Agents (MTAs) such as Postfix, Sendmail, or Exim, along with SMTP servers
  • Build and maintain email notification systems used by multiple services and teams
  • Develop and manage Infrastructure as Code (IaC) using Terraform and Ansible
  • Automate system provisioning, configuration, and deployment using CI/CD pipelines, primarily with Jenkins
  • Ensure high availability, performance, and security of email and system infrastructure
  • Monitor email delivery and system health to consistently meet a 99.5% SLA
  • Troubleshoot email delivery issues, including bounces, blocks, throttling, and reputation-related problems
  • Implement and maintain email security best practices, including DNS, SPF, DKIM, and DMARC
  • Maintain clear documentation and runbooks for infrastructure and operational processes
  • Participate in on-call rotations and support incident response when required

DevOps Engineer

Platform DevOps and in turn the Run teams are responsible for the performance, s...
Location:
United Kingdom, Prudhoe, Hessle, Nelson
Salary:
Not provided
Company:
Giacom
Expiration Date:
Until further notice
Requirements
  • Proven Experience: Hands-on experience in a DevOps, SRE, or similar role
  • Cloud Expertise: Strong proficiency with at least one major cloud provider (AWS, Azure, or GCP)
  • Container Technologies: In-depth knowledge of container orchestration
  • CI/CD Tools: Demonstrable experience with CI/CD tools like Jenkins, GitHub Actions, or Azure DevOps
  • Infrastructure as Code (IaC): Expertise in using tools like Terraform or Ansible
  • Scripting Skills: Proficiency in a scripting language such as Python or Bash
  • Networking Knowledge: Solid understanding of networking principles (TCP/IP, DNS, HTTP/S, Firewalls) is essential for a telco environment
Job Responsibility
  • CI/CD Pipeline Management: Design, implement, and manage our continuous integration and continuous delivery (CI/CD) pipelines for our Marketplace and Software Tools platforms, enabling rapid and reliable software releases
  • Infrastructure as Code (IaC): Develop and maintain our cloud and on-premise infrastructure using IaC principles with tools like Terraform and Ansible
  • Containerization & Orchestration: Manage and scale our containerized applications, ensuring high availability and efficient resource utilization in a multi-tenant environment
  • Automation & Scripting: Automate manual processes related to deployment, monitoring, and operations using scripting languages such as Python, Bash, or Go
  • Monitoring & Logging: Implement and manage robust monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK Stack) to proactively identify and resolve system issues
  • Collaboration: Work closely with software developers, network engineers, and product managers to troubleshoot issues, optimize performance, and ensure our platforms meet the stringent requirements of our telco/MSP clients
  • Security: Integrate security best practices (DevSecOps) into the development lifecycle, including vulnerability scanning, static code analysis, and compliance checks
What we offer
  • Hybrid working
  • No dress code
  • 25 days annual leave, plus bank holidays
  • Birthday off
  • A pension plan for your future
  • Complimentary refreshments in all our offices
Employment Type:
Full-time

Senior Service Reliability Engineer

As a Service Reliability Engineer (SRE) in DAMO service line, you will take a mu...
Location:
Singapore, Singapore
Salary:
Not provided
Company:
Thoughtworks
Expiration Date:
Until further notice
Requirements
  • You have expertise in Ansible orchestration including advanced strategies, failure logic handling, and Jinja2 templating
  • You have the ability to integrate Terraform with Ansible for seamless provisioning-to-configuration workflows
  • You have hands-on experience with Python, Go, Bash or PowerShell scripting
  • You have working knowledge of at least one public cloud (AWS/Azure/GCP)
  • You have experience with observability tools (Grafana, Datadog, NewRelic, ELK, Dynatrace, etc.) and can use data for RCA
  • You have familiarity with DevOps, SRE and GitOps concepts and practices
  • You have knowledge of container technologies and orchestration (Kubernetes, EKS, Docker Swarm, Nomad, etc.)
  • You have an understanding of modern architecture (microservices, serverless, NoSQL, REST APIs) and experience debugging and building metrics/dashboards
  • You have experience designing infrastructure aligned with Cloud Well-Architected principles (reliability, security, cost, performance, operations)
  • You are able to mentor team members through workshops and knowledge enablement
Job Responsibility
  • You will conduct SRE and Disaster Recovery (DR) maturity assessments
  • You will engineer automation solutions using Ansible to replace manual workflows
  • You will own and manage the current manual Disaster Recovery process/pipeline
  • You will improve site reliability through mechanisms and architectures that enhance fault tolerance and reduce MTTR/MTTD
  • You will drive the integration of observability automation into the CI/CD pipeline
  • You will handle production incidents, lead client communication, and create root cause analysis documentation
  • You will monitor performance of production systems and improve scaling to meet SLA and SLO targets
  • You will work closely with application development teams to advise and implement reliability improvements
  • You will improve system observability across logging, metrics and alerting, reducing false alarms to eliminate unnecessary toil and improving overall process efficiency, while implementing chaos engineering practices to regularly validate system reliability
  • You have a clear understanding of client goals and business needs, setting direction for site reliability in alignment with business expectations - including high availability targets such as 99.999% with minimal/no disruption where required.
What we offer
  • Learning & Development: There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.
Employment Type:
Full-time