Site Reliability & Infrastructure Automation Engineer Job at Piper Companies (Durham)

Intermediate Site Reliability Engineer (SRE) – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Integrate APIs for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Fulltime

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...

Location

United States

Salary:

113082.00 - 175725.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

6+ years of experience in an SRE/Operations/DevOps role as part of a team
Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
Experience designing and managing infrastructure security for large fleets of diverse services
Experience with technical response during security incidents
Experience with package management on Linux systems (we use Debian)
Strong Linux system-level troubleshooting skills
History of automating tasks and processes, identifying process gaps, and finding automation opportunities
Strong English language skills (verbal and written) and ability to work both independently and as an effective part of a globally distributed team working across multiple time zones

Job Responsibility

Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure (deployment, maintenance, configuration, troubleshooting)
Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
Leading continuous improvement by automating the installation, configuration and maintenance of services on our platform
Working closely with product teams to help them bring scalable functionality to our users, assisting in the architectural design of new services and making them operate at scale
Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
Collaborating with a global, cross-functional team in an asynchronous communication environment
Mentoring peers in your areas of technical and operational strength
Ability and willingness to travel 1-2 times a year for in-person events and team meetings
Most importantly, share our values and work in accordance with them

Fulltime

Site Reliability Engineer - Infrastructure

We are actively looking for a talented Site Reliability Engineer to join the Inf...

Location

United States, San Mateo

Salary:

130000.00 - 280000.00 USD / Year

Verkada

Expiration Date

Until further notice

Requirements

Must have a BS, MS, or PhD in Computer Science, or similar technical field of study
Minimum of 1-2+ years of experience in a similar position
Experience in at least one scripting language (preferably Python)
Experience with one of the major cloud platforms (preferably AWS)
Experience with Kubernetes
Experience with Terraform
Enthusiasm for learning about new technologies and tooling

Job Responsibility

Keep our infrastructure up!
Improve infrastructure automation
Define infrastructure roadmap
Provide technical support for engineers on other teams

What we offer

Generous company paid medical, dental & vision insurance coverage
Unlimited paid time off & 11 companywide paid holidays
Wellness allowance
Commuter benefits
Healthy lunches and dinners provided daily
Generous paid parental leave policy & fertility benefits

Fulltime

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare!...

Location

France, Paris

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

At least 5 years of site reliability engineering experience
Experience with AWS, Terraform, Kubernetes, and GitHub Actions, supporting the deployment of applications developed on the JVM and/or in TypeScript
Proactive, curious, collaborative and eager to learn
Proven experience with cloud services such as AWS, Azure or Google Cloud
Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
Proficiency in at least one programming language (Go, Java, Ruby, Python etc.) and a deep understanding of infrastructure as code principles

Job Responsibility

Collaborating with Feature teams to ensure services align with developer needs
Driving improvements by evaluating new technologies and processes
Defining best practices (golden paths) for software development and deployment
Developing and maintaining tools and services that facilitate implementation of best practices
Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
Collaborating on roadmap delivery

What we offer

Free Health Insurance for you
Up to 14 days of RTT
A flexible workplace policy offering both hybrid and office-based modes
Flexibility days allowing you to work from EU countries and the UK 10 days per year
Wellbeing program with free mental health and coaching through moka.care
Special support package for caregivers and workers with disabilities
Lunch voucher with Swile card
Work Council subsidy for sport club membership or creative activities
Bicycle subsidy
Public transportation reimbursement

Fulltime

Senior Site Reliability Engineer - Automation Platform

Join a team of passionate and hardworking entrepreneurs to transform healthcare....

Location

Germany, Berlin

Salary:

Not provided

Doctolib

Expiration Date

Until further notice

Requirements

At least 5 years of site reliability engineering experience
Experience with AWS, Terraform, Kubernetes, and GitHub Actions, supporting the deployment of applications developed on the JVM and/or in TypeScript
Proactive, curious, collaborative and eager to learn
Proven experience with cloud services such as AWS, Azure or Google Cloud
Solid understanding of containerization and orchestration technologies (Docker and Kubernetes)
Proficiency in at least one programming language (Go, Java, Ruby, Python etc.) and a deep understanding of infrastructure as code principles

Job Responsibility

Collaborating with Feature teams to ensure services align with developer needs
Driving improvements by evaluating new technologies and processes
Defining best practices ("golden paths") for software development and deployment
Developing and maintaining tools and services that facilitate best practices
Ensuring reliability, scalability, traceability, and monitoring of services and infrastructure
Collaborating on roadmap delivery

What we offer

Company health insurance through partner Allianz
Minimum 28 days of paid leave
Parent Care Program: one additional month of leave on top of legal parental leave
Free mental health and coaching services through partner Moka.care
For caregivers and workers with disabilities, a package including adaptation of remote policy, extra days off for medical reasons, and psychological support
Flexible workplace policy offering both hybrid and office-based modes
Work from EU countries and the UK for up to 10 days per year
Reimbursement of public transportation

Fulltime

Site Reliability Engineer – Infrastructure

The Site Reliability Engineer (SRE) will ensure the reliability, scalability, an...

Location

United States, Atlanta

Salary:

Not provided

Tier4 Group

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience
Proven experience as a Site Reliability Engineer or Systems Engineer
Strong proficiency in Terraform and Ansible for infrastructure automation
Hands-on experience with Kubernetes, Docker, or other container orchestration tools
Proficiency in scripting languages such as Python or Bash
In-depth knowledge of Google Cloud Platform (GCP) services including compute, networking, storage, Kubernetes, and security
Solid understanding of VMware virtualization and enterprise storage systems (e.g., Pure Storage)
Experience with networking technologies including VLANs, VPNs, and routing protocols
Strong grasp of IT infrastructure and operations principles, including systems integration and automation best practices
Excellent communication and collaboration skills

Job Responsibility

Design, build, and maintain secure, compliant infrastructure using Infrastructure as Code tools such as Terraform and Ansible
Automate provisioning and management of servers, storage, networks, Kubernetes clusters, and related systems across cloud and on-premises environments
Develop tools and processes for automated deployment, configuration, monitoring, and alerting
Collaborate with cross-functional teams to implement scalable and reliable cloud and data center solutions
Participate in incident response, on-call rotations, and post-incident reviews to improve system resilience
Monitor system performance and availability using service-level agreements (SLAs), objectives (SLOs), and indicators (SLIs); proactively troubleshoot and resolve reliability, performance, or security issues
Create and maintain disaster recovery and business continuity plans for critical systems
Continuously analyze and improve infrastructure efficiency, scalability, and performance
Stay current with emerging technologies and recommend tools or practices to enhance platform capabilities

Fulltime

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...

Location

Ireland, Dublin

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
Proficiency in Python, Go, or Java, with strong code review and readability standards
Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
Ability to think and act under pressure
Strong communication skills

Job Responsibility

Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling

Fulltime

Cloud Engineer / Site Reliability Engineer (SRE)

Location

United States, Orlando

Salary:

75.00 USD / Hour

Beacon Hill

Expiration Date

Until further notice

Requirements

Strong hands-on AWS experience with solid understanding of core AWS services
Experience supporting and troubleshooting AWS and Azure cloud environments
Terraform experience for Infrastructure as Code
Docker/containerization experience
Strong troubleshooting and problem-solving skills
Ability to translate requirements into technical execution
Experience performing cloud architecture and diagramming
Experience supporting deployments, environments, and site standups
Strong communication and collaboration skills

Job Responsibility

Support cloud infrastructure and deployments across AWS and Azure
Troubleshoot infrastructure and application-related cloud issues
Build and maintain Terraform-based infrastructure
Support Docker/containerized environments
Create architecture diagrams and technical documentation
Work closely with engineering and project teams to execute cloud initiatives
Assist with automation and operational improvement efforts

Fulltime
