Site Reliability Engineer – Infrastructure Job at Tier4 Group (Atlanta)

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...

Location

United States

Salary:

113082.00 - 175725.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

6+ years of experience in an SRE/Operations/DevOps role as part of a team
Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
Experience designing and managing infrastructure security for large fleets of diverse services
Experience with technical response during security incidents
Experience with package management on Linux systems (we use Debian)
Strong Linux system-level troubleshooting skills
History of automating tasks and processes, identifying process gaps, and finding automation opportunities
Strong English language skills (verbal and written) and ability to work both independently and as an effective part of a globally distributed team working across multiple time zones

Job Responsibility

Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
Leading continuous improvement by automating the installation, configuration and maintenance of services on our platform
Working closely with product teams to help them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
Collaborating with a global, cross-functional team in an asynchronous communication environment
Mentoring peers in your areas of technical and operational strength
Ability and willingness to travel 1-2 times a year for in-person events and team meetings
Most importantly, share our values and work in accordance with them

Fulltime

Site Reliability Engineer - Infrastructure

We are actively looking for a talented Site Reliability Engineer to join the Inf...

Location

United States, San Mateo

Salary:

130000.00 - 280000.00 USD / Year

Verkada

Expiration Date

Until further notice

Requirements

Must have a BS, MS, or PhD in Computer Science, or similar technical field of study
Minimum of 1-2+ years of experience in a similar position
Experience in at least one scripting language (preferably Python)
Experience with one of the major cloud platforms (preferably AWS)
Experience with Kubernetes
Experience with Terraform
Enthusiasm for learning about new technologies and tooling

Job Responsibility

Keep our infrastructure up!
Improve infrastructure automation
Define infrastructure roadmap
Provide technical support for engineers on other teams

What we offer

Generous company paid medical, dental & vision insurance coverage
Unlimited paid time off & 11 companywide paid holidays
Wellness allowance
Commuter benefits
Healthy lunches and dinners provided daily
Generous paid parental leave policy & fertility benefits

Fulltime

Forward Deployed Engineer - Site Reliability / Infrastructure

We're looking for a Forward Deployed Engineer to embed directly with a strategic...

Location

United States, Bellevue, WA, San Francisco Office

Salary:

240000.00 - 425000.00 USD / Year

Lambda

Expiration Date

Until further notice

Requirements

6+ years of experience in a SRE, software engineer, or similar role, with a deep knowledge of running Linux clusters and systems
Strong programming skills in Go and Python
Experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
Hands-on experience with AI/ML workload management tools (Volcano, Kubeflow, or similar)
Can work either independently with limited direction or as part of a team
Familiarity with observability tools like Prometheus, Grafana, FluentBit, and CI/CD pipelines
Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar
Excellent communication skills with the ability to translate technical complexity for diverse audiences
Executive presence and ability to represent Lambda in customer-facing situations

Job Responsibility

Embed on-site with a named strategic customer, becoming an extension of their team
Act as the primary technical liaison between Lambda and the customer organization
Navigate ambiguous requirements to identify root problems and define clear technical solutions
Drive alignment across internal Lambda teams and customer stakeholders
Scope, sequence, and build full-stack solutions that deliver measurable business value
Design and implement infrastructure optimizations for AI/ML workloads at scale
Debug complex distributed systems issues across the infrastructure stack
Ship iteratively and learn fast, adjusting approach based on customer feedback and results
Identify reusable patterns from customer engagements that can scale across Lambda's customer base
Surface field intelligence that influences Lambda's product roadmap

What we offer

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible paid time off plan

Fulltime

Product Infrastructure Engineer - Site Reliability

As an Infrastructure Engineer - Site Reliability, you’ll be responsible for desig...

Location

United States, Palo Alto

Salary:

Not provided

Zyphra

Expiration Date

Until further notice

Requirements

Experience in high-performance compute environments, such as ML clusters or GPU farms
Background in infrastructure as code (e.g., Ansible, Terraform)
Experience designing reliable environments for experimental workloads and reproducible runs
Knowledge of compliance and audit standards in deployment and system security
Experience with load testing, fault injection, and chaos engineering to harden systems under stress
Passion for building tooling that makes infrastructure invisible and reliable for end users

Job Responsibility

Designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable
Building and improving observability systems (monitoring, logging, alerting)
Designing resilient build and deployment systems across research and production environments
Implementing secure release processes with strong auditability and rollback support
Collaborating closely with ML engineers, DevOps, and infra teams to improve system reliability and performance
Leading incident response, root-cause analysis, and postmortems with a focus on learning and prevention

What we offer

Comprehensive medical, dental, vision, and FSA plans
Competitive compensation and 401(k)
Relocation and immigration support on a case-by-case basis
On-site meals prepared by a dedicated culinary team
Thursday Happy Hours

Fulltime

Site Reliability & Infrastructure Automation Engineer

Piper Companies is hiring a Site Reliability & Infrastructure Automation Enginee...

Location

United States, Durham

Salary:

110000.00 - 140000.00 USD / Year

Piper Companies

Expiration Date

Until further notice

Requirements

2+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
Strong experience with AWS services (Lambda, S3, CloudFormation, etc.)
Hands-on experience with Terraform for Infrastructure-as-Code
Experience with Python scripting and automation
Knowledge of observability tools such as Dynatrace, AppDynamics, or similar platforms
Bachelor’s degree (preferably in Computer Science, Engineering, or related technical field)

Job Responsibility

Design and maintain synthetic monitoring for critical applications, services, and APIs
Build dashboards, alerts, and telemetry to improve system observability
Automate operational tasks using Python, PowerShell, or similar scripting languages
Develop and manage Infrastructure-as-Code using Terraform and cloud-native tools
Troubleshoot cloud and SaaS environments while improving reliability and performance
Collaborate across development, infrastructure, and application teams to enhance operational best practices

What we offer

Health
Vision
Dental
PTO
Paid Holidays
10% bonus
7.5% long-term incentive

Fulltime

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...

Location

Ireland, Dublin

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
Proficiency in Python, Go, or Java, with strong code review and readability standards
Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
Ability to think and act under pressure
Strong communication skills

Job Responsibility

Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling

Fulltime

Cloud Engineer / Site Reliability Engineer (SRE)

Location

United States, Orlando

Salary:

75.00 USD / Hour

Beacon Hill

Expiration Date

Until further notice

Requirements

Strong hands-on AWS experience with solid understanding of core AWS services
Experience supporting and troubleshooting AWS and Azure cloud environments
Terraform experience for Infrastructure as Code
Docker/containerization experience
Strong troubleshooting and problem-solving skills
Ability to translate requirements into technical execution
Experience performing cloud architecture and diagramming
Experience supporting deployments, environments, and site standups
Strong communication and collaboration skills

Job Responsibility

Support cloud infrastructure and deployments across AWS and Azure
Troubleshoot infrastructure and application-related cloud issues
Build and maintain Terraform-based infrastructure
Support Docker/containerized environments
Create architecture diagrams and technical documentation
Work closely with engineering and project teams to execute cloud initiatives
Assist with automation and operational improvement efforts

Fulltime

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...

Location

United States, San Francisco

Salary:

230000.00 - 345000.00 USD / Year

Lambda

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
Strong understanding of Linux-based systems in a distributed environment
Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Passion for continuous improvement and innovation

Job Responsibility

Define Fleet Health metrics and indicators to objectively measure and improve system availability
Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
Create runbooks and automated remediations for common failure scenarios
Build in automation and auditing to ensure compliance and improve efficiency and productivity
Participate in on-call rotations and provide support for incident response and resolution
Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic

What we offer

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible paid time off plan

Fulltime
