Site Reliability Engineer - Infrastructure

United States, San Mateo · 130000.00 - 280000.00 USD / Year · Job Posted January 06, 2026

Job Description

We are actively looking for a talented Site Reliability Engineer to join the Infrastructure team. As a member of the team, you will manage our infrastructure and continue to make it easier for our engineers to monitor and scale it, whether by adopting third-party tools or designing your own. Example projects include optimizing our cluster cost efficiency, enforcing security requirements, improving monitoring and alerting, and adopting a service mesh.

Job Responsibility

  • Keep our infrastructure up!
  • Improve infrastructure automation
  • Define infrastructure roadmap
  • Provide technical support for engineers on other teams

Requirements

  • Must have a BS, MS, or PhD in Computer Science, or similar technical field of study
  • Minimum of 1-2 years of experience in a similar position
  • Experience in at least one scripting language (preferably Python)
  • Experience with one of the major cloud platforms (preferably AWS)
  • Experience with Kubernetes
  • Experience with Terraform
  • Enthusiasm for learning about new technologies and tooling

Nice to have

  • Experience with ArgoCD
  • Experience writing Kubernetes controllers
  • Experience with service mesh

What we offer

  • Generous company paid medical, dental & vision insurance coverage
  • Unlimited paid time off & 11 companywide paid holidays
  • Wellness allowance
  • Commuter benefits
  • Healthy lunches and dinners provided daily
  • Generous paid parental leave policy & fertility benefits

Similar Jobs for Site Reliability Engineer - Infrastructure

8 matching positions

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...
Location: United States
Salary: 113082.00 - 175725.00 USD / Year
Company: Wikimedia Foundation
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience designing and managing infrastructure security for large fleets of diverse services
  • Experience with technical response during security incidents
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
Job Responsibility
  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Working closely with product teams to help them bring scalable functionality to our users, assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
  • Ability and willingness to travel 1-2 times a year for in-person events and team meetings
  • Most importantly, share our values and work in accordance with them
Employment Type: Full-time

Site Reliability Engineer – Infrastructure

The Site Reliability Engineer (SRE) will ensure the reliability, scalability, an...
Location: United States, Atlanta
Salary: Not provided
Company: Tier4 Group
Expiration Date: Until further notice
Requirements
  • Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience
  • Proven experience as a Site Reliability Engineer or Systems Engineer
  • Strong proficiency in Terraform and Ansible for infrastructure automation
  • Hands-on experience with Kubernetes, Docker, or other container orchestration tools
  • Proficiency in scripting languages such as Python or Bash
  • In-depth knowledge of Google Cloud Platform (GCP) services including compute, networking, storage, Kubernetes, and security
  • Solid understanding of VMware virtualization and enterprise storage systems (e.g., Pure Storage)
  • Experience with networking technologies including VLANs, VPNs, and routing protocols
  • Strong grasp of IT infrastructure and operations principles, including systems integration and automation best practices
  • Excellent communication and collaboration skills
Job Responsibility
  • Design, build, and maintain secure, compliant infrastructure using Infrastructure as Code tools such as Terraform and Ansible
  • Automate provisioning and management of servers, storage, networks, Kubernetes clusters, and related systems across cloud and on-premises environments
  • Develop tools and processes for automated deployment, configuration, monitoring, and alerting
  • Collaborate with cross-functional teams to implement scalable and reliable cloud and data center solutions
  • Participate in incident response, on-call rotations, and post-incident reviews to improve system resilience
  • Monitor system performance and availability using service-level agreements (SLAs), objectives (SLOs), and indicators (SLIs); proactively troubleshoot and resolve reliability, performance, or security issues
  • Create and maintain disaster recovery and business continuity plans for critical systems
  • Continuously analyze and improve infrastructure efficiency, scalability, and performance
  • Stay current with emerging technologies and recommend tools or practices to enhance platform capabilities
Employment Type: Full-time

Forward Deployed Engineer - Site Reliability / Infrastructure

We're looking for a Forward Deployed Engineer to embed directly with a strategic...
Location: United States, Bellevue, WA, San Francisco Office
Salary: 240000.00 - 425000.00 USD / Year
Company: Lambda
Expiration Date: Until further notice
Requirements
  • 6+ years of experience in an SRE, software engineer, or similar role, with a deep knowledge of running Linux clusters and systems
  • Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
  • Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
  • Hands-on experience with AI/ML workload management tools (Volcano, Kubeflow, or similar)
  • Can work either independently with limited direction or as part of a team
  • Familiarity with observability tools like Prometheus, Grafana, FluentBit, and CI/CD pipelines
  • Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar
  • Excellent communication skills with the ability to translate technical complexity for diverse audiences
  • Executive presence and ability to represent Lambda in customer-facing situations
Job Responsibility
  • Embed on-site with a named strategic customer, becoming an extension of their team
  • Act as the primary technical liaison between Lambda and the customer organization
  • Navigate ambiguous requirements to identify root problems and define clear technical solutions
  • Drive alignment across internal Lambda teams and customer stakeholders
  • Scope, sequence, and build full-stack solutions that deliver measurable business value
  • Design and implement infrastructure optimizations for AI/ML workloads at scale
  • Debug complex distributed systems issues across the infrastructure stack
  • Ship iteratively and learn fast, adjusting approach based on customer feedback and results
  • Identify reusable patterns from customer engagements that can scale across Lambda's customer base
  • Surface field intelligence that influences Lambda's product roadmap
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
Employment Type: Full-time

Product Infrastructure Engineer - Site Reliability

As an Infrastructure Engineer - Site Reliability, you’ll be responsible for desig...
Location: United States, Palo Alto
Salary: Not provided
Company: Zyphra
Expiration Date: Until further notice
Requirements
  • Experience in high-performance compute environments, such as ML clusters or GPU farms
  • Background in infrastructure as code (e.g., Ansible, Terraform)
  • Experience designing reliable environments for experimental workloads and reproducible runs
  • Knowledge of compliance and audit standards in deployment and system security
  • Experience with load testing, fault injection, and chaos engineering to harden systems under stress
  • Passion for building tooling that makes infrastructure invisible and reliable for end users
Job Responsibility
  • Designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable
  • Building and improving observability systems (monitoring, logging, alerting)
  • Designing resilient build and deployment systems across research and production environments
  • Implementing secure release processes with strong auditability and rollback support
  • Collaborating closely with ML engineers, DevOps, and infra teams to improve system reliability and performance
  • Leading incident response, root-cause analysis, and postmortems with a focus on learning and prevention
What we offer
  • Comprehensive medical, dental, vision, and FSA plans
  • Competitive compensation and 401(k)
  • Relocation and immigration support on a case-by-case basis
  • On-site meals prepared by a dedicated culinary team
  • Thursday Happy Hours
Employment Type: Full-time

Site Reliability & Infrastructure Automation Engineer

Piper Companies is hiring a Site Reliability & Infrastructure Automation Enginee...
Location: United States, Durham
Salary: 110000.00 - 140000.00 USD / Year
Company: Piper Companies
Expiration Date: Until further notice
Requirements
  • 2+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering
  • Strong experience with AWS services (Lambda, S3, CloudFormation, etc.)
  • Hands-on experience with Terraform for Infrastructure-as-Code
  • Experience with Python scripting and automation
  • Knowledge of observability tools such as Dynatrace, AppDynamics, or similar platforms
  • Bachelor’s degree (preferably in Computer Science, Engineering, or related technical field)
Job Responsibility
  • Design and maintain synthetic monitoring for critical applications, services, and APIs
  • Build dashboards, alerts, and telemetry to improve system observability
  • Automate operational tasks using Python, PowerShell, or similar scripting languages
  • Develop and manage Infrastructure-as-Code using Terraform and cloud-native tools
  • Troubleshoot cloud and SaaS environments while improving reliability and performance
  • Collaborate across development, infrastructure, and application teams to enhance operational best practices
What we offer
  • Health
  • Vision
  • Dental
  • PTO
  • Paid Holidays
  • 10% bonus
  • 7.5% long-term incentive
Employment Type: Full-time

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...
Location: Ireland, Dublin
Salary: Not provided
Company: General Motors
Expiration Date: Until further notice
Requirements
  • 8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
  • Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
  • Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
  • Proficiency in Python, Go, or Java, with strong code review and readability standards
  • Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
  • Ability to think and act under pressure
  • Strong communication skills
Job Responsibility
  • Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
  • Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
  • Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
  • Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
  • Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
  • Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
  • Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
  • Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling
Employment Type: Full-time

Cloud Engineer / Site Reliability Engineer (SRE)

Location: United States, Orlando
Salary: 75.00 USD / Hour
Company: Beacon Hill
Expiration Date: Until further notice
Requirements
  • Strong hands-on AWS experience with solid understanding of core AWS services
  • Experience supporting and troubleshooting AWS and Azure cloud environments
  • Terraform experience for Infrastructure as Code
  • Docker/containerization experience
  • Strong troubleshooting and problem-solving skills
  • Ability to translate requirements into technical execution
  • Experience performing cloud architecture and diagramming
  • Experience supporting deployments, environments, and site standups
  • Strong communication and collaboration skills
Job Responsibility
  • Support cloud infrastructure and deployments across AWS and Azure
  • Troubleshoot infrastructure and application-related cloud issues
  • Build and maintain Terraform-based infrastructure
  • Support Docker/containerized environments
  • Create architecture diagrams and technical documentation
  • Work closely with engineering and project teams to execute cloud initiatives
  • Assist with automation and operational improvement efforts
Employment Type: Full-time

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location: United States, San Francisco
Salary: 230000.00 - 345000.00 USD / Year
Company: Lambda
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
Employment Type: Full-time