
Product Infrastructure Engineer - Site Reliability

United States, Palo Alto · Job Posted January 13, 2026

Job Description

As an Infrastructure Engineer - Site Reliability, you’ll be responsible for designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable. Your work will be essential to ensuring the reliability and reproducibility of ML workloads, the safety and control of deployments, and the long-term maintainability of our compute environments. This role is ideal for someone who loves building systems that make other teams faster, safer, and more productive.

Job Responsibility

  • Designing and maintaining the systems that keep Zyphra’s infrastructure robust, observable, secure, and scalable
  • Building and improving observability systems (monitoring, logging, alerting)
  • Designing resilient build and deployment systems across research and production environments
  • Implementing secure release processes with strong auditability and rollback support
  • Collaborating closely with ML engineers, DevOps, and infra teams to improve system reliability and performance
  • Leading incident response, root-cause analysis, and postmortems with a focus on learning and prevention

Requirements

  • Experience in high-performance compute environments, such as ML clusters or GPU farms
  • Background in infrastructure as code (e.g., Ansible, Terraform)
  • Experience designing reliable environments for experimental workloads and reproducible runs
  • Knowledge of compliance and audit standards in deployment and system security
  • Experience with load testing, fault injection, and chaos engineering to harden systems under stress
  • Passion for building tooling that makes infrastructure invisible and reliable for end users

Nice to have

  • Familiarity with software release engineering for ML/AI systems
  • Experience with infrastructure as code (e.g., Ansible, Terraform)
  • Prior work supporting ML/AI infrastructure, including GPU management and workload optimization
  • Exposure to backend development for ML model serving (e.g., vLLM, Ray, SGLang)
  • Experience working with cloud platforms such as AWS, Azure, or GCP
  • Familiarity with containers (Docker, Apptainer) and their integration with scheduling systems (Slurm, Kubernetes)

What we offer

  • Comprehensive medical, dental, vision, and FSA plans
  • Competitive compensation and 401(k)
  • Relocation and immigration support on a case-by-case basis
  • On-site meals prepared by a dedicated culinary team
  • Thursday Happy Hours

Similar Jobs for Product Infrastructure Engineer - Site Reliability

8 matching positions

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...
Location
United States
Salary:
113082.00 - 175725.00 USD / Year
Wikimedia Foundation
Expiration Date
Until further notice
Requirements
  • 6+ years of experience in an SRE/Operations/DevOps role as part of a team
  • Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
  • Experience designing and managing infrastructure security for large fleets of diverse services
  • Experience with technical response during security incidents
  • Experience with package management on Linux systems (we use Debian)
  • Strong Linux system-level troubleshooting skills
  • History of automating tasks and processes, identifying process gaps, and finding automation opportunities
  • Strong English language skills (verbal and written) and ability to work independently, as an effective part of a globally distributed team working across multiple time zones
Job Responsibility
  • Performing day-to-day operational/DevOps tasks on Wikimedia’s public facing infrastructure (deployment, maintenance, configuration, troubleshooting)
  • Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
  • Leading continuous improvement, by automating the installation, configuration and maintenance of services on our platform
  • Work closely with product teams helping them bring scalable functionality to our users by assisting in the architectural design of new services and making them operate at scale
  • Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
  • Collaborating with a global, cross-functional team in an asynchronous communication environment
  • Mentoring peers in your areas of technical and operational strength
  • Ability and willingness to travel 1-2 times a year for in-person events and team meetings
  • Most importantly, share our values and work in accordance with them
  • Fulltime

Forward Deployed Engineer - Site Reliability / Infrastructure

We're looking for a Forward Deployed Engineer to embed directly with a strategic...
Location
United States, Bellevue, WA, San Francisco Office
Salary:
240000.00 - 425000.00 USD / Year
Lambda
Expiration Date
Until further notice
Requirements
  • 6+ years of experience in an SRE, software engineer, or similar role, with a deep knowledge of running Linux clusters and systems
  • Strong programming skills in Go and Python; experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
  • Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
  • Hands-on experience with AI/ML workload management tools (Volcano, Kubeflow, or similar)
  • Can work either independently with limited direction or as part of a team
  • Familiarity with observability tools like Prometheus, Grafana, FluentBit, and CI/CD pipelines
  • Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar
  • Excellent communication skills with the ability to translate technical complexity for diverse audiences
  • Executive presence and ability to represent Lambda in customer-facing situations
Job Responsibility
  • Embed on-site with a named strategic customer, becoming an extension of their team
  • Act as the primary technical liaison between Lambda and the customer organization
  • Navigate ambiguous requirements to identify root problems and define clear technical solutions
  • Drive alignment across internal Lambda teams and customer stakeholders
  • Scope, sequence, and build full-stack solutions that deliver measurable business value
  • Design and implement infrastructure optimizations for AI/ML workloads at scale
  • Debug complex distributed systems issues across the infrastructure stack
  • Ship iteratively and learn fast, adjusting approach based on customer feedback and results
  • Identify reusable patterns from customer engagements that can scale across Lambda's customer base
  • Surface field intelligence that influences Lambda's product roadmap
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
  • Fulltime

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...
Location
Ireland, Dublin
Salary:
Not provided
General Motors
Expiration Date
Until further notice
Requirements
  • 8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
  • Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
  • Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
  • Proficiency in Python, Go, or Java, with strong code review and readability standards
  • Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
  • Ability to think and act under pressure
  • Strong communication skills
Job Responsibility
  • Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
  • Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
  • Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
  • Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
  • Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
  • Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
  • Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
  • Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling
  • Fulltime

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...
Location
United States, San Francisco
Salary:
230000.00 - 345000.00 USD / Year
Lambda
Expiration Date
Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation
Job Responsibility
  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, and SumoLogic
What we offer
  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401k Plan with 2% company match (USA employees)
  • Flexible paid time off plan
  • Fulltime

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location
Canada, Mississauga
Salary:
115000.00 - 128000.00 CAD / Year
PointClickCare
Expiration Date
Until further notice
Requirements
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibility
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer
  • Benefits starting from Day 1!
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more!
  • Fulltime

Site Reliability Engineer

Location
Malaysia, Kuala Lumpur
Salary:
15000.00 MYR / Month
Randstad
Expiration Date
August 21, 2026
Requirements
  • Cloud Infrastructure: AWS
  • Containerization: Kubernetes and Docker
  • Operating Systems: Linux and Unix Systems
  • Database Systems: Oracle Database
  • Programming/Scripting: Python, Java, or Go for automation scripting
  • Automation & Infrastructure Tools: Ansible and Terraform
  • Monitoring & Observability: Prometheus, Grafana, and Nagios
  • Integration: API and Networking Integration
Job Responsibility
  • Maintain continuous system monitoring and configure active alerts to prevent failures
  • Automate manual operational tasks, system monitoring, and infrastructure provisioning
  • Participate in deep-dive troubleshooting and rigorous post-mortem analysis to minimize downtime
  • Manage the technical resumption of high-priority, Service at Risk (S@R), and medium/high severity incidents within SLAs
  • Direct second- and third-level support teams and perform Root Cause Analysis (RCA)
  • Review system dependencies and manage changes, releases, and rollouts for minimal stability impact
  • Lead the team to actively achieve the organization's strict conduct, compliance, and market principles
  • Take end-to-end accountability for incident, problem, change, and risk management related to the production platform
  • Surface operational/security risks and provide monthly governance dashboards outlining trends and Service Improvement Plans (SIP)
  • Fulltime

Associate Site Reliability Engineer

The Associate Site Reliability Engineer helps keep Marketing Technology services...
Location
United States, Santa Monica
Salary:
30.53 - 56.48 USD / Hour
Activision
Expiration Date
Until further notice
Requirements
  • 1–3 years of experience in SRE, DevOps, cloud infrastructure, or software engineering internships / early-career roles
  • Familiarity with Linux, HTTP, DNS, containers, Kubernetes, Git-based workflows, and scripting in Bash, Python, or similar languages
  • Exposure to monitoring, logs, metrics, dashboards, and incident management practices
  • Strong troubleshooting mindset, clear communication, and willingness to learn in a production-support environment
Job Responsibility
  • Monitor service health, respond to alerts, and participate in incident response for cloud and application environments
  • Investigate reliability issues across Kubernetes, networking, DNS, application runtime behavior, and dependent services, escalating when appropriate
  • Build and maintain dashboards, alerting, runbooks, and operational documentation that improve detection and recovery speed
  • Contribute small automations and scripts that reduce repetitive operational work and improve environment hygiene
  • Support release and deployment reliability by validating changes, helping with rollback readiness, and improving change safety
  • Participate in post-incident follow-up and help close corrective actions that prevent recurrence
What we offer
  • Medical, dental, vision, health savings account or health reimbursement account, healthcare spending accounts, dependent care spending accounts, life and AD&D insurance, disability insurance
  • 401(k) with Company match, tuition reimbursement, charitable donation matching
  • Paid holidays and vacation, paid sick time, floating holidays, compassion and bereavement leaves, parental leave
  • Mental health & wellbeing programs, fitness programs, free and discounted games, and a variety of other voluntary benefit programs like supplemental life & disability, legal service, ID protection, rental insurance, and others
  • If the Company requires that you move geographic locations for the job, then you may also be eligible for relocation assistance
  • Fulltime

Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to strengthen the re...
Location
United States, San Francisco
Salary:
Not provided
Robert Half
Expiration Date
Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, infrastructure engineering, or a closely related production environment role
  • Strong background operating, supporting, and troubleshooting distributed systems at scale
  • Hands-on experience with observability platforms such as Datadog, Grafana, OpenTelemetry, CloudWatch, or similar tools
  • Proven involvement in on-call operations, incident management, and reliability-focused problem resolution
  • Practical experience defining and using SLIs, SLOs, and error budgets to guide service reliability decisions
  • Familiarity with AWS environments, including serverless and container-based architectures
  • Experience working with relational databases such as Postgres and performance analysis in production systems
  • Ability to write automation scripts or lightweight tooling in languages such as Python or Bash, with strong judgment around failure modes and system design
Job Responsibility
  • Establish measurable reliability standards for critical services by creating and maintaining service indicators, objectives, and error budget practices
  • Take ownership of production stability by monitoring uptime, latency, and availability, and driving improvements that reduce operational risk
  • Lead live incident response efforts, coordinate troubleshooting during outages, and ensure issues are resolved efficiently and thoroughly
  • Run blameless post-incident reviews, document findings clearly, and track corrective actions through completion
  • Design and enhance observability across logs, metrics, and distributed tracing using tools such as Datadog, CloudWatch, Grafana, OpenTelemetry, and Sentry
  • Improve alert quality and dashboard design so engineering teams can quickly identify meaningful system issues without unnecessary noise
  • Evaluate system behavior under load, uncover performance constraints, and recommend changes that improve scalability and resource efficiency
  • Build automation and internal tooling that streamline operational work, strengthen deployment safety, and support incident management, debugging, and capacity planning
  • Contribute to infrastructure and delivery workflows across AWS, Terraform, Ansible, Linux, and GitHub Actions with a focus on dependable releases and resilient systems
  • Partner with security and compliance stakeholders to support operational standards, audit readiness, and the integration of monitoring into broader engineering practices
What we offer
  • Medical, vision, dental, and life and disability insurance
  • Enrollment in company 401(k) plan