Product Infrastructure Engineer - Site Reliability Job at Zyphra (Palo Alto)

Senior Site Reliability Engineer, Infrastructure Foundations

The Wikimedia Foundation is looking for a Senior Site Reliability Engineer to su...

Location

United States

Salary:

113082.00 - 175725.00 USD / Year

Wikimedia Foundation

Expiration Date

Until further notice

Requirements

6+ years of experience in an SRE/Operations/DevOps role as part of a team
Experience with shell and any scripting languages used in an SRE context (Python, Go, Bash, Ruby; we primarily use Python) and configuration management tools (Puppet, Ansible; we use Puppet)
Experience designing and managing infrastructure security for large fleets of diverse services
Experience with technical response during security incidents
Experience with package management on Linux systems (we use Debian)
Strong Linux system-level troubleshooting skills
History of automating tasks and processes, identifying process gaps, and finding automation opportunities
Strong English language skills (verbal and written) and ability to work both independently and as an effective part of a globally distributed team working across multiple time zones

Job Responsibility

Performing day-to-day operational/DevOps tasks on Wikimedia’s public-facing infrastructure (deployment, maintenance, configuration, troubleshooting)
Implementing and utilizing configuration management and deployment tools (Puppet, Kubernetes)
Leading continuous improvement by automating the installation, configuration and maintenance of services on our platform
Working closely with product teams to help them bring scalable functionality to our users, assisting in the architectural design of new services and making them operate at scale
Participating in a 24/7 on-call rotation shared across the broader SRE team. This includes taking part in incident response, diagnosis and follow-up on system outages or alerts across Wikimedia’s production infrastructure
Collaborating with a global, cross-functional team in an asynchronous communication environment
Mentoring peers in your areas of technical and operational strength
Ability and willingness to travel 1-2 times a year for in-person events and team meetings
Most importantly, share our values and work in accordance with them

Fulltime

Forward Deployed Engineer - Site Reliability / Infrastructure

We're looking for a Forward Deployed Engineer to embed directly with a strategic...

Location

United States, Bellevue, WA, San Francisco Office

Salary:

240000.00 - 425000.00 USD / Year

Lambda

Expiration Date

Until further notice

Requirements

6+ years of experience in an SRE, software engineering, or similar role, with a deep knowledge of running Linux clusters and systems
Strong programming skills in Go and Python
Experience with GitOps (e.g., ArgoCD), Helm, and Kubernetes operators
Proven experience operating Kubernetes clusters in production environments (on-prem, EKS, GKE, or similar)
Hands-on experience with AI/ML workload management tools (Volcano, Kubeflow, or similar)
Can work either independently with limited direction or as part of a team
Familiarity with observability tools like Prometheus, Grafana, FluentBit, and CI/CD pipelines
Proven experience provisioning Kubernetes using tools such as kubeadm, Cluster API, or similar
Excellent communication skills with the ability to translate technical complexity for diverse audiences
Executive presence and ability to represent Lambda in customer-facing situations

Job Responsibility

Embed on-site with a named strategic customer, becoming an extension of their team
Act as the primary technical liaison between Lambda and the customer organization
Navigate ambiguous requirements to identify root problems and define clear technical solutions
Drive alignment across internal Lambda teams and customer stakeholders
Scope, sequence, and build full-stack solutions that deliver measurable business value
Design and implement infrastructure optimizations for AI/ML workloads at scale
Debug complex distributed systems issues across the infrastructure stack
Ship iteratively and learn fast, adjusting approach based on customer feedback and results
Identify reusable patterns from customer engagements that can scale across Lambda's customer base
Surface field intelligence that influences Lambda's product roadmap

What we offer

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible paid time off plan

Fulltime

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...

Location

Ireland, Dublin

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
Proficiency in Python, Go, or Java, with strong code review and readability standards
Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
Ability to think and act under pressure
Strong communication skills

Job Responsibility

Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling

Fulltime

Senior Site Reliability Engineer - Fleet Reliability

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serv...

Location

United States, San Francisco

Salary:

230000.00 - 345000.00 USD / Year

Lambda

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
Strong understanding of Linux-based systems in a distributed environment
Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
Excellent problem-solving and troubleshooting skills
Strong communication and collaboration skills
Passion for continuous improvement and innovation

Job Responsibility

Define Fleet Health metrics and indicators to objectively measure and improve system availability
Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
Create runbooks and automated remediations for common failure scenarios
Build in automation and auditing to ensure compliance and improve efficiency and productivity
Participate in on-call rotations and provide support for incident response and resolution
Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc

What we offer

Generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401k Plan with 2% company match (USA employees)
Flexible paid time off plan

Fulltime

Intermediate Site Reliability Engineer SRE – AI Reliability & Automation

At PointClickCare our mission is simple: to help providers deliver exceptional c...

Location

Canada, Mississauga

Salary:

115000.00 - 128000.00 CAD / Year

PointClickCare

Expiration Date

Until further notice

Requirements

5+ years' experience in software engineering
Experience with SRE principles
Experience with AI/ML in production environments
A passion for automation, intelligent systems, and operational excellence
Strong debugging, problem-solving, and system design skills
Languages: Python, Java, Bash, Terraform
Platforms: Azure, Kubernetes, Docker
Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
CI/CD: Jenkins, ArgoCD, Spinnaker

Job Responsibility

Build ML-based anomaly detection and pattern recognition systems
Enhance telemetry with smart tagging and metadata for better AI insights
Develop event-driven workflows and self-healing systems using AI triggers
Automate incident response with generative AI and custom AI agent orchestration
Use time-series forecasting and predictive modelling to anticipate failures
Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
Build scalable, fault-tolerant systems in a cloud-native environment
Participate in on-call rotations and lead incident response for critical systems
Skilled in API integration for streamlined data exchange and system connectivity
Run internal AIOps workshops and help teams adopt AI maturity models

What we offer

Benefits starting from Day 1!
Retirement Plan Matching
Flexible Paid Time Off
Wellness Support Programs and Resources
Parental & Caregiver Leaves
Fertility & Adoption Support
Continuous Development Support Program
Employee Assistance Program
Allyship and Inclusion Communities
Employee Recognition … and more!

Fulltime

Site Reliability Engineer

Location

Malaysia, Kuala Lumpur

Salary:

15000.00 MYR / Month

Randstad

Expiration Date

August 21, 2026

Requirements

Cloud Infrastructure: AWS
Containerization: Kubernetes and Docker
Operating Systems: Linux and Unix Systems
Database Systems: Oracle Database
Programming/Scripting: Python, Java, or Go for automation scripting
Automation & Infrastructure Tools: Ansible and Terraform
Monitoring & Observability: Prometheus, Grafana, and Nagios
Integration: API and Networking Integration

Job Responsibility

Maintain continuous system monitoring and configure active alerts to prevent failures
Automate manual operational tasks, system monitoring, and infrastructure provisioning
Participate in deep-dive troubleshooting and rigorous post-mortem analysis to minimize downtime
Manage the technical resumption of high-priority, Service at Risk (S@R), and medium/high severity incidents within SLAs
Direct second- and third-level support teams and perform Root Cause Analysis (RCA)
Review system dependencies and manage changes, releases, and rollouts for minimal stability impact
Lead the team in upholding the organization's strict conduct, compliance, and market principles
Take end-to-end accountability for incident, problem, change, and risk management related to the production platform
Surface operational/security risks and provide monthly governance dashboards outlining trends and Service Improvement Plans (SIP)

Fulltime

Associate Site Reliability Engineer

The Associate Site Reliability Engineer helps keep Marketing Technology services...

Location

United States, Santa Monica

Salary:

30.53 - 56.48 USD / Hour

Activision

Expiration Date

Until further notice

Requirements

1–3 years of experience in SRE, DevOps, cloud infrastructure, or software engineering internships / early-career roles
Familiarity with Linux, HTTP, DNS, containers, Kubernetes, Git-based workflows, and scripting in Bash, Python, or similar languages
Exposure to monitoring, logs, metrics, dashboards, and incident management practices
Strong troubleshooting mindset, clear communication, and willingness to learn in a production-support environment

Job Responsibility

Monitor service health, respond to alerts, and participate in incident response for cloud and application environments
Investigate reliability issues across Kubernetes, networking, DNS, application runtime behavior, and dependent services, escalating when appropriate
Build and maintain dashboards, alerting, runbooks, and operational documentation that improve detection and recovery speed
Contribute small automations and scripts that reduce repetitive operational work and improve environment hygiene
Support release and deployment reliability by validating changes, helping with rollback readiness, and improving change safety
Participate in post-incident follow-up and help close corrective actions that prevent recurrence

What we offer

Medical, dental, vision, health savings account or health reimbursement account, healthcare spending accounts, dependent care spending accounts, life and AD&D insurance, disability insurance
401(k) with Company match, tuition reimbursement, charitable donation matching
Paid holidays and vacation, paid sick time, floating holidays, compassion and bereavement leaves, parental leave
Mental health & wellbeing programs, fitness programs, free and discounted games, and a variety of other voluntary benefit programs like supplemental life & disability, legal service, ID protection, rental insurance, and others
If the Company requires that you move geographic locations for the job, then you may also be eligible for relocation assistance

Fulltime

Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to strengthen the re...

Location

United States, San Francisco

Salary:

Not provided

Robert Half

Expiration Date

Until further notice

Requirements

7+ years of experience in Site Reliability Engineering, infrastructure engineering, or a closely related production environment role
Strong background operating, supporting, and troubleshooting distributed systems at scale
Hands-on experience with observability platforms such as Datadog, Grafana, OpenTelemetry, CloudWatch, or similar tools
Proven involvement in on-call operations, incident management, and reliability-focused problem resolution
Practical experience defining and using SLIs, SLOs, and error budgets to guide service reliability decisions
Familiarity with AWS environments, including serverless and container-based architectures
Experience working with relational databases such as Postgres and performance analysis in production systems
Ability to write automation scripts or lightweight tooling in languages such as Python or Bash, with strong judgment around failure modes and system design

Job Responsibility

Establish measurable reliability standards for critical services by creating and maintaining service indicators, objectives, and error budget practices
Take ownership of production stability by monitoring uptime, latency, and availability, and driving improvements that reduce operational risk
Lead live incident response efforts, coordinate troubleshooting during outages, and ensure issues are resolved efficiently and thoroughly
Run blameless post-incident reviews, document findings clearly, and track corrective actions through completion
Design and enhance observability across logs, metrics, and distributed tracing using tools such as Datadog, CloudWatch, Grafana, OpenTelemetry, and Sentry
Improve alert quality and dashboard design so engineering teams can quickly identify meaningful system issues without unnecessary noise
Evaluate system behavior under load, uncover performance constraints, and recommend changes that improve scalability and resource efficiency
Build automation and internal tooling that streamline operational work, strengthen deployment safety, and support incident management, debugging, and capacity planning
Contribute to infrastructure and delivery workflows across AWS, Terraform, Ansible, Linux, and GitHub Actions with a focus on dependable releases and resilient systems
Partner with security and compliance stakeholders to support operational standards, audit readiness, and the integration of monitoring into broader engineering practices

What we offer

Medical, vision, dental, and life and disability insurance
Enrollment in company 401(k) plan
