Staff Site Reliability Engineer Job at Airwallex (Singapore)

Staff Site Reliability Engineer

Trimble is seeking a Staff Site Reliability Engineer (P4) to join our Corporate ...

Location

India, Chennai

Salary:

Not provided

Trimble Inc.

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related field, or equivalent practical experience
Minimum of 10 years of experience in IT operations, including deep knowledge of networking, computing, and storage
Minimum of 5 years of experience with AWS and/or Azure cloud computing environments, with at least 2 years in an architect/design role
Windows and Linux deployment experience, including common services for each platform
Proficiency in at least one scripting language (preferably Python or PowerShell/.NET) and proficiency using Git for source control
Strong background in application operations, including Incident Management, Change Management, and Capacity Management
Excellent troubleshooting and problem-solving skills, knowledge of security best practices, a strong desire to learn independently, and exceptional written/verbal communication skills with a customer-service mindset

Job Responsibility

Cloud Architecture & Enhancement: Develop new shared public cloud services and enhance existing ones, with a strict focus on Availability, Operations, Performance, Capacity, Security, and User Experience
Technical Leadership: Provide input and expertise relating to cloud hosting solutions (full infrastructure design and management). Transform business requirements into scalable operational designs
Collaboration & Planning: Attend and provide input on product planning sessions with internal development teams. Act as an expert on Business System services to communicate the value of our platform
Automation & Documentation: Identify and implement automation solutions. Develop and maintain critical documentation, including architecture diagrams, service descriptions, build/deploy processes, and operations run books
Mentorship & Support: Provide technical escalation and mentoring to other team members. Train operations teams to provide Level 1/2 support for shared public cloud services, acting as the ultimate Level 3 escalation point
Standards & Governance: Manage AWS/Azure best practice expectations and ensure alignment with corporate standards
Global Collaboration: Work effectively within a global team framework. Strike a balance between Indian and US time zones to attend business stakeholder meetings, address production issues, and serve as a reliable escalation point (including off-hours tasks when necessary)

Fulltime

Staff Site Reliability Engineer

Fivetran is building data pipelines to power the modern data stack for thousands...

Location

United States, Oakland

Salary:

196033.00 - 245041.50 USD / Year

Fivetran

Expiration Date

Until further notice

Requirements

Expertise in managed Kubernetes (EKS, AKS, and GKE)
Expertise in cloud platforms and related tooling: AWS, Azure, GCP, Terraform, Ansible, Buildkite, Pulumi, and ArgoCD
Expertise in Python/Shell scripting
Expertise with Linux operating systems, internals, and administration
Expertise with cloud networking, such as VPNs, PrivateLink, and Private Service Connect (GCP)
Experience with databases such as PostgreSQL

Job Responsibility

Responsible for ongoing reliability and robustness of Fivetran's production infrastructure by monitoring availability, capacity, and throughput
Evolve systems by adding reliability into our product roadmap
Coordinate the reprioritization or fixing of critical bugs for support or sales requirements as needed
Recommend production infrastructure improvements by working with engineering to ensure 100% availability
Ensure scalable artifact deployment to all environments through automation scripts
Constantly monitor infrastructure vulnerabilities and remedy them by working with the security team

What we offer

100% employer-paid medical insurance
Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
RSU stock grants
Professional development and training opportunities
Company virtual happy hours, free food, and fun team-building activities
Monthly cell phone stipend
Access to an innovative mental health support platform that offers personalized care and resources in areas such as therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents

Fulltime

Staff Site Reliability Engineer

Fivetran is looking for a high-performance engineer to join a team of Site Relia...

Location

Serbia, Novi Sad

Salary:

Not provided

Fivetran

Expiration Date

Until further notice

Requirements

7+ years of experience working with SaaS platforms at scale
Expertise in managed Kubernetes (EKS, AKS, and GKE)
Knowledge of cloud platforms and related tooling: AWS, Azure, GCP, Terraform, Ansible, Buildkite, Pulumi, and ArgoCD
Experience in Python, Shell scripting, and Go
Experience with Linux operating systems, internals, and administration
Experience with cloud networking, such as managed NAT gateways, VPNs, PrivateLink, and Private Service Connect (GCP)
Experience with databases such as PostgreSQL

Job Responsibility

Responsible for the ongoing reliability and robustness of Fivetran’s production infrastructure by monitoring availability, capacity, and throughput
Collaborate with engineering teams to integrate reliability best practices into the product roadmap
Support the prioritization and resolution of critical bugs identified by support or sales
Contribute to maintaining the high reliability and availability of production infrastructure by collaborating with engineering to implement automation for scalable deployments
Ensure scalable artifact deployment to all environments through automation scripts
Proactively monitor infrastructure vulnerabilities and collaborate with the security team to promptly address them

What we offer

100% employer-paid medical insurance
Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
RSU stock grants
Professional development and training opportunities
Company virtual happy hours, free food, and fun team-building activities
Monthly cell phone stipend
Access to an innovative mental health support platform that offers personalized care and resources in areas such as therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents

Staff Site Reliability Engineer

Ever since we started in 2007, Sunrun has been at the forefront of connecting pe...

Location

United States, Lehi

Salary:

242050.00 USD / Year

Sunrun

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Information Systems, Software Engineering, or a closely related field
5 years of experience as a Software Developer using Microservices hosted in Azure
5 years of experience with Virtualization and cloud computing
5 years of experience with Object-Oriented Design (OOD) and Object-Oriented Programming (OOP)
5 years of experience building software solutions in an engineering environment using Python & Shell scripting
5 years of experience with network analysis, debugging, and troubleshooting using Wireshark and Git

Job Responsibility

Provide strategic leadership in designing, implementing, and managing the overall infrastructure strategy for our organization
Leverage cloud platforms (e.g., AWS, Azure) to design, deploy, and manage scalable infrastructure solutions
Spearhead the definition of advanced monitoring requirements and elevate SLAs
Collaborate with the engineering team and TPM to implement and enhance monitoring practices
Expertly convey intricate technical information to diverse stakeholders with clarity and precision
Provide leadership in integrating advanced SRE principles into applications and services
Lead the implementation of sophisticated system design measures for heightened security, performance, and resiliency
Develop notification strategies for production outages
Leverage SLOs and SLIs to measure and optimize availability, latency, and response time
Lead and strategize emergency response efforts, conduct retrospectives with RCA, and manage on-call workloads effectively

What we offer

Medical/Dental/Vision Insurance
Life Insurance
Disability Insurance
401k Plan + Company Match
Stock Purchase Plan
Paid Vacations/Holidays
Paid Baby Bonding Leave
Employee Discounts
PowerU - 100% Funded Education Programs
Employee Donation Matching

Fulltime

Staff Site Reliability Engineer

Join our Site Reliability Engineering (SRE) team and help ensure the reliability...

Location

United States

Salary:

220000.00 - 325000.00 USD / Year

Replit

Expiration Date

Until further notice

Requirements

8-10 years of experience in Site Reliability Engineering or similar roles (e.g., DevOps, Systems Engineering, Infrastructure Engineering)
Strong programming skills in languages like Python or Go
Deep understanding of distributed systems
Deep experience with container orchestration platforms, specifically Kubernetes, and cloud-native technologies
Proven track record of designing, implementing, and maintaining sophisticated monitoring and observability solutions
Strong incident management skills with extensive experience leading incident response for complex systems
Experience with infrastructure as code (e.g., Terraform, Pulumi) and configuration management tools
Excellent written and verbal communication skills
Strong interpersonal skills, with experience working with and mentoring engineers
A willingness to dive into understanding, debugging, and improving any layer of the stack

Job Responsibility

Architect and Implement Observability
Define and Drive Reliability Standards
Lead Incident Management and Response
Drive Automation and Infrastructure as Code
Optimize Performance on Kubernetes
Debug and Harden Distributed Systems
Provide Staff-Level Guidance
Educate and Mentor
Build and Integrate

What we offer

Competitive Salary & Equity
401(k) Program with a 4% match
Health, Dental, Vision and Life Insurance
Short Term and Long Term Disability
Paid Parental, Medical, Caregiver Leave
Commuter Benefits
Monthly Wellness Stipend
Autonomous Work Environment
In Office Set-Up Reimbursement
Flexible Time Off (FTO) + Holidays

Fulltime

Staff Site Reliability Engineer

Our Site Reliability Engineering team is growing, and we are looking for a highl...

Location

Finland, Helsinki

Salary:

Not provided

AlphaSense

Expiration Date

Until further notice

Requirements

8+ years of experience in Site Reliability Engineering, DevOps, or a similar role, with at least 3 of those years in a Senior or higher SRE position
Strong background in running production SaaS systems at scale
Proficiency in at least one programming/scripting language (Python, Go, or similar)
Hands-on expertise with cloud platforms (AWS, GCP, or Azure) and Kubernetes
Deep understanding of networking fundamentals (TCP/IP, DNS, HTTP/S, load balancing)
Experience with monitoring & alerting (Prometheus, Grafana, Datadog, ELK)
Familiarity with advanced observability (OpenTelemetry, continuous profiling)
Proven incident management experience, including leading high-severity incidents and postmortems
Strong troubleshooting skills across the full stack

Job Responsibility

Architect Reliability Paved Paths: Build frameworks and self-service tooling that let teams own the reliability of their services
Lead AI-Driven Reliability: Drive our AIOps strategy — automating diagnostics, remediation, and proactive failure prevention
Champion Reliability Culture: Embed SRE practices across engineering via design reviews, production readiness, and operational standards
Incident Leadership: Act as Incident Commander during critical events, modeling operational excellence, and ensuring blameless postmortems lead to lasting improvements
Advance Observability: Deliver end-to-end monitoring, tracing, and profiling (Prometheus, Grafana, OpenTelemetry, continuous profiling) to optimize performance proactively
Mentor & Multiply: Elevate engineers across SRE and product teams through mentorship, technical guidance, and knowledge sharing

Staff Engineer, Site Reliability Engineer

OnStar is a cornerstone of General Motors' connected services—bringing safety, s...

Location

Ireland, Dublin

Salary:

Not provided

General Motors

Expiration Date

Until further notice

Requirements

8+ years in SRE, DevOps, or systems engineering, including experience managing or mentoring high-impact teams
Track record of building and maintaining high-scale, cloud-native systems (preferably AWS, GCP, or Azure)
Expertise in container orchestration and deployment strategies using Kubernetes and CI/CD pipelines
Proficiency in Python, Go, or Java, with strong code review and readability standards
Experience leading cross-functional infrastructure projects, configuration strategy, or organizational tooling initiatives
Ability to think and act under pressure
Strong communication skills

Job Responsibility

Lead the design and implementation of scalable, fault-tolerant, and observable infrastructure supporting OnStar mobile and web experiences, in-vehicle services, and the backend platforms and integrations that power them
Champion configuration management, infrastructure refactoring, and testing frameworks to strengthen system resilience
Partner across SRE, development, and product teams to improve service reliability, deployment safety, and incident response practices
Drive internal consultation and strategic planning on reliability standards for new OnStar capabilities, customer-facing releases, and platform initiatives
Define and evolve observability strategy using tools such as Prometheus, Grafana, and Datadog, with automated alerting and actionable SLO dashboards
Own and improve on-call practices, manage blameless postmortems, and guide root cause analysis to eliminate recurring failures
Mentor engineers and help shape a high-performance culture rooted in extreme ownership and operational excellence
Support compliance and privacy-driven engineering initiatives across connected services, with potential crossover into areas like data retention and safety certification tooling

Fulltime

Staff Site Reliability Engineer - Incident Management & Reliability

We’re not just building better tech. We’re rewriting how data moves and what the...

Location

Canada

Salary:

225100.00 - 264500.00 CAD / Year

Confluent

Expiration Date

Until further notice

Requirements

10+ years of relevant experience in SRE, incident management, or reliability engineering
Cloud experience with at least one of AWS, GCP, or Azure
Experience navigating reliability/incident programs at 500+ engineer organizations
Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
Strong understanding of distributed systems and failure modes at scale
Deep experience with observability: metrics, logging, tracing
Kubernetes and container orchestration experience
Understanding of CI/CD pipelines and release processes
Strong written communication (design docs, runbooks, post-mortems)
Experience driving org-wide process and cultural changes

Job Responsibility

Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
Define and maintain SLO/SLA frameworks and use error budgets to guide reliability investments
Own standards, practices, and continuous improvement of incident response across engineering
Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
Develop and deliver training programs and coach teams through post-mortems
Partner with engineering leaders to elevate reliability practices org-wide

What we offer

Remote-First Work
Robust Insurance Benefits
Flexible Time Away
The Best Teammates
Experience Ambassadors
Open and Honest Culture
Well-Being and Growth
Offers Equity

Fulltime

