Staff Site Reliability Engineer Job at Affirm

Staff Site Reliability Engineer

Fivetran is building data pipelines to power the modern data stack for thousands...

Location

United States, Oakland

Salary:

196033.00 - 245041.50 USD / Year

Fivetran

Expiration Date

Until further notice

Requirements

Expertise in managed Kubernetes (EKS, AKS, and GKE)
Expertise in cloud platforms and related tooling: AWS, Azure, GCP, Terraform, Ansible, Buildkite, Pulumi, and ArgoCD
Expertise in Python/Shell scripting
Expertise with Linux operating systems, internals, and administration
Expertise with cloud networking such as VPNs, PrivateLink, and Private Service Connect (GCP)
Experience with databases such as PostgreSQL

Job Responsibility

Responsible for ongoing reliability and robustness of Fivetran's production infrastructure by monitoring availability, capacity, and throughput
Evolve systems by adding reliability into our product roadmap
Coordinate the re-prioritization or fixing of critical bugs for support or sales requirements as needed
Recommend improvements to production infrastructure, working with engineering to ensure 100% availability
Ensure scalable artifact deployment to all environments through automation scripts
Constantly monitor infrastructure vulnerabilities and remedy them by working with the security team

What we offer

100% employer-paid medical insurance
Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
RSU stock grants
Professional development and training opportunities
Company virtual happy hours, free food, and fun team-building activities
Monthly cell phone stipend
Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents

Fulltime

Staff Site Reliability Engineer

Fivetran is looking for a high-performance engineer to join a team of Site Relia...

Location

Serbia, Novi Sad

Salary:

Not provided

Fivetran

Expiration Date

Until further notice

Requirements

7+ years of experience working with SaaS platforms at scale
Expertise in managed Kubernetes (EKS, AKS, and GKE)
Knowledge of Cloud Platforms and related tooling: AWS, Azure, GCP, Terraform, Ansible, Buildkite, Pulumi, and ArgoCD
Experience in Python, Shell scripting, and Go
Experience with Linux operating systems, internals, and administration
Experience with cloud networking such as Managed NAT Gateways, VPNs, PrivateLink, and Private Service Connect (GCP)
Experience with databases such as PostgreSQL

Job Responsibility

Responsible for the ongoing reliability and robustness of Fivetran’s production infrastructure by monitoring availability, capacity, and throughput
Collaborate with engineering teams to integrate reliability best practices into the product roadmap
Support the prioritization and resolution of critical bugs identified by support or sales
Contribute to maintaining the high reliability and availability of production infrastructure by collaborating with engineering to implement automation for scalable deployments
Ensure scalable artifact deployment to all environments through automation scripts
Proactively monitor infrastructure vulnerabilities and collaborate with the security team to promptly address them

What we offer

100% employer-paid medical insurance
Generous paid time-off policy (PTO), plus paid sick time, inclusive parental leave policy, holidays, and volunteer days off
RSU stock grants
Professional development and training opportunities
Company virtual happy hours, free food, and fun team-building activities
Monthly cell phone stipend
Access to an innovative mental health support platform that offers personalized care and resources in areas such as: therapy, coaching, and self-guided mindfulness exercises for all covered employees and their covered dependents

Staff Site Reliability Engineer

As a Staff Site Reliability Engineer, you will be a technical leader and strateg...

Location

Singapore; Australia, Singapore; Melbourne

Salary:

Not provided

Airwallex

Expiration Date

Until further notice

Requirements

10+ years of experience in SRE, DevOps, or infrastructure engineering roles, with progressive responsibility
Proven ability to lead SRE strategy and execution for large-scale, complex, cross-functional projects
Deep expertise with cloud platforms (AWS/GCP), Kubernetes, container orchestration, observability, and incident response frameworks
Strong experience supporting production systems with stringent high availability, compliance, and security requirements
Demonstrated leadership in mentoring and growing technical teams
Excellent collaboration and communication skills, able to influence stakeholders at all levels
Degree in Computer Science or related field

Job Responsibility

Drive the strategic vision and roadmap for Site Reliability Engineering at Airwallex, aligned with business objectives and product goals
Architect and oversee the implementation of highly scalable, secure, and resilient cloud infrastructure for new services and platform-wide initiatives
Lead and mentor senior engineers and cross-functional teams in reliability engineering best practices, automation, and incident management
Champion and evolve operational excellence through advanced observability, SLO management, runbooks, and proactive risk mitigation
Lead incident response for high-severity incidents, facilitating post-mortems and driving continuous improvements
Collaborate closely with Product, Engineering, Security, and DevOps leadership to ensure compliance, resilience, and alignment across functions
Influence and shape engineering culture around reliability, scalability, and DevOps principles across multiple teams
Advocate for innovation in tooling, automation, and infrastructure to improve developer productivity and service uptime

Fulltime

Staff Site Reliability Engineer

Ever since we started in 2007, Sunrun has been at the forefront of connecting pe...

Location

United States, Lehi

Salary:

242050.00 USD / Year

Sunrun

Expiration Date

Until further notice

Requirements

Bachelor’s in Computer Information Systems, Software Engineering, or a closely related field
5 years of experience as a Software Developer using Microservices hosted in Azure
5 years of experience with Virtualization and cloud computing
5 years of experience with Object-Oriented Design (OOD) and Object-Oriented Programming (OOP)
5 years of experience building software solutions in an engineering environment using Python & Shell scripting
5 years of experience with Network analysis, debugging and troubleshooting with Wireshark & Git

Job Responsibility

Provide strategic leadership in designing, implementing, and managing the overall infrastructure strategy for our organization
Leverage cloud platforms (e.g., AWS, Azure) to design, deploy, and manage scalable infrastructure solutions
Spearhead the definition of advanced monitoring requirements and elevate SLAs
Collaborate with the engineering team and TPM to implement and enhance monitoring practices
Expertly convey intricate technical information to diverse stakeholders with clarity and precision
Provide leadership in integrating advanced SRE principles into applications and services
Lead the implementation of sophisticated system design measures for heightened security, performance, and resiliency
Develop notification strategies for production outages
Leverage SLOs and SLIs to measure and optimize availability, latency, and response time
Lead and strategize emergency response efforts, conduct retrospectives with RCA, and manage on-call workloads effectively

What we offer

Medical/Dental/Vision Insurance
Life Insurance
Disability Insurance
401k Plan + Company Match
Stock Purchase Plan
Paid Vacations/Holidays
Paid Baby Bonding Leave
Employee Discounts
PowerU - 100% Funded Education Programs
Employee Donation Matching

Fulltime

Staff Site Reliability Engineer

Join our Site Reliability Engineering (SRE) team and help ensure the reliability...

Location

United States

Salary:

220000.00 - 325000.00 USD / Year

Replit

Expiration Date

Until further notice

Requirements

8-10 years of experience in Site Reliability Engineering or similar roles (e.g., DevOps, Systems Engineering, Infrastructure Engineering)
Strong programming skills in languages like Python or Go
Deep understanding of distributed systems
Deep experience with container orchestration platforms, specifically Kubernetes, and cloud-native technologies
Proven track record of designing, implementing, and maintaining sophisticated monitoring and observability solutions
Strong incident management skills with extensive experience leading incident response for complex systems
Experience with infrastructure as code (e.g., Terraform, Pulumi) and configuration management tools
Excellent written and verbal communication skills
Strong interpersonal skills, with experience working with and mentoring engineers
A willingness to dive into understanding, debugging, and improving any layer of the stack

Job Responsibility

Architect and Implement Observability
Define and Drive Reliability Standards
Lead Incident Management and Response
Drive Automation and Infrastructure as Code
Optimize Performance on Kubernetes
Debug and Harden Distributed Systems
Provide Staff-Level Guidance
Educate and Mentor
Build and Integrate

What we offer

Competitive Salary & Equity
401(k) Program with a 4% match
Health, Dental, Vision and Life Insurance
Short Term and Long Term Disability
Paid Parental, Medical, Caregiver Leave
Commuter Benefits
Monthly Wellness Stipend
Autonomous Work Environment
In Office Set-Up Reimbursement
Flexible Time Off (FTO) + Holidays

Fulltime

Staff Site Reliability Engineer

We are looking for a Site Reliability Engineer to own our internal systems infra...

Location

United States, Sunnyvale

Salary:

175000.00 - 250000.00 USD / Year

Figure

Expiration Date

Until further notice

Requirements

Strong experience with Linux/Unix systems administration
Proficiency in programming/scripting
Extensive experience with cloud platforms (Azure, AWS, GCP) and on-prem hardware architectures
Experience designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
Mastery of infrastructure as code (Terraform, CloudFormation, Ansible…)
Familiarity with monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
Experience defining Service Level Objectives (SLO), developing runbooks/incident response plans, facilitating post-mortems and managing systems assets
Ability to work in cross-functional teams with developers, infra, and product teams
Excellent verbal and written communication skills

Job Responsibility

Be the go-to person for mission-critical infrastructure enabling critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing and more
Migrate SaaS to self-hosted solutions to enhance security and reliability
Implement monitoring and alerting systems, and define incident response plans and runbooks
Reduce human workload by automating deployment and scaling
Establish strong relationships with stakeholders to identify infrastructure needs and establish Service Level Objectives
Use a data-driven approach to demonstrate service robustness and track optimization work
Partner with the security team to ensure that security remediations and updates are applied in a timely manner

Fulltime

Staff Site Reliability Engineer

Site Reliability Engineering at Affirm is a small, yet crucial, team that helps ...

Location

Poland

Salary:

358000.00 - 458000.00 PLN / Year

Affirm

Expiration Date

Until further notice

Requirements

8+ years of experience designing, developing, advocating for (as a subject-matter point of reference), and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
Extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark and Kubernetes
Track record of managing, driving and improving the Incident Lifecycle process from live incident management through retrospective and post-incident analysis to provide actionable insights to enhance overall system reliability, resilience, and performance
7+ years of experience in Site Reliability or Production Engineering teams
Demonstrate curiosity with empathy, and strong opinions loosely held
Experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan
Write high quality code that is easily understood and used by others
Thrive in ambiguity, and are comfortable moving from low level language idioms all the way to the architecture of large systems to understand how they work
Growth and impact trajectory demonstrates that you have mastered gathering and iterating on feedback from your engineering and cross-functional peers
Strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization

Job Responsibility

Set the technical strategy and vision for your team on a multi-year time scale, and help your team tie it together with critical, business-impacting projects
Collaborate across teams in the product development lifecycle by working with infrastructure, product management, developer experience & analytics to ensure technical sustainability, risks and trade-offs are well understood and managed
Act as a force-multiplier for your team through your definition and advocacy of technical solutions and operational processes
Take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing and alerting in place to support “keep the lights on” & on-call efforts
Foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
Help develop talent on your team by providing feedback and guidance, and leading by example

What we offer

Flexible Spending Wallets for tech, food and lifestyle
Away Days - wellness days to take off work and recharge
Learning & Development programs
Parental leave
Employee Resource & Community Groups
Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount

Fulltime

Staff Site Reliability Engineer

We’re seeking a Staff Site Reliability Engineer to serve as a technical leader w...

Location

United States

Salary:

151040.00 - 188800.00 USD / Year

Bugcrowd

Expiration Date

Until further notice

Requirements

5+ years of experience in SRE, DevOps, or systems engineering, with demonstrated technical leadership
Expert-level knowledge of Terraform, including module design, state management, and scaling IaC across teams
Deep expertise in AWS architecture and services at scale, with strong focus on ECS
Proven experience designing and operating containerized workloads on ECS, including capacity planning, service scaling, and task placement strategies
Strong experience designing and implementing CI/CD systems with GitHub Actions or similar tools
Track record of leading complex, cross-team technical initiatives
Advanced proficiency in Python, Ruby, Javascript, or similar languages
Strong understanding of distributed systems principles
Excellent written and verbal communication skills
Proven ability to balance long-term technical strategy with immediate operational needs

Job Responsibility

Define and drive the technical vision for infrastructure reliability across the organization
Architect large-scale, fault-tolerant systems on AWS using Terraform
Lead cross-functional initiatives to improve system reliability, scalability, and efficiency
Establish standards for infrastructure-as-code, CI/CD, and deployment practices
Design and implement solutions for our most complex operational challenges
Lead incident response for critical outages and drive systemic improvements
Mentor senior engineers and help grow the SRE team’s capabilities
Evaluate and introduce new technologies that improve operational excellence
Influence engineering culture around reliability, observability, and operational maturity

What we offer

Discretionary bonus program or commission plan

Fulltime
