Senior Site Reliability Engineer

Checkr

Location:
United States, Denver

Category:
IT - Software Development

Contract Type:
Employment contract

Salary:

138000.00 - 191000.00 USD / Year

Job Description:

As a Senior Site Reliability Engineer on the Platform team, you will identify issues and technical challenges across the engineering teams and platforms, develop innovative solutions to resolve them, and drive their adoption. You will strive towards the right balance between enforcing standardization and accommodating tailored workflows. The person in this role will have the opportunity to demonstrate a degree of autonomy in their work and help with complex support requests, while having an impact across engineering.

Job Responsibility:

  • Collaborate, drive, and execute architectural discussions with cross-functional teams
  • Lead cross-team projects and SREs' technical roadmap to enable engineering and help Checkr customers
  • Design, build, ship, and maintain the core observability libraries, tools, and patterns used by all of Checkr’s engineering teams
  • Proactively engage across teams to foster service reliability, efficiency, and scalability
  • Troubleshoot complex production issues across the stack, with respect to performance, availability, and data quality
  • Present detailed technical information and benefits of the Checkr platform to a wide array of customers, including operations, developers, technical architects, and executives

Requirements:

  • Degree in Computer Science (or related field)
  • 6+ years of experience in building tools with Python (preferred), GoLang, or Ruby
  • 6+ years of experience in maintaining and observing production customer-facing environments in AWS or Azure
  • 6+ years of experience as a member of an incident response team
  • Deep understanding of the fundamental infrastructure and platform concepts behind a micro-service architecture, REST APIs, and asynchronous queueing models
  • Experience with observability platforms and frameworks like Datadog, Splunk, Grafana, Prometheus, or OpenTelemetry
  • Strong collaboration, documentation, communication, and project management skills
  • Experience with container orchestration using Kubernetes/Docker/Terraform
  • Experience driving platform adoption across engineering teams, guided by a self-service and product-first approach
  • A passion for customer-centricity and building relationships with other teams
  • Unwavering commitment to operational security and best practices
What we offer:
  • A fast-paced and collaborative environment
  • Learning and development allowance
  • Competitive cash and equity compensation and opportunities for advancement
  • 100% medical, dental, and vision coverage
  • Up to $25K reimbursement for fertility, adoption, and parental planning services
  • Flexible PTO policy
  • Monthly wellness stipend, home office stipend
  • In-office perks such as lunch four times a week, commuter stipend, and an abundance of snacks and beverages

Additional Information:

Job Posted:
September 26, 2025

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Senior Site Reliability Engineer

Senior AI Site Reliability Engineer

At Schwab, you will build a rewarding career while making a difference in the li...
Location
United States, San Francisco
Salary:
190000.00 - 270000.00 USD / Year
Charles Schwab
Expiration Date
January 20, 2026
Requirements
  • 8+ years of software development or reliability engineering experience, with 4+ years as a hands-on senior engineer in startups and/or large organizations
  • Bachelor’s degree in Computer Science or related field
  • 5+ years of experience building and operating complex products from scratch and running them in production
  • 3+ years of experience supporting applications that use Artificial Intelligence (AI) models to deliver real business impact
  • 3+ years of experience building and maintaining data pipelines and infrastructure for large datasets
  • 3+ years of experience with containers and cloud-native applications, and the ability to operationalize them in the public cloud with infrastructure as code
  • Experience implementing monitoring, alerting, and incident response for large-scale distributed systems
  • Proven track record in driving reliability, scalability, and performance improvements for production AI systems
Job Responsibility
  • Design, implement, and manage the reliability and operational excellence of GenAI applications and platforms
  • Work closely with architects, engineers, and business leaders to align reliability practices with Schwab’s enterprise strategy
  • Mentor and coach junior engineers, helping to build strong operational practices and foster a culture of continuous improvement
  • Lead by example in solving complex reliability challenges, advancing SRE standards, and driving rapid iteration from concept to production
What we offer
  • 401(k) with company match and Employee stock purchase plan
  • Paid time for vacation, volunteering, and 28-day sabbatical after every 5 years of service for eligible positions
  • Paid parental leave and family building benefits
  • Tuition reimbursement
  • Health, dental, and vision insurance
  • Bonus or incentive opportunities
  • Full-time

Senior Site Reliability Engineer

Affirm is reinventing credit to make it more honest and friendly, giving consume...
Location
Spain
Salary:
85000.00 - 115000.00 EUR / Year
Affirm
Expiration Date
Until further notice
Requirements
  • 4+ years of experience designing, developing and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
  • A track record of developing highly available distributed systems using technologies like AWS, MySQL and Kubernetes
  • Meaningful experience contributing to or driving parts of the Incident Lifecycle process, enabling actionable insights that improve the quality culture, reliability, resilience, and system performance
  • 4+ years working in a Site Reliability or Production Engineering team
  • Experience defining a technical plan for the delivery of a significant feature or system component with an elegant, simple and extensible design
  • Experience making impactful changes in a large code base, supported by a suite of tools and practices you have developed that enable you and your team to do so safely
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team
  • On-Call Rotation - There would be an on-call rotation for this role as a requirement
Job Responsibility
  • You will be responsible for owning and delivering quarterly goals for your team, leading engineers on your team through ambiguity to solve open-ended problems, and ensuring that everyone is supported throughout delivery
  • You will support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience & analytics by participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs
  • You will proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis
  • You will support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” & on-call efforts
  • You will foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • You will help develop talent on your team by providing feedback and guidance, and leading by example
What we offer
  • Flexible Spending Wallets for tech, food and lifestyle
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental benefits
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Full-time

Senior Site Reliability Engineer

Site Reliability Engineering at Affirm is a small, yet crucial, team that helps ...
Location
Poland
Salary:
301000.00 - 401000.00 PLN / Year
Affirm
Expiration Date
Until further notice
Requirements
  • 4+ years of experience designing, developing and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
  • A track record of developing highly available distributed systems using technologies like AWS, MySQL and Kubernetes
  • Meaningful experience contributing to or driving parts of the Incident Lifecycle process, enabling actionable insights that improve the quality culture, reliability, resilience, and system performance
  • 4+ years working in a Site Reliability or Production Engineering team
  • Experience defining a technical plan for the delivery of a significant feature or system component with an elegant, simple and extensible design
  • Experience making impactful changes in a large code base, supported by a suite of tools and practices you have developed that enable you and your team to do so safely
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team
Job Responsibility
  • Own and deliver quarterly goals for your team, lead engineers on your team through ambiguity to solve open-ended problems, and ensure that everyone is supported throughout delivery
  • Support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience & analytics by participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs
  • Proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis
  • Support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” & on-call efforts
  • Foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • Help develop talent on your team by providing feedback and guidance, and leading by example
What we offer
  • Flexible Spending Wallets for tech, food and lifestyle
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental benefits
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Full-time

Senior Site Reliability Engineer Cloud Platform

Zilliz is a fast-growing startup developing the industry’s leading vector databa...
Salary:
175000.00 - 225000.00 USD / Year
Zilliz
Expiration Date
Until further notice
Requirements
  • 4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems
  • Proficiency in scripting languages such as Python, Go, or Java
  • Strong knowledge of container orchestration technologies like Kubernetes and Docker
  • Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools
  • Experience with infrastructure as code tools such as Terraform or Ansible
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
  • Proven ability to troubleshoot complex distributed systems and resolve issues promptly
  • Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines
  • Ability to thrive in a fast-paced, startup environment and handle multiple projects simultaneously
Job Responsibility
  • Work at the intersection of development and site reliability, creating SRE tools and systems and supporting existing infrastructure and platforms
  • Ensure the reliability, availability, and performance of Zilliz’s distributed database systems
  • Develop and implement strategies for monitoring, incident management, and disaster recovery
  • Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention
  • Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness
  • Collaborate with software engineers to enhance system reliability, scalability, and performance
  • Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes
  • Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency
  • Full-time

Senior Site Reliability Engineer

You'll join the team primarily responsible for making our self-hosted product of...
Location
United States
Salary:
200000.00 - 220000.00 USD / Year
Tines
Expiration Date
Until further notice
Requirements
  • 5-8 years in an SRE or similar role
  • Experience architecting, maintaining, and supporting systems with containerized applications, ideally k8s
  • Experience with troubleshooting deployment issues, creating clear documentation, and designing robust escalation paths
  • Comfortable learning new technologies
  • Experience with Ruby, Rails, React, TypeScript, Postgres, Redis and Docker
  • Customer obsessed and willing to go deep into unfamiliar stacks to find root causes
  • Authorized to work for any employer in the U.S.
Job Responsibility
  • Making our self-hosted product offering as easy as possible for customers to install and operate
  • Owning all of the supporting services and tools that our self-hosted customers rely on
  • Identifying and fixing availability risks and monitoring gaps
  • Enabling software engineers to build new product features that work seamlessly across cloud and self-hosted environments
  • Using our own product extensively to automate infrastructure maintenance and to build DevOps tooling for customer deployments
  • Identifying areas for improvement in our containerized architecture and deployment strategies
  • Mentoring other engineers in container orchestration and Kubernetes best practices
  • Acting as a subject matter expert for critical self-hosted customer issues
What we offer
  • Competitive salary
  • Startup equity & extended exercise window
  • Matching retirement plans
  • Home office setup
  • Private healthcare plans
  • 25 days annual leave
  • Extra company holidays
  • Generous parental leave programs
  • Flexibility in how and where you work
  • Phone and home Internet allowance
  • Full-time

Senior Site Reliability Engineer

As a Site Reliability Engineer, you will focus on ensuring that the Prolific pla...
Location
United Kingdom
Salary:
Not provided
Prolific
Expiration Date
Until further notice
Requirements
  • 5+ years with Google Cloud Platform, GKE, and the Kubernetes ecosystem with experience with Terraform and Terragrunt
  • Strong programming skills in Python
  • Strong experience in observability principles and tooling
  • Experience in GitOps flows and platforms for Kubernetes, such as ArgoCD
  • Deep understanding of system architecture and scalability principles
  • Strong collaboration and communication skills to work with cross-functional teams
Job Responsibility
  • Develop and maintain highly available infrastructure using modern infra-as-code techniques, with a focus on Terragrunt and Terraform
  • Manage and optimise Kubernetes clusters and their workloads with a focus on reliability and performance
  • Participate in incident response and remediation, working with relevant product teams and stakeholders to resolve production issues efficiently, including creating and maintaining runbooks
  • Review and optimise other areas of our tooling stack, such as CI/CD or release strategies
  • Foster a culture of continuous improvement, such as enhancing documentation and upskilling teams in cloud architecture and Kubernetes
  • Improve observability and alerting systems across our application and infrastructure, ensuring proactive detection of system degradation
  • Collaborate with Engineering teams to foster an SRE culture, including contributing to defining SLOs, SLAs, and error budgets
  • Design and implement automation strategies to ensure managed services remain up-to-date, secure, and performant
  • Lead and support initiatives that automate processes to improve system efficiency and resilience and reduce toil
  • Organise, support, and respond to on-call incidents
What we offer
  • Competitive salary
  • Benefits
  • Remote working
  • Impactful, mission-driven culture
  • Full-time

Senior Site Reliability Engineer

What will you be doing at Miniclip? Participate in an on-call rotation with the ...
Location
Portugal, Lisbon
Salary:
Not provided
Miniclip
Expiration Date
Until further notice
Requirements
  • 5+ years of hands-on experience with AWS in both development and operations contexts
  • Strong Linux system administration skills, including performance tuning and debugging
  • Software development background and strong coding skills in one or more of the following: Go, Python, Ruby
  • Experience with Infrastructure as Code, particularly Terraform
  • Familiarity with CI/CD pipelines and artifact management tools
  • A mindset for resilient systems design, thinking about edge cases, failure modes, and graceful degradation
  • Excellent communication skills in English, both written and spoken
  • Comfortable in a fast-paced environment and adaptable to shifting priorities
Job Responsibility
  • Participate in an on-call rotation with the Cloud Engineering team to respond to production incidents and outages
  • Operate and evolve infrastructure using Infrastructure as Code (Terraform), configuration management tools, and containerized platforms on AWS
  • Build and maintain observability tooling to detect symptoms before they lead to outages
  • Automate repetitive tasks and processes to reduce operational toil
  • Collaborate with Engineering and Product teams to design resilient systems that meet performance and reliability goals
  • Troubleshoot production issues across application, network, and infrastructure layers
  • Document systems, processes, and runbooks to improve team transparency and onboarding

Senior Site Reliability Engineer

HiveWatch is seeking a Staff Site Reliability Engineer to join our Platform Team...
Location
United States, El Segundo
Salary:
183000.00 - 235000.00 USD / Year
HiveWatch
Expiration Date
Until further notice
Requirements
  • 7+ years of software engineering experience with strong coding skills in production environments
  • 5+ years of SRE, DevOps, or production operations experience
  • Expertise with cloud platforms (AWS preferred) and containerized applications (Docker, Kubernetes)
  • Experience with Infrastructure as Code (Terraform, CloudFormation, or similar)
  • Proficiency in at least one object oriented programming language in our tech stack (Java, Kotlin, Python)
  • Hands-on experience with relational databases and SQL performance optimization
  • Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, or equivalent)
  • Strong debugging skills across distributed systems and microservices architectures
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
Job Responsibility
  • Own the reliability of mission-critical systems including production monitoring, alerting, and capacity planning
  • Debug and resolve complex production issues across the full stack, from infrastructure to application code
  • Participate in a regular on-call rotation to provide 24/7 coverage for critical systems
  • Perform root cause analysis requiring deep code-level investigation and implement preventive measures
  • Build automation and tooling to reduce operational toil and improve system reliability
  • Maintain CI/CD pipelines, observability infrastructure, and database performance optimization
  • Increase the resiliency, scalability, and maintainability of production environments
  • Establish on-call procedures and disaster recovery processes
  • Provide technical leadership and mentorship to foster engineering excellence and reliability culture
What we offer
  • Comprehensive health coverage: medical, dental, vision, and life insurance
  • Cutting-edge work in an emerging field with huge growth potential
  • Competitive compensation packages designed to reward top talent
  • A modern, newly renovated HQ right on Main Street in El Segundo, CA
  • 401(k) with a 4% company match to help you invest in your future (match launches in 2026)
  • Flexible paid time off so you can recharge when you need it
  • Additional benefits include ClassPass credits and a discount on pet insurance
  • A family-friendly, compassionate culture that values balance and belonging
  • Eligible to participate in HiveWatch Equity Incentive Plan
  • Full-time