
Site Reliability Engineer

United States, Miami · Job Posted May 16, 2026

Job Description

Beacon Hill is looking for an SRE to support one of our clients. In this senior-level role, you will ensure the reliability, performance, and operational excellence of large-scale digital commerce platforms, with a focus on GraphQL/RESTful services. You will lead incident response, root‑cause analysis, and monitoring efforts while supporting production environments across web, mobile, and customer‑facing applications. The position requires strong SRE experience, deep knowledge of observability tools such as Splunk and Dynatrace, AWS cloud exposure, and the ability to collaborate closely with engineering teams to enhance system stability and performance. You will also analyze Voice of Customer feedback, work within CI/CD environments, and remain open to learning new AI tools and emerging technologies. A background in software engineering, Java, or GraphQL is highly valuable, and certifications in SRE or related operations disciplines are a plus. This role is ideal for someone with a strong technical foundation, a proactive mindset, and a willingness to continuously learn and adapt.

Job Responsibility

  • Ensure the reliability, performance, and operational excellence of large-scale digital commerce platforms, with a focus on GraphQL/RESTful services
  • Lead incident response, root‑cause analysis, and monitoring efforts while supporting production environments across web, mobile, and customer‑facing applications
  • Analyze Voice of Customer feedback
  • Work within CI/CD environments
  • Remain open to learning new AI tools and emerging technologies

Requirements

  • 8+ years of experience in Site Reliability Engineering
  • Strong SRE background
  • Proficiency with GraphQL services or strong understanding of RESTful services
  • Experience supporting digital applications (web, mobile, customer‑facing platforms)
  • Cloud exposure, specifically AWS
  • Strong observability and monitoring experience (e.g., Splunk, Dynatrace)
  • Ability to perform incident response, root‑cause analysis, and production support
  • Understanding of CI/CD pipelines (no need to build, but must understand workflows)
  • Ability to analyze VOC (Voice of Customer) feedback and translate into operational insights
  • Strong communication skills and ability to partner with engineering teams
  • Openness to learning new AI tools and evolving technologies
  • Positive attitude, adaptability, and willingness to learn, which are valued more than tool‑by‑tool expertise

Nice to have

  • SRE or operations‑related certifications
  • Background in software engineering
  • Experience optimizing digital commerce platforms
  • Familiarity with automation and observability best practices
  • Experience supporting GraphQL‑based APIs at scale
  • Exposure to digital commerce ecosystems (web, mobile, customer interaction flows)
  • Experience working in environments focused on reliability, performance, and operational excellence

Similar Jobs for Site Reliability Engineer

8 matching positions

Site Reliability Engineer

Join us as a Site Reliability Engineer at Barclays, where you will play a pivota...
Location: United Kingdom, Glasgow
Salary: Not provided
Barclays (barclays.co.uk)
Expiration Date: Until further notice
Requirements
  • Experience with Linux/Unix, Java applications, Oracle, SQL (PL/SQL, MySQL), and RDBMS, including writing queries, tuning concepts, and data analysis.
  • Experience with at least two of the following middleware technologies: JBoss, Apache Tomcat, GlassFish.
  • Experience with cloud technologies such as AWS and OpenShift/APIs.
  • Experience with monitoring tools (AppDynamics, ITRS) and scheduling tools (TWS and/or Autosys).
Job Responsibility
  • Apply software engineering techniques, automation, and incident-response best practices to ensure the reliability, availability, and scalability of systems, platforms, and technology.
  • Maintain the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning.
  • Respond to, analyse, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring.
  • Develop tools and scripts that automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience.
  • Monitor and optimise system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning.
  • Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations.
  • Stay informed of industry technology trends and innovations, and actively contribute to the organization's technology communities to foster a culture of technical excellence and growth.
What we offer
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution
  • Fulltime

Site Reliability Engineer

We are looking for an experienced Site Reliability Engineer to strengthen the re...
Location: United States, San Francisco
Salary: Not provided
Robert Half (https://www.roberthalf.com)
Expiration Date: Until further notice
Requirements
  • 7+ years of experience in Site Reliability Engineering, infrastructure engineering, or a closely related production environment role
  • Strong background operating, supporting, and troubleshooting distributed systems at scale
  • Hands-on experience with observability platforms such as Datadog, Grafana, OpenTelemetry, CloudWatch, or similar tools
  • Proven involvement in on-call operations, incident management, and reliability-focused problem resolution
  • Practical experience defining and using SLIs, SLOs, and error budgets to guide service reliability decisions
  • Familiarity with AWS environments, including serverless and container-based architectures
  • Experience working with relational databases such as Postgres and performance analysis in production systems
  • Ability to write automation scripts or lightweight tooling in languages such as Python or Bash, with strong judgment around failure modes and system design
Job Responsibility
  • Establish measurable reliability standards for critical services by creating and maintaining service indicators, objectives, and error budget practices
  • Take ownership of production stability by monitoring uptime, latency, and availability, and driving improvements that reduce operational risk
  • Lead live incident response efforts, coordinate troubleshooting during outages, and ensure issues are resolved efficiently and thoroughly
  • Run blameless post-incident reviews, document findings clearly, and track corrective actions through completion
  • Design and enhance observability across logs, metrics, and distributed tracing using tools such as Datadog, CloudWatch, Grafana, OpenTelemetry, and Sentry
  • Improve alert quality and dashboard design so engineering teams can quickly identify meaningful system issues without unnecessary noise
  • Evaluate system behavior under load, uncover performance constraints, and recommend changes that improve scalability and resource efficiency
  • Build automation and internal tooling that streamline operational work, strengthen deployment safety, and support incident management, debugging, and capacity planning
  • Contribute to infrastructure and delivery workflows across AWS, Terraform, Ansible, Linux, and GitHub Actions with a focus on dependable releases and resilient systems
  • Partner with security and compliance stakeholders to support operational standards, audit readiness, and the integration of monitoring into broader engineering practices
What we offer
  • Medical, vision, dental, and life and disability insurance
  • Enrollment in company 401(k) plan

Site Reliability Engineer

We’re hiring an SRE to help improve the availability, performance, scalability, ...
Location: Israel, Netanya/Tel Aviv
Salary: Not provided
JFrog (jfrog.com)
Expiration Date: Until further notice
Requirements
  • 2-4 years of experience in SRE, Production Engineering, DevOps, or a similar role with hands-on production exposure
  • Strong troubleshooting and analytical skills, with the ability to investigate production issues in a structured and methodical way
  • Hands-on experience with Kubernetes-based containerized workloads
  • Experience with at least one public cloud provider: AWS, GCP, or Azure
  • Experience developing backend services, internal platforms, automation, or production engineering tools using Python, Go, or another programming language
  • Practical understanding of Linux fundamentals, networking concepts, HTTP, DNS, service connectivity, and production troubleshooting
  • Familiarity with CI/CD tools such as Jenkins, ArgoCD, GitHub Actions, or similar
  • Exposure to observability tools covering metrics, logs, and traces, such as Prometheus, Grafana, Coralogix, New Relic, or similar platforms
  • Understanding of incident management processes, alerting systems, and production support workflows
  • Ability to learn quickly, take ownership, communicate clearly, and work well in a collaborative production environment
Job Responsibility
  • Support the reliability, availability, performance, and scalability of JFrog’s large-scale, multi-cloud, Kubernetes-based SaaS environments
  • Investigate and troubleshoot production issues across distributed systems, infrastructure, Kubernetes, and cloud environments in close collaboration with Engineering teams
  • Design and develop backend services, internal platforms, and production engineering tools using Python, Go, or similar technologies
  • Improve reliability, observability, and operational readiness through SRE practices, monitoring and alerting, capacity awareness, postmortems, and safer CI/CD and production change processes
  • Evaluate and contribute to AI-assisted and agentic automation solutions that improve operational efficiency, troubleshooting, and production workflows
  • Support resilience initiatives, including disaster recovery validation, service readiness, health checks, and production readiness reviews
  • Participate in on-call rotations, lead incident response when needed, and drive follow-up actions to prevent recurrence
  • Continuously learn and evaluate new technologies that can improve reliability, automation, and operational excellence
  • Fulltime

Site Reliability Engineer

Microsoft Substrate is the foundational cloud platform that powers many of Micro...
Location: United States, Redmond
Salary: 102100.00 - 202200.00 USD / Year
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • 4+ years technical experience in software engineering, network engineering, or systems administration
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Ability to obtain and maintain favorably adjudicated Tier 3 (T3) background investigation
  • Ability to meet Criminal Justice Information Services (CJIS) eligibility requirements
  • Must pass Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility
  • Own reliability and operational health for one or more Substrate components or services in highly regulated environments
  • Serve as an actively engaged on-call engineer (OCE), participating in an on-call rotation and independently responding to incidents for owned services
  • Respond to, diagnose, and resolve production incidents with minimal supervision
  • Design and implement automation to reduce operational toil and improve service stability
  • Develop and maintain monitoring, alerting, and telemetry to support SLOs and operational metrics
  • Lead post-incident reviews for owned incidents, focusing on root cause analysis and durable fixes
  • Collaborate with software engineering teams to embed reliability and operability into service design
  • Write and maintain production-quality code and automation that improves reliability, scalability, and operational efficiency
What we offer
  • Candidates may be eligible for benefits and other compensation
  • Additional benefits and pay information is available at https://careers.microsoft.com/us/en/us-corporate-pay
  • Fulltime

Site Reliability Engineer

An Elite FinTech Firm is looking for a highly talented DevOps Engineer/Systems S...
Location: Canada, Montreal
Salary: 200000.00 CAD / Year
Hunter Bond (hunterbond.com)
Expiration Date: Until further notice
Requirements
  • Genuine passion for Linux and open source
  • Excellent knowledge of Python
  • Use of CI/CD, Docker, Ansible, Chef, Puppet
  • Knowledge of large-scale storage systems (on-prem)
Job Responsibility
  • Help architect resilient, multi-petabyte storage solutions and build new data centres
  • Automate anything and everything with Python & config tools
  • Innovate whilst bringing in new ideas
What we offer
  • Flexible hours/work options
  • Working in one of the world’s most elite teams
  • Invest heavily in cutting-edge and next-gen tech
  • Technologists only report to other technologists
  • Brand new skyline Manhattan office
  • Start-up style environment
  • Fulltime

Site Reliability Engineer

We are currently seeking a Site Reliability Engineer to join our team in Guadala...
Location: Mexico, Guadalajara
Salary: Not provided
NTT DATA (nttdata.com)
Expiration Date: Until further notice
Requirements
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
  • Understand the Microsoft Azure Cloud - ideally Azure Fundamentals certified OR Computer Science/Information Systems Management degree
  • Familiar with PaaS and IaaS - VMs, Storage, EventHub, Service Fabric Cluster (SFC), Azure Kubernetes Service (AKS), CosmosDB, SQL Server, IoT Hub, Databricks, KeyVault, Datalake
  • Understand the concept of Internet of Things (IoT) - telemetry, ingestion, processing, data storage, reporting
  • Understand the concepts behind tools such as Octopus, Bamboo, Terraform, Azure DevOps, Jenkins, GitHub, Ansible
  • Understand the concept of container orchestration platforms (e.g. Kubernetes)
  • Understand scripting concepts: PowerShell, Python
  • Understand the difference between NoSQL and SQL databases, and how to maintain them
Job Responsibility
  • Perform L1.5 activities such as monitoring, deployment, rollback
  • Monitor the efficiency of the Azure cloud systems to prevent outages and initiate an Incident Management bridge in case of an outage
  • Troubleshoot Azure resources, escalate to Level 3 (Software Development Team)
  • Fulltime

Site Reliability Engineer

We are looking for a Lead Site Reliability Engineer (SRE) with strong experience...
Location: India, Bangalore
Salary: Not provided
Karix (karix.com)
Expiration Date: Until further notice
Requirements
  • 5+ years of experience in SRE / DevOps / Production Engineering roles
  • Strong expertise in troubleshooting distributed systems and microservices architecture
  • Hands-on experience with Kafka, RabbitMQ, and Redis
  • Strong knowledge of Kubernetes and container orchestration
  • Experience with CI/CD pipelines and deployment automation
  • Solid understanding of Linux, networking, and cloud platforms (AWS / Azure / GCP)
  • Experience with Infrastructure as Code (Terraform, Ansible)
  • Strong scripting skills (Python, Bash, or similar)
  • Database experience: MySQL / Oracle / MongoDB
  • Strong problem-solving, ownership mindset, and ability to lead initiatives
Job Responsibility
  • Lead troubleshooting and resolution of complex production issues in distributed systems
  • Drive reliability engineering practices, ensuring high availability and performance of systems
  • Manage and optimize messaging systems like Apache Kafka, RabbitMQ, and Redis
  • Architect, manage, and optimize Kubernetes clusters for scalability and resilience
  • Manage CI/CD pipelines and drive deployment automation
  • Implement and maintain monitoring, alerting, and observability using Prometheus, Grafana, and ELK stack
  • Lead incident management, root cause analysis (RCA), and post-mortem reviews
  • Mentor junior engineers and collaborate with cross-functional teams to improve system design and reliability
What we offer
  • Impactful Work: Play a key role in ensuring reliability and scalability of platforms that handle large-scale, real-time communication systems
  • Tremendous Growth Opportunities: Accelerate your career by leading critical reliability initiatives and working on high-scale distributed systems
  • Innovative Environment: Work in a fast-paced ecosystem that embraces automation, cloud-native technologies, and continuous improvement
  • Fulltime

Site Reliability Engineer

Microsoft is a company where passionate innovators come to collaborate, envision...
Location: India, Bangalore
Salary: Not provided
Microsoft Corporation (https://www.microsoft.com/)
Expiration Date: Until further notice
Requirements
  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: the Microsoft Cloud Background Check, which must be passed upon hire/transfer and every two years thereafter.
Job Responsibility
  • Work with all aspects of a high throughput and multi-tenant service
  • Collaborate effectively within the team and with partner teams across Microsoft
  • Be part of the on-call rotation for maintaining service health
  • Design, implement, and refine chosen solutions in close partnership with Product Management and partner teams
  • Champion operational excellence via established metrics, process governance, and policy controls for regular assessment and improvement
  • Document and define existing data engineering processes, data and technology, while evaluating them for optimization
  • System Reliability & Uptime – Ensuring high availability of services
  • Incident Management – Detecting, responding to, and mitigating system failures
  • Performance Monitoring – Tracking system health and resolving bottlenecks
  • Automation & Tooling – Reducing manual work through scripts and automation
  • Fulltime