Site Reliability Engineer III

United States · 148320.00 - 185400.00 USD / Year · Job Posted April 19, 2026

Job Description

We're looking for a senior Site Reliability Engineer to join our small, high-ownership SRE team. In this hands-on individual contributor role, you'll own the reliability, scalability, and security of AbsenceSoft's production infrastructure on AWS — supporting a B2B SaaS platform that processes sensitive employee leave data for enterprise customers. You'll work closely with infrastructure, application engineering, product leadership, and cross-functional partners in Security and Compliance, with a clear path to grow toward a Tech Lead opportunity as our team and platform continue to mature.

Job Responsibility

  • Architect, implement, and operate scalable, resilient, and secure AWS infrastructure
  • Lead infrastructure-as-code initiatives to ensure all environments are reproducible, auditable, and consistently configured
  • Design, maintain, and improve CI/CD pipelines using Jenkins and GitHub
  • Own the Datadog observability platform, including dashboards, monitors, alerting thresholds, and log management
  • Define and maintain SLOs, SLIs, and error budgets
  • Serve as a senior technical responder across the full incident lifecycle within a shared on-call rotation
  • Lead blameless postmortems
  • Refine, implement, and test disaster recovery plans to meet RTO/RPO objectives
  • Contribute to SOC 2 audit readiness with a focus on access controls, incident response, and risk mitigation
  • Mentor junior SREs through code reviews, incident pairing, and documentation

Requirements

  • 5+ years of experience in SRE, DevOps, or a related engineering role
  • Advanced hands-on expertise in AWS production environments and core services including Lambda, ECS, S3, ALB, and GuardDuty
  • Strong proficiency in infrastructure-as-code tooling such as Terraform, CloudFormation, or CDK
  • Experience building and operating CI/CD pipelines using Jenkins and GitHub
  • Proficiency in Python, Go, or Bash for automation
  • Hands-on experience with Datadog or a comparable observability platform for monitoring, alerting, and log management
  • Demonstrated experience leading incident response in complex, distributed systems
  • Working knowledge of SLO/SLI frameworks, error budgets, and disaster recovery planning against defined RTO/RPO objectives
  • Familiarity with SOC 2 compliance frameworks and experience contributing to audit readiness, access controls, and security control evidence collection
  • A collaborative, ownership-driven mindset with strong communication skills
  • A passion for mentoring junior engineers
  • A commitment to reducing toil through automation and AI-assisted tooling

What we offer

  • Impact that matters
  • Flexibility and trust: remote-first and results-driven
  • Growth and development: access to learning resources, leadership programs, and real opportunities to take on new challenges
  • Competitive rewards: comprehensive benefits, performance-based bonus program, and equity opportunities
  • Time for life: flexible time off, paid holidays, and flexible leave programs
  • Belonging and balance: inclusive culture

Similar Jobs for Site Reliability Engineer III

8 matching positions

Site Reliability Engineer III

The Site Reliability Engineer is responsible for designing, developing, and main...
Amgen
Location: India, Hyderabad
Salary: Not provided
Expiration Date: Until further notice
Requirements
  • Doctorate degree OR 6 to 10 years of Computer Science, IT or related field experience OR
  • Master’s degree and 7 to 10 years of Computer Science, IT or related field experience OR
  • Bachelor’s degree and 8 to 12 years of Computer Science, IT or related field experience
  • Working experience with various cloud services on AWS (Azure, GCP) and containerization technologies (Docker, Kubernetes)
  • Strong programming skills in languages such as Python
  • Working experience of infrastructure as code (IaC) tools (Terraform, CloudFormation)
  • Working experience with monitoring and alerting tools (Prometheus, Grafana, etc.)
  • Working experience with DevOps/MLOps practice and CI/CD pipelines
  • Proficiency in automated testing tools and frameworks (e.g., Selenium, JUnit, pytest), incident management, root cause analysis of production issues, and system quality improvement
Job Responsibility
  • Design and implement systems and processes to improve the reliability, scalability, and performance of applications
  • Automate routine operational tasks, such as deployments, monitoring, and incident response, to improve efficiency and reduce human error
  • Develop and maintain monitoring tools and dashboards to track system health, performance, and availability
  • Respond to and resolve incidents promptly, conducting root cause analysis and implementing preventive measures
  • Provide ongoing maintenance and support for existing systems, ensuring that they are secure, efficient, and reliable
  • Work on integrating various software applications and platforms to ensure seamless operation across the organization
  • Implement and maintain security measures to protect systems from unauthorized access and other threats
What we offer
  • Competitive and comprehensive Total Rewards Plans that are aligned with local industry standards

Site Reliability Engineer III

Under limited supervision, the Site Reliability Engineer III is responsible for ...
Alliance Automotive UK LV Ltd
Location: United States, Birmingham
Salary: Not provided
Expiration Date: Until further notice
Requirements
  • Typically requires a bachelor's degree and five (5) or more years of related experience or an equivalent combination
  • Understanding of Kubernetes, containers, clusters, and elastic scalability
  • Expertise in SRE principles
  • Mindset of continually finding ways to drive scalability, stability, and performance
  • Cloud Services experience with Google Cloud Platform (GCP)
  • Experience with API, service-based or microservice-based architecture
  • Proficiency in infrastructure, network, database, operating systems, or security troubleshooting and remediation
  • Architecture-level knowledge of Windows and Linux and Infrastructure systems
  • Experience with production deployment, monitoring, and operational support for enterprise-class applications (Dynatrace a plus)
  • Experience working with Continuous Integration/ Continuous Deployment tools
Job Responsibility
  • Gathers and analyzes metrics from monitoring platforms to assist in performance tuning and fault tolerance
  • Partners with development teams to improve services through testing and release procedures
  • Participates in system design, platform management and capacity planning
  • Balances feature development speed and reliability with service-level objectives
  • Works closely with the incident response team to restore service to normal operation
  • Understands debugging and applies troubleshooting skills
  • Investigates, blocks and rate-limits unwanted traffic
  • Utilizes monitoring systems and dashboards for proactive changes and alerting
  • Establishes continuous process improvement cycles where the process, performance, and supporting technologies are reviewed and enhanced where applicable
  • Performs other duties as assigned
What we offer
  • Options for healthcare coverage, 401(k), tuition reimbursement, vacation, sick, and holiday pay
  • Full-time

Site Reliability Engineer III

Under limited supervision, the Site Reliability Engineer III is responsible for ...
Genuine Parts Company
Location: United States, Birmingham, Alabama
Salary: Not provided
Expiration Date: Until further notice
Requirements
  • Typically requires a bachelor's degree and five (5) or more years of related experience or an equivalent combination
  • Understanding of Kubernetes, containers, clusters, and elastic scalability
  • Expertise in SRE principles
  • Mindset of continually finding ways to drive scalability, stability, and performance
  • Cloud Services experience with Google Cloud Platform (GCP)
  • Experience with API, service-based or microservice-based architecture
  • Proficiency in infrastructure, network, database, operating systems, or security troubleshooting and remediation
  • Architecture-level knowledge of Windows and Linux and Infrastructure systems
  • Experience with production deployment, monitoring, and operational support for enterprise-class applications (Dynatrace a plus)
  • Experience working with Continuous Integration/ Continuous Deployment tools
Job Responsibility
  • Gathers and analyzes metrics from monitoring platforms to assist in performance tuning and fault tolerance
  • Partners with development teams to improve services through testing and release procedures
  • Participates in system design, platform management and capacity planning
  • Balances feature development speed and reliability with service-level objectives
  • Works closely with the incident response team to restore service to normal operation
  • Understands debugging and applies troubleshooting skills
  • Investigates, blocks and rate-limits unwanted traffic
  • Utilizes monitoring systems and dashboards for proactive changes and alerting
  • Establishes continuous process improvement cycles where the process, performance, and supporting technologies are reviewed and enhanced where applicable
  • Performs other duties as assigned.
What we offer
  • Options for healthcare coverage, 401(k), tuition reimbursement, vacation, sick, and holiday pay.
  • Full-time

Site Reliability Engineer III

Zuora’s Cloud Engineering teams are responsible for Cloud infrastructures, monit...
Zuora
Location: India, Chennai
Salary: Not provided
Expiration Date: Until further notice
Requirements
  • 6-8 years of relevant experience in SRE/DevOps
  • Proven hands-on working experience with core AWS services (e.g., EC2, VPC, S3, RDS, IAM, CloudWatch, EKS/ECS)
  • Deep expertise in infrastructure-as-code principles using Terraform for provisioning and state management
  • Expert-level knowledge and practical experience with configuration management tools such as Puppet and/or Ansible
  • Strong experience setting up, maintaining, and enhancing Continuous Integration/Continuous Deployment pipelines using Jenkins
  • Proficiency in scripting languages, particularly Python and/or Shell scripting, for developing automation tools and performing system administration tasks
  • Advanced knowledge of Linux operating systems, including performance tuning, troubleshooting, security, and networking fundamentals
  • Working knowledge and operational experience with distributed messaging queues, specifically Kafka
Job Responsibility
  • Maintain and improve the reliability, scalability, and performance of our production systems, targeting a high-availability environment
  • Design, implement, and maintain automation solutions for infrastructure provisioning, deployment, configuration management, and monitoring using Terraform and Jenkins
  • Administer, manage, and optimize our cloud infrastructure primarily hosted on AWS, focusing on cost efficiency and secure operations
  • Develop and maintain infrastructure-as-code using Puppet and/or Ansible to ensure consistent and reproducible environments
  • Participate in on-call rotation, troubleshoot and resolve critical production incidents, and conduct comprehensive post-mortems to prevent recurrence
  • Apply strong Linux administration skills to manage, patch, and secure operating systems and underlying infrastructure
  • Manage and optimize distributed messaging systems, specifically Kafka, ensuring high throughput and data integrity
What we offer
  • Competitive compensation, variable bonus and performance reward opportunities, and retirement programs
  • Medical Insurance
  • Generous, flexible time off
  • Paid holidays, “wellness” days, and a company-wide end-of-year break
  • Learning & Development stipend
  • Opportunities to volunteer and give back, including charitable donation match
  • Free resources and support for your mental wellbeing

Site Reliability Operations III

The Command & Control Center is the nerve center for Walmart Global Technology. ...
Walmart
Location: United States of America, Bentonville
Salary: 80000.00 - 155000.00 USD / Year
Expiration Date: Until further notice
Requirements
  • Strong communication and interpersonal skills
  • Experience with Jira, Looper, and Kubernetes
  • Familiarity with Grafana and ability to write queries (PromQL)
  • GitHub experience
  • Database knowledge is preferable but not required
  • Ability to work independently and make decisions with guidance
  • Comprehension of changes to methodologies and resources, and ability to articulate the same
  • Experience with cloud applications and ability to pull logs
  • Strong analytical and problem-solving skills
  • Ability to work collaboratively with cross-functional teams
Job Responsibility
  • Monitor and alert on software or system performance, determining thresholds for monitoring metrics and triggering alerts based on those thresholds
  • Supervise specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software
  • Investigate and diagnose incidents to restore a failed IT service as quickly as possible and within specified SLAs
  • Document troubleshooting steps and service restoration details for knowledge management
  • Act as liaison between Tech and external support to resolve escalated incidents and ensure timely closure
  • Record and classify received incidents and undertake immediate corrective action for moderate complexity queries under moderate supervision
  • Research and recommend alternative actions for incident resolution
  • Contribute to command-and-control related activities focused on restoration of complex outages
  • Conduct complex maintenance procedures for applications independently
  • Monitor and evaluate the performance of the application by tracking and analyzing appropriate metrics
What we offer
  • Multiple health plan options, including vision & dental plans for you & dependents
  • Financial benefits including 401(k), stock purchase plans, life insurance and more
  • Associate discounts in-store and online
  • Education assistance for Associate and dependents
  • Parental Leave
  • Pay during military service
  • Paid time off, including vacation, sick, and parental leave
  • Short-term and long-term disability for when you can't work because of injury, illness, or childbirth
  • Incentive awards for your performance
  • Maternity and parental leave, PTO, health benefits
  • Full-time

Controls Engineer III – Robotic Arm

As a Controls Engineer – Robotics, you will own the full lifecycle engineering o...
Autonomous Solutions
Location: United States, Mendon
Salary: 110000.00 - 130000.00 USD / Year
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Robotics, Mechatronics, Mechanical Engineering, Electrical Engineering, or related field
  • 5+ years of hands-on experience in robotic systems integration, with demonstrated end-to-end ownership - from concept and design through commissioning and field support
  • Strong experience with FANUC programming (TP/KAREL), system configuration, and integration
  • Meaningful, hands-on experience with vision systems (2D and/or 3D) - not just academic exposure. Must be able to configure, calibrate, troubleshoot, and adapt vision systems in real-world deployments
  • Proven ability to design or collaborate on custom EOAT and mechanical interface solutions
  • Demonstrated experience working across hardware/software boundaries, including sensors, PLCs, and system interfaces
  • Strong troubleshooting skills across mechanical, electrical, and software domains
  • Ability to operate independently in fast-paced, ambiguous, field-deployed environments
  • Strong communication and collaboration skills across cross-functional teams
Job Responsibility
  • Own end-to-end integration of FANUC robotic arm systems - from system architecture and motion planning through field deployment and ongoing optimization
  • Lead vision system integration and problem-solving, with particular focus on adapting 2D/3D vision to outdoor, variable-lighting environments including direct sun, glare, shadow, and low-light conditions
  • Partner with mechanical and electrical teams to design, iterate, and validate end-of-arm tooling (EOAT) and interface mechanisms - including custom tooling solutions for novel connection challenges
  • Develop robust motion strategies, error handling, and recovery routines capable of performing reliably in real-world, uncontrolled conditions
  • Troubleshoot complex field issues across mechanical, electrical, and software domains using structured root cause analysis
  • Collaborate with third-party vision system vendors and integrate their solutions into ASI's robotics platform
  • Drive testing, validation, and performance benchmarking against reliability, safety, and uptime requirements
  • Support on-site deployments and work directly with operations teams to ensure sustained system performance
  • Mentor junior engineers and provide technical leadership within the robotics team
  • Contribute to engineering standards, documentation, and continuous improvement processes
  • Full-time

Network Engineer III

We are seeking talented, experienced Network Engineering professionals to join t...
Arcfield
Location: United States, Huntsville
Salary: 79119.18 - 190122.20 USD / Year
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Network Engineering, Computer Engineering, Computer Science, or a related technical field with 5-7 years of experience, or a Master's degree with 3-5 years of experience, or a PhD with 0-2 years of experience
  • Valid Security+ or DoD Directive 8570.01 IAT Level II certification
  • Valid Cisco Certified Network Associate (CCNA) certification or higher
  • 3+ years of general work experience designing, developing, and implementing network architecture
  • 3+ years of experience configuring and troubleshooting routers, switches, and associated network protocols such as OSPF, EIGRP, and Rapid PVST
  • 3+ years of experience supporting network security hardware and solutions like Cisco ASA firewalls, IPSEC, Access Control Lists (ACLs), and Network Address Translation (NAT)
  • Experience with network analysis tools such as SolarWinds and Wireshark
  • Experience with Visio
  • Possess and maintain a Secret clearance
Job Responsibility
  • Develop and deploy communications architectures to meet dynamic mission requirements across multiple ranges
  • Design, install, maintain, and repair deployable communications sites, ensuring optimal performance and reliability
  • Route data with type 1 encryption between various ranges and sites, ensuring secure and reliable communication
  • Configure and secure enterprise services, including routers, switches, firewalls, and access control solutions
  • Troubleshoot and optimize routing protocols such as OSPF and EIGRP, along with QoS, VPNs, and Spanning Tree Protocol
  • Interface with analog voice communications systems across multiple locations to ensure seamless integration
  • Maintain and update systems with patches to comply with Risk Management Framework (RMF) Information Assurance Vulnerability Management (IAVM) requirements
  • Collaborate with systems administrators on other standalone systems to improve their network architecture
  • Develop comprehensive system documentation, including standard operating procedures and network drawings
  • Regularly report project status to the Lead Network Engineer and Senior Management
What we offer
  • Health Insurance
  • Life Insurance
  • Paid Time Off
  • Holiday Pay
  • Short Term and Long-Term Disability
  • Retirement and Savings
  • Learning and Development opportunities
  • Wellness programs
  • Full-time

Network Engineer III

We are seeking talented, experienced Network Engineering professionals to join t...
Arcfield
Location: United States, Huntsville
Salary: Not provided
Expiration Date: Until further notice
Requirements
  • Bachelor's degree in Network Engineering, Computer Engineering, Computer Science, or a related technical field with 5-7 years of experience, or a Master's degree with 3-5 years of experience, or a PhD with 0-2 years of experience
  • Valid Security+ or DoD Directive 8570.01 IAT Level II certification
  • Valid Cisco Certified Network Associate (CCNA) certification or higher
  • 3+ years of general work experience designing, developing, and implementing network architecture
  • 3+ years of experience configuring and troubleshooting routers, switches, and associated network protocols such as OSPF, EIGRP, and Rapid PVST
  • 3+ years of experience supporting network security hardware and solutions like Cisco ASA firewalls, IPSEC, Access Control Lists (ACLs), and Network Address Translation (NAT)
  • Experience with network analysis tools such as SolarWinds and Wireshark
  • Experience with Visio
  • Possess and maintain a Secret clearance
Job Responsibility
  • Develop and deploy communications architectures to meet dynamic mission requirements across multiple ranges
  • Design, install, maintain, and repair deployable communications sites, ensuring optimal performance and reliability
  • Route data with type 1 encryption between various ranges and sites, ensuring secure and reliable communication
  • Configure and secure enterprise services, including routers, switches, firewalls, and access control solutions
  • Troubleshoot and optimize routing protocols such as OSPF and EIGRP, along with QoS, VPNs, and Spanning Tree Protocol
  • Interface with analog voice communications systems across multiple locations to ensure seamless integration
  • Maintain and update systems with patches to comply with Risk Management Framework (RMF) Information Assurance Vulnerability Management (IAVM) requirements
  • Collaborate with systems administrators on other standalone systems to improve their network architecture
  • Develop comprehensive system documentation, including standard operating procedures and network drawings
  • Regularly report project status to the Lead Network Engineer and Senior Management
  • Full-time