Site Reliability Operations III

Walmart

Location:
United States of America, Bentonville

Contract Type:
Employment contract

Salary:

80000.00 - 155000.00 USD / Year

Job Description:

The Command & Control Center is the nerve center for Walmart Global Technology. On the Logistics Support team, we proactively monitor critical supply chain applications and infrastructure, providing early warnings and rapid response to potential disruptions. Our team ensures seamless operations by swiftly mitigating incidents and leveraging advanced automation and AI-driven monitoring to keep Walmart’s supply chain resilient and efficient.

Job Responsibility:

  • Monitor and alert on software or system performance, determining thresholds for monitoring metrics and triggering alerts based on those thresholds
  • Supervise specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software
  • Investigate and diagnose incidents to restore a failed IT service as quickly as possible and within specified SLAs
  • Document troubleshooting steps and service restoration details for knowledge management
  • Liaise between Tech and external support to resolve escalated incidents and ensure timely closure
  • Record and classify received incidents and undertake immediate corrective action for moderate complexity queries under moderate supervision
  • Research and recommend alternative actions for incident resolution
  • Contribute to command-and-control related activities focused on restoration of complex outages
  • Conduct complex maintenance procedures for applications independently
  • Monitor and evaluate the performance of the application by tracking and analyzing appropriate metrics
  • Perform maintenance (corrective, adaptive, perfective) and re-engineering activities
  • Analyze application logs, maintenance activity data, performance data, and provide analysis
  • Evaluate change requests to identify those which are valid and feasible
  • Troubleshoot performance and availability bottlenecks for assigned application independently
  • Triage to detect and determine symptom versus cause of defects
  • Actively provide data for and participate in RCA
  • Build, maintain, and enhance effective internal and external partnerships
  • Influence technical outcomes and assist in communicating shared goals with diverse groups and parties
  • Identify and address additional partner technical needs and educate them on value creation
  • Communicate with other individuals or teams to solve shared business problems cooperatively
  • Bring ideas and technical solutions proactively to business partners and stakeholders

Requirements:

  • Strong communication and interpersonal skills
  • Experience with Jira, Looper, and Kubernetes
  • Familiarity with Grafana and ability to write queries (PromQL)
  • GitHub experience
  • Database knowledge is preferable but not required
  • Ability to work independently and make decisions with guidance
  • Comprehension of changes to methodologies and resources, and ability to articulate the same
  • Experience with cloud applications and ability to pull logs
  • Strong analytical and problem-solving skills
  • Ability to work collaboratively with cross-functional teams
  • Experience with incident management and troubleshooting
  • Strong technical skills, including proficiency in monitoring and alerting, incident management, and DevOps orientation
  • Immigration sponsorship is not available for this role

Nice to have:

  • Experience in site reliability operations, site and system administration, infrastructure management, or related area
  • Master's degree in site reliability operations, site and system administration, infrastructure management, or related area
  • SRE certification (for example, IBM Cloud Site Reliability Engineer)
  • We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly. The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart’s accessibility standards and guidelines for supporting an inclusive culture.

What we offer:

  • Multiple health plan options, including vision & dental plans for you & dependents
  • Financial benefits including 401(k), stock purchase plans, life insurance and more
  • Associate discounts in-store and online
  • Education assistance for Associate and dependents
  • Parental Leave
  • Pay during military service
  • Paid time off, including vacation, sick, and parental leave
  • Short-term and long-term disability for when you can't work because of injury, illness, or childbirth
  • Incentive awards for your performance
  • Maternity and parental leave, PTO, health benefits
  • Performance-based bonus awards
  • Company discounts
  • Adoption and surrogacy expense reimbursement

Additional Information:

Job Posted:
January 07, 2026

Employment Type:
Full-time
Work Type:
Hybrid work
Similar Jobs for Site Reliability Operations III

Site Reliability Engineer III

The Site Reliability Engineer is responsible for designing, developing, and main...
Amgen
Location:
India, Hyderabad
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Doctorate degree OR 6 to 10 years of Computer Science, IT or related field experience OR
  • Master’s degree and 7 to 10 years of Computer Science, IT or related field experience OR
  • Bachelor’s degree and 8 to 12 years of Computer Science, IT or related field experience
  • Working experience with various cloud services on AWS (Azure, GCP) and containerization technologies (Docker, Kubernetes)
  • Strong programming skills in languages such as Python
  • Working experience of infrastructure as code (IaC) tools (Terraform, CloudFormation)
  • Working experience with monitoring and alerting tools (Prometheus, Grafana, etc.)
  • Working experience with DevOps/MLOps practice and CI/CD pipelines
  • Proficiency in automated testing tools and frameworks (e.g., Selenium, JUnit, pytest), incident management, production issue root cause analysis, and system quality improvement
Job Responsibility:
  • Design and implement systems and processes to improve the reliability, scalability, and performance of applications
  • Automate routine operational tasks, such as deployments, monitoring, and incident response, to improve efficiency and reduce human error
  • Develop and maintain monitoring tools and dashboards to track system health, performance, and availability
  • Respond to and resolve incidents promptly, conducting root cause analysis and implementing preventive measures
  • Provide ongoing maintenance and support for existing systems, ensuring that they are secure, efficient, and reliable
  • Work on integrating various software applications and platforms to ensure seamless operation across the organization
  • Implement and maintain security measures to protect systems from unauthorized access and other threats
What we offer:
  • Competitive and comprehensive Total Rewards Plans that are aligned with local industry standards

Site Reliability Engineer III

Under limited supervision, the Site Reliability Engineer III is responsible for ...
Alliance Automotive UK LV Ltd
Location:
United States, Birmingham
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Typically requires a bachelor's degree and five (5) or more years of related experience or an equivalent combination
  • Understanding of Kubernetes, containers, clusters, and elastic scalability
  • Expertise in SRE principles
  • Mindset of continually finding ways to drive scalability, stability, and performance
  • Cloud Services experience with Google Cloud Platform (GCP)
  • Experience with API, service-based or microservice-based architecture
  • Proficiency in infrastructure, network, database, operating systems, or security troubleshooting and remediation
  • Architecture-level knowledge of Windows and Linux and Infrastructure systems
  • Experience with production deployment, monitoring, and operational support for enterprise-class applications (Dynatrace a plus)
  • Experience working with Continuous Integration/Continuous Deployment tools
Job Responsibility:
  • Gathers and analyzes metrics from monitoring platforms to assist in performance tuning and fault tolerance
  • Partners with development teams to improve services through testing and release procedures
  • Participates in system design, platform management and capacity planning
  • Balances feature development speed and reliability with service-level objectives
  • Works closely with the incident response team to restore service to normal operation
  • Understands debugging and applies troubleshooting skills
  • Investigates, blocks and rate-limits unwanted traffic
  • Utilizes monitoring systems and dashboards for proactive changes and alerting
  • Establishes continuous process improvement cycles where the process, performance, and supporting technologies are reviewed and enhanced where applicable
  • Performs other duties as assigned
What we offer:
  • Options for healthcare coverage, 401(k), tuition reimbursement, vacation, sick, and holiday pay
Employment Type:
Full-time

Project/Program Manager III

The Project / Program Manager III is responsible for coordinating and delivering...
Robert Half
Location:
United States, Hanover
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • At least 6 years of experience in project or program management
  • Strong program management skills with significant coordination across operations and technical teams
  • Technical background with hands-on engineering exposure in complex environments
  • Experience with networking systems, testing, or technology integrations
  • Proven ability to manage vendors and external partners in live operational settings
  • Strong organizational, communication, and stakeholder management skills
  • Ability to operate effectively in a fast-paced, high-volume work environment
  • Willingness and flexibility to support evening work windows as required
  • Bachelor's degree preferred, or equivalent combination of education and experience
Job Responsibility:
  • Oversee end-to-end project and program execution, including planning, scheduling, scope control, and milestone tracking
  • Coordinate and manage retrofit projects across multiple active sites, including conveyor system and automated material handling upgrades
  • Manage vendor performance on-site through deployment phases, ensuring adherence to scope, quality standards, and timelines
  • Proactively identify and mitigate project risks to prevent delays and operational disruption
  • Travel regularly to assigned sites to monitor progress, validate quality of work, and confirm milestone completion
  • Partner with on-site stakeholders, including engineering, reliability, maintenance, and operations teams, to ensure alignment and smooth execution
  • Review engineering documentation and technical deliverables to support successful system integration
  • Manage temporary systems and transition plans during retrofit and deployment activities
  • Prepare and deliver regular status updates and reporting for management and leadership
  • Coordinate activities across up to 12 sites, ensuring execution targets are met by year-end
What we offer:
  • medical
  • vision
  • dental
  • life and disability insurance
  • 401(k) plan
Employment Type:
Full-time

Site Reliability Engineer III

Zuora’s Cloud Engineering teams are responsible for Cloud infrastructures, monit...
Zuora
Location:
India, Chennai
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • 6-8 years of relevant experience in SRE/DevOps
  • Proven hands-on working experience with core AWS services (e.g., EC2, VPC, S3, RDS, IAM, CloudWatch, EKS/ECS)
  • Deep expertise in infrastructure-as-code principles using Terraform for provisioning and state management
  • Expert-level knowledge and practical experience with configuration management tools such as Puppet and/or Ansible
  • Strong experience setting up, maintaining, and enhancing Continuous Integration/Continuous Deployment pipelines using Jenkins
  • Proficiency in scripting languages, particularly Python and/or Shell scripting, for developing automation tools and performing system administration tasks
  • Advanced knowledge of Linux operating systems, including performance tuning, troubleshooting, security, and networking fundamentals
  • Working knowledge and operational experience with distributed messaging queues, specifically Kafka
Job Responsibility:
  • Maintain and improve the reliability, scalability, and performance of our production systems, targeting a high-availability environment
  • Design, implement, and maintain automation solutions for infrastructure provisioning, deployment, configuration management, and monitoring using Terraform and Jenkins
  • Administer, manage, and optimize our cloud infrastructure primarily hosted on AWS, focusing on cost efficiency and secure operations
  • Develop and maintain infrastructure-as-code using Puppet and/or Ansible to ensure consistent and reproducible environments
  • Participate in on-call rotation, troubleshoot and resolve critical production incidents, and conduct comprehensive post-mortems to prevent recurrence
  • Apply strong Linux administration skills to manage, patch, and secure operating systems and underlying infrastructure
  • Manage and optimize distributed messaging systems, specifically Kafka, ensuring high throughput and data integrity
What we offer:
  • Competitive compensation, variable bonus and performance reward opportunities, and retirement programs
  • Medical Insurance
  • Generous, flexible time off
  • Paid holidays, “wellness” days, and a company-wide end-of-year break
  • Learning & Development stipend
  • Opportunities to volunteer and give back, including charitable donation match
  • Free resources and support for your mental wellbeing

Site Reliability Engineer III

Under limited supervision, the Site Reliability Engineer III is responsible for ...
Genuine Parts Company
Location:
United States, Birmingham, Alabama
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Typically requires a bachelor's degree and five (5) or more years of related experience or an equivalent combination
  • Understanding of Kubernetes, containers, clusters, and elastic scalability
  • Expertise in SRE principles
  • Mindset of continually finding ways to drive scalability, stability, and performance
  • Cloud Services experience with Google Cloud Platform (GCP)
  • Experience with API, service-based or microservice-based architecture
  • Proficiency in infrastructure, network, database, operating systems, or security troubleshooting and remediation
  • Architecture-level knowledge of Windows and Linux and Infrastructure systems
  • Experience with production deployment, monitoring, and operational support for enterprise-class applications (Dynatrace a plus)
  • Experience working with Continuous Integration/Continuous Deployment tools
Job Responsibility:
  • Gathers and analyzes metrics from monitoring platforms to assist in performance tuning and fault tolerance
  • Partners with development teams to improve services through testing and release procedures
  • Participates in system design, platform management and capacity planning
  • Balances feature development speed and reliability with service-level objectives
  • Works closely with the incident response team to restore service to normal operation
  • Understands debugging and applies troubleshooting skills
  • Investigates, blocks and rate-limits unwanted traffic
  • Utilizes monitoring systems and dashboards for proactive changes and alerting
  • Establishes continuous process improvement cycles where the process, performance, and supporting technologies are reviewed and enhanced where applicable
  • Performs other duties as assigned.
What we offer:
  • Options for healthcare coverage, 401(k), tuition reimbursement, vacation, sick, and holiday pay
Employment Type:
Full-time

Site Reliability Engineer III

We're looking for a senior Site Reliability Engineer to join our small, high-own...
AbsenceSoft
Location:
United States
Salary:
148320.00 - 185400.00 USD / Year
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in SRE, DevOps, or a related engineering role
  • Advanced hands-on expertise in AWS production environments and core services including Lambda, ECS, S3, ALB, and GuardDuty
  • Strong proficiency in infrastructure-as-code tooling such as Terraform, CloudFormation, or CDK
  • Experience building and operating CI/CD pipelines using Jenkins and GitHub
  • Proficiency in Python, Go, or Bash for automation
  • Hands-on experience with Datadog or a comparable observability platform for monitoring, alerting, and log management
  • Demonstrated experience leading incident response in complex, distributed systems
  • Working knowledge of SLO/SLI frameworks, error budgets, and disaster recovery planning against defined RTO/RPO objectives
  • Familiarity with SOC 2 compliance frameworks and experience contributing to audit readiness, access controls, and security control evidence collection
  • A collaborative, ownership-driven mindset with strong communication skills
Job Responsibility:
  • Architect, implement, and operate scalable, resilient, and secure AWS infrastructure
  • Lead infrastructure-as-code initiatives to ensure all environments are reproducible, auditable, and consistently configured
  • Design, maintain, and improve CI/CD pipelines using Jenkins and GitHub
  • Own the Datadog observability platform, including dashboards, monitors, alerting thresholds, and log management
  • Define and maintain SLOs, SLIs, and error budgets
  • Serve as a senior technical responder across the full incident lifecycle within a shared on-call rotation
  • Lead blameless postmortems
  • Refine, implement, and test disaster recovery plans to meet RTO/RPO objectives
  • Contribute to SOC 2 audit readiness with a focus on access controls, incident response, and risk mitigation
  • Mentor junior SREs through code reviews, incident pairing, and documentation
What we offer:
  • Impact that matters
  • Flexibility and trust
  • Remote-first and results driven
  • Growth and development
  • Access to learning resources, leadership programs, and real opportunities to take on new challenges
  • Competitive rewards
  • Comprehensive benefits
  • Performance-based bonus program
  • Equity opportunities
  • Time for life
Employment Type:
Full-time

Maintenance Technician III

Summit Skilled Solutions is seeking an experienced Maintenance Technician III to...
Summit Skilled Solutions
Location:
United States, Orlando
Salary:
28.00 - 35.00 USD / Hour
Expiration Date:
Until further notice
Requirements:
  • High School Diploma or equivalent required
  • Minimum of two (2) years of training through an accredited trade school or college; certificate or degree preferred
  • Minimum of 7 years of experience in mechanical, electrical, plumbing, carpentry, or industrial maintenance
  • Valid state driver’s license
  • Forklift operator certification
  • Scissor lift and aerial lift (JLG) certification
  • Universal CFC certification required
  • Any required state or national trade licenses must be obtained and maintained by the employee
  • Proficient in the use of hand tools and small and large power tools
Job Responsibility:
  • Comply with all safety policies and procedures, including OSHA regulations and lockout/tagout requirements
  • Conduct routine “shift rounds” to inspect systems and equipment, identify issues, and document performance data
  • Maintain, troubleshoot, and repair facility equipment, including: electrical installation, repair, and maintenance of equipment and controls; installation, maintenance, and repair of plumbing and piping systems and related components; and installation, repair, and maintenance of mechanical and electrical operating equipment and machinery
  • Perform routine and ongoing assessments of building system operations
  • Conduct testing and data analysis to verify proper operation of site equipment
  • Monitor mechanical, electrical, and other facility systems to ensure reliable operation
  • Perform work in accordance with manufacturing standards and approved change-management processes
  • Complete administrative tasks including parts ordering, purchase order creation, vendor coordination, and participation in job and project meetings
Employment Type:
Full-time

Electronics Technician III

We are seeking an AV Software Engineer to join our Security and Electronic System...
M.C. Dean, Inc
Location:
Germany, Stuttgart
Salary:
Not provided
Expiration Date:
Until further notice
Requirements:
  • Active Secret Clearance Required
  • U.S. Citizenship
  • Ability to travel up to 25%
  • HS diploma or GED
  • Military Electronics Training (minimum 720 classroom hours) or Graduation from an accredited Electronics Technician program or Graduation from an Electrical Apprenticeship program
  • An additional three (3) years of electronics installation and/or maintenance activities
  • 6+ years of electronics installation and/or maintenance activities on multiple systems and with multiple customer programs
  • Strong oral, written, and presentation skills
  • Demonstrated background working with multidisciplinary teams
  • Demonstrated time management and organization skills to meet deadlines and quality objectives
Job Responsibility:
  • Executes various technical tasks and responsibilities within field operations
  • Performs on-site installations, maintenance, troubleshooting, and repairs of equipment and systems
  • Ensures the functionality and reliability of various technologies
  • Conducts site surveys, configures hardware and software, tests systems for proper operation, and provides technical support to customers
  • Utilizes and comprehends project Safety plan (JHA, AHA, PFW), enforcing M.C. Dean handbook and policies
  • Participates in quality reviews of M.C. Dean design documentation, analyzing and interpreting drawing packages to evaluate constructability
  • Tracks project metrics and participates in weekly resource allocation meetings
  • Verifies correct charges in timesheets on projects
  • Executes installation and maintenance activities within planned durations, ensuring completion of detailed documentation
  • Tracks and inventories tools, conducts tool inspections, and organizes material ordering and receiving
What we offer:
  • A collaborative team inspired by the way engineering and innovation enhance customer outcomes, improve lives, and change the world for the better
  • An opportunity to lead and build a business with the support of an industry-leading firm that has been in business for 75 years
  • Investment in your skills and expertise through a combination of professional and technical training programs, including leadership training and tuition reimbursement
  • Open and transparent communication with senior leadership as well as local office management
Employment Type:
Full-time