Site Reliability Operations III Job at Walmart (Bentonville)

Job Description

The Command & Control Center is the nerve center for Walmart Global Technology. On the Logistics Support team, we proactively monitor critical supply chain applications and infrastructure, providing early warnings and rapid response to potential disruptions. Our team ensures seamless operations by swiftly mitigating incidents and leveraging advanced automation and AI-driven monitoring to keep Walmart’s supply chain resilient and efficient.

Job Responsibility

Monitor and alert on software or system performance, determining thresholds for monitoring metrics and triggering alerts based on those thresholds
Supervise specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software
Investigate and diagnose incidents to restore a failed IT service as quickly as possible and within specified SLAs
Document troubleshooting steps and service restoration details for knowledge management
Act as a liaison between Tech and external support to resolve escalated incidents and ensure timely closure
Record and classify received incidents and undertake immediate corrective action for moderate-complexity queries under moderate supervision
Research and recommend alternative actions for incident resolution
Contribute to command-and-control related activities focused on restoration of complex outages
Conduct complex maintenance procedures for applications independently
Monitor and evaluate the performance of the application by tracking and analyzing appropriate metrics
Perform maintenance (corrective, adaptive, perfective) and re-engineering activities
Analyze application logs, maintenance activity data, and performance data, and report the findings
Evaluate change requests to identify those which are valid and feasible
Troubleshoot performance and availability bottlenecks for assigned applications independently
Triage defects to distinguish symptoms from underlying causes
Actively provide data for and participate in root cause analysis (RCA)
Build, maintain, and enhance effective internal and external partnerships
Influence technical outcomes and assist in communicating shared goals with diverse groups and parties
Identify and address additional partner technical needs and educate them on value creation
Communicate with other individuals or teams to solve shared business problems cooperatively
Bring ideas and technical solutions proactively to business partners and stakeholders

Requirements

Strong communication and interpersonal skills
Experience with Jira, Looper, and Kubernetes
Familiarity with Grafana and ability to write queries (PromQL)
GitHub experience
Database knowledge is preferable but not required
Ability to work independently and make decisions with guidance
Ability to understand changes to methodologies and resources and to articulate them clearly
Experience with cloud applications and ability to pull logs
Strong analytical and problem-solving skills
Ability to work collaboratively with cross-functional teams
Experience with incident management and troubleshooting
Strong technical skills, including proficiency in monitoring and alerting, incident management, and DevOps orientation
Immigration sponsorship is not available for this role

Nice to have

Experience in site reliability operations, site and system administration, infrastructure management, or related area
Master's degree in site reliability operations, site and system administration, infrastructure management, or related area
SRE certification (for example, IBM Cloud Site Reliability Engineer)
We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly. The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart’s accessibility standards and guidelines for supporting an inclusive culture.

What we offer

Multiple health plan options, including vision & dental plans for you & dependents
Financial benefits including 401(k), stock purchase plans, life insurance and more
Associate discounts in-store and online
Education assistance for Associates and dependents
Parental Leave
Pay during military service
Paid time off, including vacation, sick, and parental leave
Short-term and long-term disability for when you can't work because of injury, illness, or childbirth
Performance-based incentive awards
Adoption and surrogacy expense reimbursement
