Principal, Data Scientist Job at Walmart (Sunnyvale)

Job Description

The Reliability Engineering group at Walmart Global Tech builds intelligent, data-driven platforms that ensure the availability, performance, and efficiency of Walmart's enterprise and e-commerce systems at massive scale. The team leverages large-scale telemetry, automation, and machine learning to enable proactive optimization, faster incident detection, and resilient system behavior across thousands of services.

Job Responsibilities

Architect and implement end-to-end ML systems (data pipelines, feature engineering, model training, deployment, and monitoring)
Design scalable, low-latency model serving infrastructure integrated with Kubernetes and cloud-native systems
Build intelligent automation solutions including predictive autoscaling, anomaly detection, seasonality-aware forecasting, and capacity optimization
Engineer safe and reliable ML-driven automation that operates in high-availability environments
Own model lifecycle management, including validation, experiment tracking, model registry, monitoring, drift detection, and rollback strategies
Collaborate closely with platform, SRE, and infrastructure teams to embed ML capabilities into production systems
Drive engineering best practices around ML system reliability, observability, testing, and performance
Contribute to architectural decisions and mentor engineers on ML systems design

Requirements

10+ years of experience in software engineering with applied machine learning
Strong track record of building and operating ML systems in production
Experience owning systems end-to-end in distributed, high-availability environments
Experience leading technical initiatives or driving architectural decisions
Strong proficiency in one or more programming languages commonly used in ML engineering, such as Python, Go, or Java
Strong experience with ML frameworks such as Scikit-learn, PyTorch, TensorFlow, or similar
Strong SQL skills and experience working with large-scale datasets
Hands-on experience training, validating, and deploying machine learning models in production across domains such as recommendation systems, forecasting, anomaly detection, classification, or similar applied ML use cases
Experience building and maintaining end-to-end ML pipelines (data ingestion, feature engineering, training, evaluation, deployment, monitoring)
Experience with model serving architectures (REST/gRPC inference services, batch inference, streaming inference)
Hands-on experience with ML lifecycle platforms such as MLflow, Ray, Kubeflow, Airflow, or similar orchestration systems
Experience with experiment tracking, model registry, CI/CD for ML, feature management, and automated retraining workflows
Experience designing robust evaluation frameworks for traditional ML systems (offline validation, backtesting, shadow testing, A/B testing, and production performance monitoring)
Strong experience working with observability data (metrics, logs, traces) and time-series analysis in distributed systems
Hands-on experience deploying and operating ML systems on Kubernetes, including containerization using Docker
Experience working with major cloud platforms (AWS, GCP, or Azure) and cloud-native services
Strong understanding of distributed systems behavior (latency, throughput, failure modes, cascading effects)
Ability to design ML systems that balance accuracy, latency, reliability, and safety
Experience designing fault-tolerant, observable, and scalable ML-driven automation systems
Solid understanding of cloud-native architecture and infrastructure patterns
Option 1: Bachelor's degree in Statistics, Economics, Analytics, Mathematics, Computer Science, Information Technology, or a related field and 5 years' experience in an analytics-related field
Option 2: Master's degree in Statistics, Economics, Analytics, Mathematics, Computer Science, Information Technology, or a related field and 3 years' experience in an analytics-related field
Option 3: 7 years' experience in an analytics or related field

Nice to have

Experience building predictive or adaptive autoscaling systems
Experience with AIOps, demand prediction, anomaly detection, or incident prediction in distributed environments
Familiarity with streaming ML or online learning systems
Experience developing or integrating agentic systems, including orchestration, tool use, evaluation, and safety considerations in production environments
Familiarity with distributed training or large-scale data processing frameworks
Master’s degree in computer science, computer engineering, computer information systems, software engineering, or a related area and 3 years' experience in software engineering or a related area

What we offer

medical, vision and dental coverage
401(k)
stock purchase
company-paid life insurance
PTO (including sick leave)
parental leave
family care leave
bereavement leave
jury duty leave
voting leave
short-term and long-term disability
company discounts
military leave pay
adoption and surrogacy expense reimbursement
Live Better U education benefit program
annual or quarterly performance bonuses
stock
