The company serves almost 100 million customers in Japan and roughly 1 billion worldwide, offering more than 70 services spanning e-commerce, payments, financial services, telecommunications, media, sports, and more. This position is on the GPUOD team.
Job Responsibilities:
Optimize Kubernetes (K8s) for GPU workloads, including scheduling policies, autoscaling, and multi-tenant resource isolation
Deploy and maintain inference serving platforms (e.g., NVIDIA Triton, vLLM, SGLang) for high-throughput, low-latency model deployment
Automate cluster provisioning, monitoring, and recovery to maximize uptime and GPU utilization
Collaborate with ML engineers to troubleshoot GPU-related issues in training jobs (e.g., NCCL errors, OOM) and inference bottlenecks
Implement observability tools (Prometheus, Grafana) to track GPU utilization, job performance, and cluster health
Develop infrastructure-as-code (IaC) solutions for reproducible GPU environments (e.g., Terraform, Ansible)
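As a concrete illustration of the Kubernetes responsibility above, a pod requesting a GPU through the NVIDIA device plugin looks roughly like the following sketch (the pod name, image tag, and node-selector label are placeholders; note that `nvidia.com/gpu` is an extended resource and is requested via `limits` only):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-example   # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.05-py3   # tag is illustrative
    resources:
      limits:
        nvidia.com/gpu: 1   # GPUs are scheduled as extended resources
  nodeSelector:
    gpu-type: a100   # hypothetical label used for multi-tenant isolation
```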
Minimum Qualifications:
3+ years of experience in DevOps/MLOps, GPU infrastructure, or distributed computing
Deep expertise in Kubernetes (K8s) for GPU workload orchestration (e.g., KubeFlow, Volcano, custom schedulers)
Strong programming skills in Go or Python for platform development, automation, and tooling
Proficiency in Linux system administration, performance tuning, and networking (e.g., RDMA, InfiniBand)
Experience with IaC tools (Terraform, Ansible) and CI/CD pipelines (GitHub Actions, Jenkins)
Bachelor's or higher degree in Computer Science, Engineering, or a related field
Strong teamwork and communication skills, with a passion for solving infrastructure challenges
Nice to have:
Familiarity with distributed training frameworks (e.g., PyTorch DDP, FSDP, DeepSpeed)
Familiarity with the NVIDIA Triton serving framework or a similar framework, and with tuning serving parameters to strike a good trade-off between latency and throughput
Hands-on experience with GPU clusters, including troubleshooting NVIDIA drivers, CUDA, and NCCL issues
Knowledge of high-performance storage (Lustre, WekaFS) for large-scale training data
Experience with LLM training/inference stacks (e.g., Megatron-LM, TensorRT-LLM)
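The latency/throughput trade-off mentioned above can be illustrated with a toy model (all numbers are hypothetical): larger server-side batches amortize the fixed per-pass overhead and raise throughput, but every request then pays the full batch latency.

```python
def throughput_and_latency(batch, fixed_ms=10.0, per_item_ms=2.0):
    """Toy serving model (hypothetical costs): one batched forward pass
    takes fixed_ms + batch * per_item_ms. Bigger batches improve
    throughput at the cost of higher per-request latency."""
    batch_ms = fixed_ms + batch * per_item_ms
    throughput = batch / (batch_ms / 1000.0)  # requests per second
    return throughput, batch_ms

for b in (1, 8, 32):
    tps, lat = throughput_and_latency(b)
    print(f"batch={b:2d}  throughput={tps:7.1f} req/s  latency={lat:5.1f} ms")
```

Real serving frameworks expose this trade-off through knobs such as maximum batch size and batching delay; the model above only shows why increasing batch size moves both metrics in opposite directions.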