The company serves almost 100 million customers in Japan and roughly 1 billion worldwide, offering more than 70 services spanning e-commerce, payments, financial services, telecommunications, media, sports, and more. This position is on the GPUOD team.
Job Responsibilities:
Optimize Kubernetes (K8s) for GPU workloads, including scheduling policies, autoscaling, and multi-tenant resource isolation
Deploy and maintain inference serving platforms (e.g., NVIDIA Triton, vLLM, SGLang) for high-throughput, low-latency model deployment
Automate cluster provisioning, monitoring, and recovery to maximize uptime and GPU utilization
Collaborate with ML engineers to troubleshoot GPU-related issues in training jobs (e.g., NCCL errors, OOM) and inference bottlenecks
Implement observability tools (Prometheus, Grafana) to track GPU utilization, job performance, and cluster health
Develop infrastructure-as-code (IaC) solutions for reproducible GPU environments (e.g., Terraform, Ansible)
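As a concrete illustration of the Kubernetes responsibility above, a pod requesting a GPU through the NVIDIA device plugin looks roughly like the following sketch (the pod name, image tag, and node-selector label are placeholders; note that `nvidia.com/gpu` is an extended resource and is requested via `limits` only):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-example   # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.05-py3   # tag is illustrative
    resources:
      limits:
        nvidia.com/gpu: 1   # GPUs are scheduled as extended resources
  nodeSelector:
    gpu-type: a100   # hypothetical label used for multi-tenant isolation
```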
Minimum Qualifications:
3+ years of experience in DevOps/MLOps, GPU infrastructure, or distributed computing
Deep expertise in Kubernetes (K8s) for GPU workload orchestration (e.g., KubeFlow, Volcano, custom schedulers)
Strong programming skills in Go or Python for platform development, automation, and tooling
Proficiency in Linux system administration, performance tuning, and networking (e.g., RDMA, InfiniBand)
Experience with IaC tools (Terraform, Ansible) and CI/CD pipelines (GitHub Actions, Jenkins)
Bachelor's or higher degree in Computer Science, Engineering, or a related field
Strong teamwork and communication skills, with a passion for solving infrastructure challenges
Nice to have:
Familiarity with distributed training frameworks (e.g., PyTorch DDP, FSDP, DeepSpeed)
Familiarity with the NVIDIA Triton serving framework or a similar framework, and with tuning serving parameters to strike a good trade-off between latency and throughput
Hands-on experience with GPU clusters, including troubleshooting NVIDIA drivers, CUDA, and NCCL issues
Knowledge of high-performance storage (Lustre, WekaFS) for large-scale training data
Experience with LLM training/inference stacks (e.g., Megatron-LM, TensorRT-LLM)
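The latency/throughput trade-off mentioned above can be illustrated with a toy model (all numbers are hypothetical): larger server-side batches amortize the fixed per-pass overhead and raise throughput, but every request then pays the full batch latency.

```python
def throughput_and_latency(batch, fixed_ms=10.0, per_item_ms=2.0):
    """Toy serving model (hypothetical costs): one batched forward pass
    takes fixed_ms + batch * per_item_ms. Bigger batches improve
    throughput at the cost of higher per-request latency."""
    batch_ms = fixed_ms + batch * per_item_ms
    throughput = batch / (batch_ms / 1000.0)  # requests per second
    return throughput, batch_ms

for b in (1, 8, 32):
    tps, lat = throughput_and_latency(b)
    print(f"batch={b:2d}  throughput={tps:7.1f} req/s  latency={lat:5.1f} ms")
```

Real serving frameworks expose this trade-off through knobs such as maximum batch size and batching delay; the model above only shows why increasing batch size moves both metrics in opposite directions.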