The company has almost 100 million customers based in Japan and 1 billion globally, and provides more than 70 services across areas such as e-commerce, payment services, financial services, telecommunications, media, and sports.
Job Responsibilities:
Optimize LLM training frameworks (e.g., PyTorch, DeepSpeed, Megatron-LM, FSDP) to maximize GPU utilization and reduce training time
Profile and optimize distributed training bottlenecks (e.g., NCCL issues, CUDA kernel efficiency, communication overhead)
Implement and tune inference optimizations (e.g., quantization, dynamic batching, KV caching) for low-latency, high-throughput LLM serving (vLLM, TensorRT-LLM, Triton, SGLang)
Collaborate with infrastructure teams to improve GPU cluster scheduling, resource allocation, and fault tolerance for large-scale training jobs
Develop benchmarking tools to measure and improve training throughput, memory efficiency, and inference latency (a minimal sketch of such a measurement appears after this list)
Research and apply cutting-edge techniques (e.g., mixture-of-experts, speculative decoding) to optimize LLM performance
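As an illustration of the benchmarking responsibility above, the sketch below times repeated forward/backward steps on a toy model and reports throughput and peak GPU memory. It is only a minimal example under assumed conditions (a single CUDA device, a small MLP, synthetic data); the function name benchmark_training and all sizes are hypothetical and not taken from the team's actual tooling.

```python
# Minimal benchmarking sketch: per-step training throughput (samples/sec)
# and peak GPU memory for a toy model. Model, batch shape, and step counts
# are illustrative assumptions, not details of the actual role or codebase.
import time
import torch
import torch.nn as nn


def benchmark_training(model, inputs, targets, optimizer, warmup=5, iters=20):
    """Run warmup + timed forward/backward steps; return (samples/sec, peak GiB)."""
    device = inputs.device
    loss_fn = nn.CrossEntropyLoss()
    torch.cuda.reset_peak_memory_stats(device)

    for step in range(warmup + iters):
        if step == warmup:                      # start timing only after warmup
            torch.cuda.synchronize(device)
            start = time.perf_counter()
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

    torch.cuda.synchronize(device)              # wait for all queued kernels to finish
    elapsed = time.perf_counter() - start
    samples_per_sec = inputs.size(0) * iters / elapsed
    peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
    return samples_per_sec, peak_gib


if __name__ == "__main__" and torch.cuda.is_available():
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 10)).cuda()
    inputs = torch.randn(64, 1024, device="cuda")
    targets = torch.randint(0, 10, (64,), device="cuda")
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    throughput, peak = benchmark_training(model, inputs, targets, optimizer)
    print(f"{throughput:.1f} samples/sec, peak memory {peak:.2f} GiB")
```

In practice the same timing pattern extends to distributed runs (wrapping the model in FSDP or DeepSpeed and aggregating per-rank throughput), but that is beyond this short sketch.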
Requirements:
3+ years of hands-on experience in GPU-accelerated ML training & inference optimization, preferably for LLMs or large-scale deep learning models
Deep expertise in PyTorch, DeepSpeed, FSDP, or Megatron-LM, with experience in distributed training optimizations