As a member of the Computing Product Line's Heterogeneous Memory Software Lab, you will research, design, and implement software components that enable tiered memory usage within end-to-end solutions. Your work will focus on advancing the AI ecosystem by designing and implementing libraries that leverage the capabilities of new Ascend AI hardware. You will help extend compiler infrastructure to support Triton on Ascend hardware, enabling high-performance kernel generation for AI workloads, and you will research new memory management techniques to improve performance and efficiency. Collaborating closely with research teams across the company, you will drive the innovation behind cutting-edge AI solutions.
Job Responsibilities:
Lead performance optimization of AI models on Ascend NPUs, including performance analysis, bottleneck identification, and optimization implementation for both training and inference workloads
Analyze performance bottlenecks of multimodal models and large language models (LLMs) on the Ascend platform, covering operators, kernels, memory access patterns, and scheduling
Design and implement optimization strategies to address identified bottlenecks
Develop and optimize critical operators/kernels, continuously improving execution efficiency, memory access patterns, parallelization strategies, and hardware resource utilization
Research and apply advanced techniques such as auto-tuning, operator fusion, graph optimization, and scheduling optimization in real-world production scenarios
Build and lead an NPU performance optimization team
Communicate findings to cross-functional teams and leadership, and contribute to the evolution of next-generation Ascend NPU architecture
Requirements:
Deep understanding of GPU or NPU architecture, including execution units, memory hierarchy, interconnects, and thread scheduling, as well as performance bottleneck analysis methodologies
Familiarity with mainstream deep learning frameworks such as PyTorch, TensorFlow, or JAX
Hands-on experience in deep learning operator/kernel development and performance tuning, with the ability to implement and optimize complex operators
Proficiency with performance analysis and profiling tools (e.g., Nsight Compute, nvprof, torch.profiler), and the ability to conduct quantitative analysis and performance modeling
Strong system design and software engineering skills, with the ability to balance performance, maintainability, and generality in complex systems
Master’s or Ph.D. degree in Computer Architecture, Compiler Design, High Performance Computing, or a related field
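As a minimal sketch of the kind of kernel-level profiling workflow mentioned in the requirements above, the following uses `torch.profiler` to break a forward pass down into per-operator timings (the model and tensor shapes here are illustrative assumptions, not part of the role description):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# A small illustrative model; in practice this would be an LLM or
# multimodal workload running on the target accelerator.
model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)

# Record CPU-side operator events and input shapes for the forward pass.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# Aggregate events by operator and print the most expensive ones,
# a first step toward bottleneck identification.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The same tracing approach extends to accelerator activity (e.g., `ProfilerActivity.CUDA` on NVIDIA GPUs), after which hot operators can be examined with lower-level tools such as Nsight Compute.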