The R&D of Search Ads aims to build an online advertising ecosystem connecting users, advertisers, and the search engine. The Bing Search Ads Understanding team is chartered to deliver world-class algorithms using web-scale data. Our mission is to drive user satisfaction, advertiser ROI, and Bing revenue. A core challenge is matching advertisers' ad displays to users' queries by building an intelligent system that truly understands users' needs. This is a very hard problem that demands the most advanced AI models and sophisticated engineering systems. Join us to work on projects highly strategic to Bing search in a fun and fast-paced environment!

We are hiring a Senior Software Engineer (GPU Inference Optimization) to optimize GPU inference of language models, supporting GPU serving for Ads tasks including query rewriting, ad relevance, and ad creative generation. As a member of this team, you will have the opportunity to work on the fundamental abstractions, programming models, runtimes, libraries, and APIs that enable large-scale inference and online serving of models on novel AI hardware. This is a technical role focused on GPU inference optimization of language models; it requires hands-on software development skills. We're looking for someone with a demonstrated history of solving hard technical problems who is motivated to tackle the hardest problems in building a full end-to-end AI stack. An entrepreneurial approach and the ability to take initiative and move fast are essential.
Job Responsibilities:
Design, develop, and maintain high-performance software in C/C++ and Python, including GPU programming with CUDA, ROCm, or Triton
Optimize model inference and training pipelines for speed, throughput, memory efficiency, and cost across GPU platforms
Collaborate with platform teams to integrate and tune solutions on emerging accelerator stacks and rapidly evolving toolchains
Profile workloads end-to-end, identify bottlenecks, and implement kernel-level and system-level performance improvements
Partner with internal and external stakeholders to translate requirements into scalable performance features and optimizations for state-of-the-art models
Validate performance, stability, and correctness through benchmarking, automated testing, and production readiness reviews
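To give a flavor of the benchmarking work described above, here is a minimal, framework-agnostic latency-measurement sketch in Python. The function name `benchmark_latency` and its parameters are illustrative only, not part of any Bing-internal tooling; a real GPU harness would additionally synchronize the device (e.g., `torch.cuda.synchronize()`) before reading timestamps.

```python
import statistics
import time

def benchmark_latency(fn, *, warmup=10, iters=100):
    """Measure per-call latency of `fn` after a warmup phase.

    Returns p50/p99 latency in milliseconds and throughput in calls/sec.
    (Illustrative sketch; real GPU benchmarks must also account for
    asynchronous kernel launches and device synchronization.)
    """
    for _ in range(warmup):  # warm caches, JITs, and allocators first
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)  # ms
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        # index of the 99th-percentile sample, clamped to the last element
        "p99_ms": samples[min(iters - 1, int(iters * 0.99))],
        # iters calls over sum(samples) milliseconds -> calls per second
        "throughput_per_s": 1e3 * iters / sum(samples),
    }

# Example: a cheap stand-in for a model forward pass.
stats = benchmark_latency(lambda: sum(range(10_000)), warmup=5, iters=50)
```

Reporting tail latency (p99) alongside the median matters in online serving, where occasional slow calls dominate user-perceived quality.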
Requirements:
Bachelor's Degree in Computer Science or related technical field AND 4+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, Python, CUDA, or ROCm; OR equivalent experience
3+ years of practical experience working on GPU-accelerated applications, including optimizing their performance
Practical experience writing new GPU kernels, beyond running GPU workloads with existing library kernels
Cross-team collaboration skills and the desire to work in a team of researchers and developers
Bachelor's Degree in Computer Science or related technical field AND 8+ years of technical engineering experience with coding in languages including, but not limited to, C/C++, CUDA, or ROCm; OR Master's Degree in Computer Science or related technical field AND 2+ years of technical engineering experience with coding in languages including, but not limited to, C/C++, CUDA, or ROCm; OR equivalent experience
Experience in low-level performance analysis and optimization, including proficiency with GPU profiling tools such as NVIDIA Visual Profiler and NVIDIA Nsight Compute
Technical background and solid foundation in software engineering principles and architecture design
Familiarity with inference optimization, and experience developing with popular inference frameworks such as TensorRT-LLM, SGLang, or vLLM
Exposure to deep neural network inference, and experience with one or more deep learning frameworks such as PyTorch, TensorFlow, or ONNX Runtime