About the Mission:
GM's vision of Zero Crashes, Zero Emissions, and Zero Congestion guides everything we do in autonomous and assisted driving. The AV organization is building advanced automated driving technologies, including Level 4-capable fully self-driving systems, to move us toward safer, more sustainable, and more accessible mobility. For the AI Kernels & Compilers team, that mission shows up in the details: turning cutting-edge perception, prediction, and planning research into production-grade software that can run efficiently and reliably on real vehicles at scale. We pioneer new approaches to model export, kernel development, and performance engineering so that every cycle on our accelerators translates into better situational awareness, faster reaction times, and more robust behavior on the road. If you want your compiler and kernel work to directly influence how automated vehicles understand and react to the world, while operating at the safety, reliability, and scale of a company like GM, this is where that impact becomes real.
About the Team:
The AI Kernels team builds high-performance GPU kernels and custom libraries that sit at the heart of our on-vehicle ML inference for ADAS and autonomous driving. We own making core AI workloads faster, more reliable, and easier to maintain and deploy on real cars, under real-world constraints. That means:
Designing and implementing custom operators when vendor libraries hit their limits
Integrating those kernels deep into our ML runtime stack
Debugging and tuning GPU performance across the AV software stack, often on hardware-in-the-loop setups
We partner closely with AI Solutions, AI Compilers, AI Architecture, and AI Tooling to ensure models deploy efficiently to the car while consistently meeting strict latency, throughput, and reliability targets. If you enjoy pushing GPUs to their limits and seeing your work directly impact how autonomous vehicles perceive and act in the world, this is the team for you.
Job Responsibilities:
Design, implement, benchmark, and iterate on CUDA-based kernels and custom operators to squeeze every last drop of performance out of on-vehicle inference workloads (for a flavor of this work, see the sketch after this list)
Build and improve tooling and infrastructure that make it easier to profile, debug, and validate CUDA kernels and accelerator-backend code across the AV stack
Partner with AI Solutions, Compilers, and Architecture to translate model and system requirements into concrete kernel roadmaps, priorities, and project plans
Collaborate with cross-functional teams (compiler, performance tooling, runtime, deployment solutions) to deliver reusable, reliable, high-performance libraries into production
Uphold high technical standards, methodologies, processes, and guidelines for GPU kernel development and performance engineering through code review
Manage relationships with internal customers to ensure our kernels and libraries meet real-world needs
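For a concrete flavor of the first two bullets, here is a minimal sketch of a custom CUDA operator with cudaEvent-based timing. The fused bias-add + ReLU operator, the kernel name, and the problem sizes are all hypothetical illustrations, not GM code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical fused custom operator: out[i] = max(in[i] + bias[i % n_bias], 0).
// Fusing the bias add and the activation into one kernel avoids an extra
// round trip through global memory compared to two separate launches.
__global__ void fused_bias_relu(const float* in, const float* bias,
                                float* out, int n, int n_bias) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i] + bias[i % n_bias];
        out[i] = v > 0.0f ? v : 0.0f;
    }
}

int main() {
    const int n = 1 << 20, n_bias = 256;
    float *in, *bias, *out;  // buffers left uninitialized; timing only
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&bias, n_bias * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);

    // cudaEvent timing: the usual first step before digging into
    // Nsight Systems / Nsight Compute for deeper analysis.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    fused_bias_relu<<<grid, block>>>(in, bias, out, n, n_bias);  // warm-up

    cudaEventRecord(start);
    for (int iter = 0; iter < 100; ++iter)
        fused_bias_relu<<<grid, block>>>(in, bias, out, n, n_bias);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg kernel time: %.3f us\n", 1000.0f * ms / 100);

    cudaFree(in); cudaFree(bias); cudaFree(out);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}
```

In practice a kernel like this would also be validated against a reference implementation and inspected under Nsight Systems or Nsight Compute before iterating on the next optimization.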
Requirements:
3+ years of relevant industry experience, or equivalent experience
BS, MS, or PhD in Computer Science or a related technical field
Excellent GPU programming skills in CUDA, with a thorough understanding of parallel programming patterns and GPU architecture (see the reduction sketch after this list for one such pattern)
Hands-on experience benchmarking, profiling, debugging, and optimizing accelerator libraries and kernels to extract optimal performance using the Nsight suite of tools or similar
Strong background in software architecture, library design, and design patterns
Strong C++ programming skills and the ability to work comfortably in large codebases
Solid background in system performance, high-performance computing, and/or architecture-aware optimization
Strong communication skills and the ability to work collaboratively within a team
Excellent analytical and problem-solving skills
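To make "parallel programming patterns" concrete, here is a generic textbook sketch (not GM code) of one of the most common patterns in kernel work: a block-level sum reduction built from warp shuffle intrinsics:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Classic pattern: tree reduction within a warp using shuffle intrinsics,
// then across warps via shared memory. Shuffles exchange registers
// directly, avoiding shared-memory traffic for the intra-warp phase.
__inline__ __device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

// Assumes blockDim.x is a multiple of 32.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float warp_sums[32];                 // one slot per warp
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    v = warp_reduce_sum(v);                         // intra-warp phase
    if (threadIdx.x % 32 == 0)
        warp_sums[threadIdx.x / 32] = v;            // lane 0 publishes
    __syncthreads();

    // First warp reduces the per-warp partial sums.
    if (threadIdx.x < 32) {
        int n_warps = blockDim.x / 32;
        v = (threadIdx.x < n_warps) ? warp_sums[threadIdx.x] : 0.0f;
        v = warp_reduce_sum(v);
        if (threadIdx.x == 0)
            atomicAdd(out, v);                      // accumulate across blocks
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    cudaMemset(out, 0, sizeof(float));
    block_sum<<<(n + 255) / 256, 256>>>(in, out, n);
    float h = 0.0f;
    cudaMemcpy(&h, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h);  // 0 for the zero-filled input
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Whether a kernel like this ends up memory-bound or latency-bound is exactly the kind of question the Nsight tools mentioned above are used to answer.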
Nice to have:
Experience with tensor core programming, CUTLASS, and/or CuTe (a minimal tensor core sketch follows this list)
Experience with ML model architectures, in particular transformer-based
Experience with low latency or real time systems
Experience with lower levels of an accelerator software stack (e.g., drivers, runtimes, and compilers)
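For context on the tensor core item above: underneath libraries like CUTLASS and CuTe sits CUDA's WMMA API, the most direct way to drive tensor cores by hand. Below is a minimal, generic sketch of a single 16x16x16 half-precision tile multiply-accumulate (hypothetical names, not GM code; requires sm_70 or newer):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C = A * B + C.
// Fragments live in registers, distributed across the warp's lanes.
__global__ void wmma_tile_gemm(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);      // start from C = 0
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // tensor core MMA
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

int main() {
    // Compile with: nvcc -arch=sm_70 (or newer).
    half *a, *b;
    float *c;
    cudaMalloc(&a, 16 * 16 * sizeof(half));
    cudaMalloc(&b, 16 * 16 * sizeof(half));
    cudaMalloc(&c, 16 * 16 * sizeof(float));
    wmma_tile_gemm<<<1, 32>>>(a, b, c);  // one warp per tile
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

CUTLASS and CuTe generalize this same fragment/tile idea into fully tiled, software-pipelined GEMM kernels.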