Engineering Manager, GPU Kernel Job at Wayve (London)

Engineering Manager, Kernel Reliability

We're looking for a deeply technical, hands-on engineering leader for our on-fie...

Location

United States; Canada, Sunnyvale; Toronto

Salary:

Not provided

Cerebras Systems

Expiration Date

Until further notice

Requirements

6+ years in software engineering
3+ years leading teams in SW/HW reliability, debug, diagnostic, failure analysis or related fields
Expertise in parallel and distributed programming (message passing, multicore, GPU, embedded, etc.)
Expertise in debug and diagnostic tool development or expert usage (debuggers, core dump handling, code sanitizers, etc.)
Experience debugging distributed and parallel applications (deadlocks, livelocks, race conditions, etc.)
Deep understanding of computer architectures (instruction pipelining, multithreading, networking, etc.)
Strong background in monitoring and reliability engineering (incident response, post-mortem analysis, etc.)
Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products.

Job Responsibility

Provide hands-on technical leadership, owning the technical vision and roadmap for the kernel-centric reliability of our internal and customer-facing systems
Assist System and Cluster Operations teams in reducing system and service downtime after failures by providing tooling and manual intervention for failure analysis and diagnostics
Work with the Debug Team to enhance debug tools with the goal of speeding up failure analysis
Collaborate with SW teams to improve the software stack, including kernels, for better on-field debugging and failure analysis
Work with the ASIC and HW architecture teams to co-design next-generation architectures with reliability and ease of debug in mind
Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution.

What we offer

Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open source your cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
Simple, non-corporate work culture that respects individual beliefs.

New

Senior Engineering Manager, Data Plane Systems

Crusoe is seeking a Senior Engineering Manager, Data Plane Systems to lead the t...

Location

United States, San Francisco

Salary:

237600.00 - 288000.00 USD / Year

Crusoe

Expiration Date

Until further notice

Requirements

10+ years in high-performance networking or systems engineering, with 5-7+ years specifically managing senior/staff-level talent
Deep knowledge of Linux networking internals, kernel architecture, and experience with XDP/eBPF, AF_XDP, and DPDK
Hands-on experience with DPU integration and migrating networking functions to hardware accelerators
A strong understanding of low-latency networking, performance tuning, and benchmarking
The ability to resolve complex technical challenges in a fast-moving, execution-heavy environment

Job Responsibility

Define the roadmap for SDN data plane systems and lead the integration of DPUs (such as NVIDIA BlueField) and hardware accelerators
Oversee the development of Linux kernel networking components, XDP/eBPF data paths, and DPDK-based fast paths while driving the migration of networking functions to hardware offload architectures
Lead performance benchmarking, regression prevention, and incident response, ensuring operational excellence within 3-6 month execution cycles
Mentor and grow a team of senior and staff-level systems engineers, setting technical standards and fostering a high-performance culture of accountability
Partner closely with control-plane teams (OVN/OVS) to optimize throughput and latency for multi-tenant GPU clusters

What we offer

Competitive compensation
Restricted Stock Units
Paid time off & paid holidays
Comprehensive health, dental & vision insurance
Employer contributions to HSA account
Paid parental leave
Paid life insurance, short-term and long-term disability
Professional development & tuition reimbursement
Mental health & wellness support
Commuter benefits (parking & transit)

Fulltime

SLT Product Development Engineering Manager

We are a results-driven Central Engineering team delivering System Level Test (S...

Location

Singapore, Singapore

Salary:

Not provided

AMD

Expiration Date

Until further notice

Requirements

Bachelor's degree in Electrical and Electronic Engineering/Computer Engineering
Master's degree in Electrical and Electronic Engineering/Computer Engineering
Experience with CPU, GPU, system architectures
Strong people, communication & stakeholder management skills
Skills and experience in test and characterization, programming, and automation (including AI-assisted automation) preferred
Hands-on experience in a team management role
Technical and specialized knowledge in Test, Characterization, and Platform Engineering areas such as OS kernel, driver, BIOS firmware development, diagnostics, or system debug, with semiconductor industry experience in product development for computing/graphics SoCs
Demonstrated object-oriented programming experience in scripting and/or programming languages such as Python and Java

Job Responsibility

Lead efforts in developing System Level Test (SLT) solutions within the stipulated cost, quality, yield and organizational strategic constraints
Collaborate with Design, Validation, Platform Engineering, Diagnostic and Tools teams to determine SLT coverage, content, characterization requirements and root cause resolution for SLT device failures
Lead efforts in the New Product Introduction phase to validate new product features, test conditions, test methodologies, and test content
Drive quality & margin improvement in the Sustaining phase
Lead bounding box performance characterization efforts and innovation
Ensure product health and attainment of cost targets
Drive yield and quality improvement activities through data analysis, debug, root-cause and implementation into production
Drive unit-cost reductions through Test-Time reduction and content optimization methodologies
Lead efforts in defining and developing SLT equipment, software infrastructure, and test content within required product timelines and quality goals
Manage resources for agility, focus and execution efficiency to deliver outcomes/results aligned to organization goals

Fulltime

Engineering Manager - Inference

We are looking for an Inference Engineering Manager to lead our AI Inference tea...

Location

United States, San Francisco

Salary:

300000.00 - 385000.00 USD / Year

Perplexity

Expiration Date

Until further notice

Requirements

5+ years of engineering experience with 2+ years in a technical leadership or management role
Deep experience with ML systems and inference frameworks (PyTorch, TensorFlow, ONNX, TensorRT, vLLM)
Strong understanding of LLM architecture: Multi-Head Attention, Multi/Grouped-Query Attention, and common layers
Experience with inference optimizations: batching, quantization, kernel fusion, FlashAttention
Familiarity with GPU characteristics, roofline models, and performance analysis
Experience deploying reliable, distributed, real-time systems at scale
Track record of building and leading high-performing engineering teams
Experience with parallelism strategies: tensor parallelism, pipeline parallelism, expert parallelism
Strong technical communication and cross-functional collaboration skills

Job Responsibility

Lead and grow a high-performing team of AI inference engineers
Develop APIs for AI inference used by both internal and external customers
Architect and scale our inference infrastructure for reliability and efficiency
Benchmark and eliminate bottlenecks throughout our inference stack
Drive large sparse/MoE model inference at rack scale, including sharding strategies for massive models
Push the frontier by building inference systems that support sparse attention, disaggregated prefill/decode serving, and similar techniques
Improve the reliability and observability of our systems and lead incident response
Own technical decisions around batching, throughput, latency, and GPU utilization
Partner with ML research teams on model optimization and deployment
Recruit, mentor, and develop engineering talent

What we offer

Equity
Health
Dental
Vision
Retirement
Fitness
Commuter and dependent care accounts

Fulltime

New

Senior Software Systems Designer

We are looking for a highly skilled Senior System Software Designer to design an...

Location

India, Bangalore

Salary:

Not provided

AMD

Expiration Date

Until further notice

Requirements

Hands-on experience with performance profiling tools (e.g., AMD uProf, perf, VTune, rocProfiler)
Strong understanding of microarchitecture concepts (pipelines, caches, branch prediction, memory hierarchy)
Experience working with hardware performance counters (PMC), IBS, or similar sampling techniques
Familiarity with OS internals (Linux kernel, schedulers, memory management, tracing frameworks)
Experience with distributed/HPC workloads (MPI, OpenMP, large-scale systems)
Exposure to trace analysis, call stacks, and sampling-based profiling models
Knowledge of container environments and system-level debugging is a plus
Experience contributing to cross-platform tools and frameworks
Bachelor's or Master's degree in electrical or computer engineering

Job Responsibility

Design and develop system-level profiling tools spanning CPU, memory, IO, and power analysis
Build and optimize data collection frameworks leveraging hardware counters (PMC), IBS, and OS tracing
Develop low-overhead profiling infrastructure for large-scale and long-running workloads
Enhance performance analysis pipelines including data processing, correlation, and visualization
Enable cross-platform profiling support across Linux, Windows, and emerging OS ecosystems (e.g., FreeBSD)
Work on advanced analysis techniques such as top-down microarchitecture analysis, pipeline utilization, and bottleneck detection
Contribute to CLI and GUI-based tools for performance debugging and visualization
Integrate support for runtime and framework-level tracing (OpenMP, MPI, Java, Python, etc.)
Collaborate with CPU, GPU, kernel, and compiler teams to enable new hardware features in profiling tools
Drive automation and intelligent analysis, including AI/ML-assisted performance insights

Fulltime

New

Baseband AI technology-Co-op

Baseband Technology under Ericsson AI and technology foresight department leads ...

Location

Canada, Ottawa

Salary:

Not provided

Ericsson

Expiration Date

Until further notice

Requirements

Competence and experience in C/C++ and Python
A quick learner and self-driven person with an open mind is a must
MSc degree in Electrical engineering or Computer Science or equivalent
Excellent written and verbal communication, interpersonal, time management, and multitasking skills
Willingness to work with international, agile, multi-site teams, both collaboratively and autonomously
An enthusiastic demeanor, eager to continue growing and learning
Innovating, adapting, and responding to change
A strong can-do attitude
Any relevant experience with wireless mobile networks; understanding of 3GPP NR and O-RAN standard protocols
Experience with ML frameworks (TensorFlow, PyTorch, JAX) and cloud platforms, especially AWS (SageMaker, Lambda, Step Functions, S3, etc.)

Job Responsibility

Actively learn and practice agentic AI workflows in daily operations
Actively contribute to and support cloud infrastructure CI/CD pipelines, both on AWS and on-prem
If your interests lead you there, you are also very welcome to join our research team in scouting new algorithms, with a particular focus on neural networks and other deep learning algorithms, to enhance the implementation of NR/6G baseband functions running on Ericsson's purpose-built RAN compute as well as on COTS (commercial off-the-shelf) general compute platforms, i.e., x86/ARM with various accelerators such as GPUs (including GPU programming)

New

Sr Staff Engineer - Core Infrastructure

We are seeking a Senior Staff Engineer (L6) to lead the technical strategy and e...

Location

United States, New York; Seattle; San Francisco; Sunnyvale

Salary:

267000.00 - 297000.00 USD / Year

Uber

Expiration Date

Until further notice

Requirements

12+ years of software engineering experience, with a focus on massive-scale distributed systems or infrastructure
Proven Track Record at Scale: Experience managing infrastructure that supports millions of concurrent users or petabyte-scale data processing
Deep Systems Expertise: Mastery of Kubernetes internals, container runtimes, and the Linux kernel, with the ability to debug impossible performance bottlenecks
Cloud-Native Fluency: Deep experience with cloud-native networking (Envoy, CNI, Service Mesh) and multi-cloud (AWS/GCP) architecture
Coding Proficiency: Expert-level proficiency in Go, Java, or C++
Leadership: Demonstrated ability to lead 40+ person technical initiatives and influence VPs and GMs on infrastructure investment

Job Responsibility

Architect Strategic Efficiency: Own the technical vision to drive fleet-wide CPU utilization and unit-cost optimization through ARM adoption (targeting XM+ cores) and silicon diversity
Scale AI & ML Infrastructure: Define the architecture for shared GPU pools and high-performance clusters to support 300x larger ranking models and Autonomous Vehicle data ingestion
Modernize the Data Plane: Drive the convergence of Uber’s networking stack toward industry standards (Kubernetes, Envoy, CNI) while enhancing SkyEdge for active-active multi-cloud resilience
Enforce Foundations & Reliability: Lead the 100% Done-Done initiative, ensuring every service follows standardized safe-deployment (Starship) and reaches 100% zero-trust authorization
Agentic Augmentation: Integrate AI-driven Minions and AIOps into the infrastructure to automate 80% of alerts and unlock thousands of years of developer productivity
Cross-Org Influence: Partner with Delivery, Rides, and AV teams to ensure the infrastructure isn't just a container, but a competitive advantage that accelerates their time-to-market
Mentor Staff+ Engineers: Act as a force multiplier by coaching the next generation of technical leaders and influencing company-wide engineering standards

What we offer

Eligible to participate in Uber's bonus program
May be offered an equity award & other types of comp
Eligible to participate in a 401(k) plan
Various benefits

Fulltime

Virtual Software Modeling Engineer

Bring leading-edge SoCs to life by building and evolving the infrastructure that...

Location

United Kingdom, Cambridge

Salary:

Not provided

AMD

Expiration Date

Until further notice

Requirements

High-performance systems and application development in C/C++ on Windows and/or Linux
Hardware system architecture and subsystem interface protocols
x86, ARM, or GPU architecture, drivers, and applications
Linux and/or Windows kernel debugging
Functional modeling, architecture simulation, or hypervisor development
Experience with tools such as QEMU, VirtualBox, or SIMICS
Degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field preferred

Job Responsibility

Evolve the simulator’s core infrastructure, with a focus on scalability, maintainability, and developer productivity
Maintain and improve dependency management and build systems to increase reliability, reproducibility, and performance
Develop and enhance tooling for packaging, deployment, and consumption across multiple environments
Modernize the simulator codebase using current C++ standards and best practices to improve readability, structure, and long-term sustainability
Design and implement infrastructure to support simulation as a cloud-hosted service
Build infrastructure for distributed, multi-host simulation, including coordination, synchronization, and performance optimization
Create tools and frameworks to debug multi-threaded simulation execution effectively
Define processes and infrastructure to simplify integration, validation, and long-term maintenance of third-party and external models
Collaborate with model developers to ensure infrastructure evolves with modeling needs without tightly coupling to specific implementations
Improve simulator stability, observability, and debuggability through enhanced logging, diagnostics, and tooling
