Engineering Manager, Kernel Reliability

Cerebras Systems

Location:
United States; Canada, Sunnyvale

Contract Type:
Not provided

Salary:
Not provided

Job Description:

We're looking for a deeply technical, hands-on engineering leader for our on-field Kernel Reliability team. You will lead a high-performing team to tackle a critical challenge: improving the reliability of our advanced compute clusters and of the underlying inference, training, and internal production services. In this role, you'll set the technical vision while staying close to the code, designing solutions that scale with our exponentially growing system production and software service offerings.

Job Responsibility:

  • Provide hands-on technical leadership, owning the technical vision and roadmap for the kernel-centric reliability of our internal and customer-facing systems
  • Assist the System and Cluster Operations teams in reducing system and service downtime after failures by providing tooling and manual intervention for failure analysis and diagnostics
  • Work with the Debug Team to enhance debug tools with the goal of speeding up failure analysis
  • Collaborate with SW teams to enhance the software stack, including kernels, for better on-field debugging and failure analysis
  • Work with the ASIC and HW architecture teams to co-design next-generation architectures with reliability and ease of debugging in mind
  • Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution

Requirements:

  • 6+ years in software engineering
  • 3+ years leading teams in SW/HW reliability, debug, diagnostic, failure analysis or related fields
  • Expertise in parallel and distributed programming (message passing, multicore, GPU, embedded, etc.)
  • Expertise in debug and diagnostic tool development or expert usage (debuggers, core dump handling, code sanitizers, etc.)
  • Experience debugging distributed and parallel applications (deadlocks, livelocks, race conditions, etc.)
  • Deep understanding of computer architectures (instruction pipelining, multithreading, networking, etc.)
  • Strong background in monitoring and reliability engineering (incident response, post-mortem analysis, etc.)
  • Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products

What we offer:

  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026


Similar Jobs for Engineering Manager, Kernel Reliability

Associate Director of Embedded Software Engineering

Silvus is seeking an Associate Director of Embedded Software Engineering to join...
Location:
United States, Los Angeles
Salary:
200,000 - 250,000 USD / Year
Silvus Technologies (International)
Expiration Date:
Until further notice
Requirements:
  • Demonstrated experience leading a team of engineers with hands-on development
  • Bachelor of Science degree in Electrical Engineering, Computer Science, or relevant engineering fields
  • 8+ years of relevant embedded system software development experience
  • Strong expertise in C programming
  • Expertise in board support package and secure boot in AMD UltraScale+ MPSoC and/or Microchip Polarfire SoC based products
  • Linux kernel driver development expertise
  • Expertise in network configurations and programming
  • Must be a U.S. Citizen due to clients under U.S. government contracts
Job Responsibility:
  • Lead a team of engineers and be responsible for the team’s success on assigned projects
  • Work with the Director of Software Engineering and the rest of the engineering team to improve engineering processes, product quality, reliability, and performance
  • Develop device drivers and board support packages
  • Develop the software portion of MAC (Medium Access Control) and mobile ad-hoc networking routing protocols
  • Develop efficient wireless multicast protocols for mobile ad-hoc networking
  • Develop network management software and user interfaces
  • Develop audio streaming and push-to-talk voice applications
  • Perform system level design and implement security protocols and encryption algorithms on StreamCaster radios and other products
  • Support product security effort and regulatory compliance requirements such as NIST FIPS 140-3 and NIAP Common Criteria
  • Engage with and support customers as needed
Contract Type: Full-time

Principal Embedded Software Engineer

Silvus is seeking a full-time Principal Embedded Software Engineer to join our E...
Location:
United States, Irvine
Salary:
165,000 - 215,000 USD / Year
Silvus Technologies (International)
Expiration Date:
Until further notice
Requirements:
  • Bachelor of Science degree in Electrical Engineering, Computer Science, or relevant engineering fields
  • 8+ years of relevant embedded system software development experience
  • Expertise in C programming and experience in Linux kernel driver development
Job Responsibility:
  • Implementation of the software portion of MAC (Medium Access Control) and mobile ad-hoc networking routing protocols
  • Network management software and web interface implementation
  • Implementation of different security protocols and encryption algorithms
  • Audio streaming and push-to-talk voice application implementation
  • Analyzing and improving product security and robustness to meet certain regulatory requirements such as NIST FIPS 140-3 and NIAP Common Criteria
  • Implementation of testing software for product performance and reliability testing
  • Device driver and board support package development and maintenance for both ARM and RISC-V based systems
  • Linux system customization and scripting
Contract Type: Full-time

Senior Embedded Software Engineer

Silvus is recruiting a Senior Embedded Software Engineer reporting to the Direct...
Location:
United States, Los Angeles
Salary:
135,000 - 200,000 USD / Year
Silvus Technologies (International)
Expiration Date:
Until further notice
Requirements:
  • Bachelor of Science degree in Electrical Engineering, Computer Science, or related fields
  • Minimum 5 years of relevant embedded system software development experience
  • Expertise in C programming and experience in Linux kernel driver development
  • Must be a U.S. Citizen due to clients under U.S. government contracts
  • All employment is contingent upon the successful clearance of a background check
Job Responsibility:
  • Implementation of the software portion of MAC (Medium Access Control) and mobile ad-hoc networking routing protocols
  • Network management software and web interface implementation
  • Implementation of different security protocols and encryption algorithms
  • Audio streaming and push-to-talk voice application implementation
  • Analyze and improve product security and robustness to meet certain regulatory requirements such as NIST FIPS 140-3 and NIAP Common Criteria
  • Implementation of testing software for product performance and reliability testing
  • Device driver and board support package development and maintenance for both ARM and RISC-V based systems
  • Linux system customization and scripting
Contract Type: Full-time

Senior Embedded Software Engineer

Silvus is seeking a full-time Senior Embedded Software Engineer to join our Rese...
Location:
United States, Los Angeles
Salary:
140,000 - 200,000 USD / Year
Silvus Technologies (International)
Expiration Date:
Until further notice
Requirements:
  • Minimum Bachelor of Science degree in Electrical, Computer, or Communications Engineering, Computer Science, or relevant engineering fields
  • Minimum 5 years of relevant embedded system software development experience, or 3 years with an advanced STEM degree
  • Expertise in C programming and experience in Linux kernel driver development
Job Responsibility:
  • Implementation of the software portion of MAC (Medium Access Control) and mobile ad-hoc networking routing protocols
  • Network management software and web interface implementation
  • Implementation of different security protocols and encryption algorithms
  • Audio streaming and push-to-talk voice application implementation
  • Analyze and improve product security and robustness to meet certain regulatory requirements such as NIST FIPS 140-3 and NIAP Common Criteria
  • Implementation of testing software for product performance and reliability testing
  • Device driver and board support package development and maintenance for both ARM and RISC-V based systems
  • Linux system customization and scripting
Contract Type: Full-time

Kernel Driver Software Engineer

Etched is building the world’s first AI inference system purpose-built for trans...
Location:
United States, San Jose
Salary:
150,000 - 275,000 USD / Year
Etched
Expiration Date:
Until further notice
Requirements:
  • Proficiency in C/C++
  • Strong understanding of kernel-mode driver development and debugging
  • Deep understanding of operating system internals (Linux preferred)
  • Experience with hardware/software interfacing and device drivers
  • Experience with memory management and synchronization in kernel environments
  • Strong understanding of PCIe and other hardware interfaces
  • Experience with device virtualization technologies, including SR-IOV and VFIO
  • Strong understanding of kernel memory mapping, page table configuration, and IOMMU
  • Familiarity with hardware-software co-design principles
  • Proven ability to analyze complex technical problems and provide effective solutions
Job Responsibility:
  • Kernel-Mode Driver Development: Design, develop, and maintain kernel-mode drivers ensuring high reliability, informative debug, and optimal performance
  • Performance Optimization: Analyze and optimize driver performance for demanding AI workloads, focusing on minimizing latency and maximizing throughput
  • Hardware Integration and Co-Design: Collaborate closely with hardware engineers throughout the ASIC design process
  • Virtualization Support: Implement driver support for device virtualization technologies, including SR-IOV, VFIO, and para-virtualization
  • Memory Management: Implement efficient memory management strategies considering kernel memory mapping, page table configuration, NUMA awareness for device data caching, and IOMMU configuration
  • Security: Build kernel drivers fundamentally designed to support and maintain security across host processes, physical memory spaces, and device attestation
  • Debugging and Troubleshooting: Diagnose and resolve complex driver-related issues, using common kernel debugging tools and techniques (ftrace, dmesg, etc.) to identify and fix bugs
  • Synchronization and Concurrency: Design and implement synchronization mechanisms to handle concurrent access to multiple accelerators
  • System Validation and Testing: Develop and execute comprehensive test plans to validate driver functionality, stability, and performance in manufacturing and in general production environments
  • Collaboration and Troubleshooting: Collaborate with software and hardware teams to diagnose and resolve complex system-level issues
What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
Contract Type: Full-time

Software Engineer, Fleet Management

The Fleet team at OpenAI supports the computing environment that powers our cutt...
Location:
United States, San Francisco
Salary:
230,000 - 490,000 USD / Year
OpenAI
Expiration Date:
Until further notice
Requirements:
  • Strong software engineering skills with experience in large-scale infrastructure environments
  • Broad knowledge of cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud providers)
  • Deep expertise in server-level systems (e.g., systemd, containerization, Chef, Linux kernels, firmware management, host routing)
  • Passionate about optimizing the performance and reliability of large compute fleets
  • Thrive in dynamic environments and are eager to solve complex infrastructure challenges
  • Value automation, efficiency, and continuous improvement in everything you build
Job Responsibility:
  • Design and build systems to manage both cloud and bare-metal fleets at scale
  • Develop tools that integrate low-level hardware metrics with high-level job scheduling and cluster management algorithms
  • Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows
  • Automate infrastructure processes, reducing repetitive toil and improving system reliability
  • Collaborate with hardware, infrastructure, and research teams to ensure seamless integration across the stack
  • Continuously improve tools, automation, processes, and documentation to enhance operational efficiency
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Contract Type: Full-time

Senior Software Engineer Linux Kernel & Embedded Systems

Randstad Digital, the specialized division of Randstad Italia in the search, sel...
Location:
Italy, Pisa
Salary:
50,000 - 65,000 EUR / Year
Randstad
Expiration Date:
May 03, 2026
Requirements:
  • Master’s degree or above in Computer Science, Electrical Engineering, or related fields
  • Experience in operating system kernel design and development (Linux or equivalent)
  • Proficient in kernel subsystems such as scheduling, memory management, device drivers, and IPC mechanisms
  • Strong capability in performance analysis, real‐time optimization, and troubleshooting complex system issues
  • Familiarity with automotive functional safety standards (ISO 26262, ASIL levels)
Job Responsibility:
  • Develop safe and efficient kernel modules to improve application performance and reliability
  • Develop a hypervisor on ARM/RISC-V to achieve cross-domain virtualization and partition isolation
  • Perform in-depth analysis and optimization across kernel scheduling, memory, bandwidth, and inter-core communication

Senior Systems Engineer

We are looking for a versatile and driven Senior Systems Engineer to join our En...
Location:
United States, Chicago
Salary:
130,000 USD / Year
AKUNA CAPITAL
Expiration Date:
Until further notice
Requirements:
  • Degree in Computer Science, Information Systems, or a related field
  • 5-7 years of systems engineering experience
  • Advanced Linux knowledge including kernel bypass, kernel tuning, and customizing kernels
  • Deep understanding of virtualization and containerization technologies
  • Extensive experience with a variety of Linux distributions (RedHat, Ubuntu, etc.)
  • Deep understanding of system monitoring and configuration management tools (Ansible, Foreman, Prometheus and Icinga/Nagios)
  • Proficiency in scripting languages such as Python and Bash, and in automation and orchestration tools
  • Expertise in troubleshooting multicast and TCP-related performance issues
  • Experience automating daily software- and hardware-related tasks
  • Demonstrated ability to lead large technical projects
Job Responsibility:
  • Analyze complex technical problems and collaborate on designing solutions for Akuna’s global Infrastructure platform
  • Drive projects and solutions to completion in a fast-paced environment
  • Design, develop and maintain orchestration and configuration solutions
  • Collaborate with developers and other infrastructure engineers to research new products and techniques that drive innovation and improve efficiency and performance in the environment
  • Architect and maintain multi-vendor, tier-based storage solutions
  • Build out a test automation framework for systems performance testing and tuning
  • Create and institute process enforcement across environments
  • Create tools that help teams optimize their use of the available infrastructure
  • Develop and maintain comprehensive technical documentation, including system configurations, procedures, and troubleshooting guides
  • Lead knowledge transfer sessions and mentor team members to ensure continuity and operational excellence
What we offer:
  • Discretionary performance bonus
  • Comprehensive benefits package that may encompass employer-paid medical, dental, vision, retirement contributions, paid time off, and other benefits
Contract Type: Full-time