Engineering Manager, Kernel Reliability

Cerebras Systems

Location:
United States; Canada, Sunnyvale

Contract Type:
Not provided

Salary:
Not provided

Job Description:

We're looking for a deeply technical, hands-on engineering leader for our on-field Kernel Reliability team. You will lead a high-performing team to tackle a critical challenge: improving the reliability of our advanced compute clusters and the underlying inference, training, and internal production services. In this role, you'll set the technical vision while staying close to the code and designing solutions that scale with our exponentially growing system production and software service offerings.

Job Responsibility:

  • Provide hands-on technical leadership, owning the technical vision and roadmap for the kernel-centric reliability of our internal and customer-facing systems
  • Assist the System and Cluster Operations teams in reducing system and service downtime after failures by providing tooling and manual intervention for failure analysis and diagnostics
  • Work with the Debug Team to enhance debug tools with the goal of speeding up failure analysis
  • Collaborate with SW teams to improve the software stack, including Kernels, to improve on-field debugging and failure analysis
  • Work with the ASIC and HW architecture teams to co-design next-generation architectures with reliability and ease of debug in mind
  • Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution.

Requirements:

  • 6+ years in software engineering
  • 3+ years leading teams in SW/HW reliability, debug, diagnostics, failure analysis, or related fields
  • Expertise in parallel and distributed programming (message passing, multicore, GPU, embedded, etc.)
  • Expertise in debug and diagnostic tool development or expert usage (debuggers, core dump handling, code sanitizers, etc.)
  • Experience debugging distributed and parallel applications (deadlocks, livelocks, race conditions, etc.)
  • Deep understanding of computer architectures (instruction pipelining, multithreading, networking, etc.)
  • Strong background in monitoring and reliability engineering (incident response, post-mortem analysis, etc.)
  • Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products.

What we offer:

  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Simple, non-corporate work culture that respects individual beliefs.

Additional Information:

Job Posted:
February 17, 2026

Similar Jobs for Engineering Manager, Kernel Reliability

Associate Director of Embedded Software Engineering

Silvus is seeking an Associate Director of Embedded Software Engineering to join...
Location:
United States, Los Angeles
Salary:
200000.00 - 250000.00 USD / Year
Silvus Technologies (International)
Expiration Date
Until further notice
Requirements:
  • Demonstrated experience leading a team of engineers with hands-on development
  • Bachelor of Science degree in Electrical Engineering, Computer Science, or relevant engineering fields
  • 8+ years of relevant embedded system software development experience
  • Strong expertise in C programming
  • Expertise in board support package and secure boot in AMD UltraScale+ MPSoC and/or Microchip Polarfire SoC based products
  • Linux kernel driver development expertise
  • Expertise in network configurations and programming
  • Must be a U.S. Citizen due to clients under U.S. government contracts
Job Responsibility:
  • Lead a team of engineers and be responsible for the team’s success on assigned projects
  • Work with the Director of Software Engineering and the rest of the engineering team to improve engineering processes, product quality, reliability, and performance
  • Develop device drivers and board support packages
  • Develop the software portion of MAC (Medium Access Control) and mobile ad-hoc networking routing protocols
  • Develop efficient wireless multicast protocols for mobile ad-hoc networking
  • Develop network management software and user interfaces
  • Develop audio streaming and push-to-talk voice applications
  • Perform system level design and implement security protocols and encryption algorithms on StreamCaster radios and other products
  • Support product security effort and regulatory compliance requirements such as NIST FIPS 140-3 and NIAP Common Criteria
  • Engage with and support customers as needed
  • Fulltime

Principal Embedded Software Engineer

Silvus is seeking a full-time Principal Embedded Software Engineer to join our E...
Location:
United States, Irvine
Salary:
165000.00 - 215000.00 USD / Year
Silvus Technologies (International)
Expiration Date
Until further notice
Requirements:
  • Bachelor of Science degree in Electrical Engineering, Computer Science, or relevant engineering fields
  • 8+ years of relevant embedded system software development experience
  • Expertise in C programming and experience in Linux kernel driver development
Job Responsibility:
  • Implementation of the software portion of MAC (Medium Access Control) and mobile ad-hoc networking routing protocols
  • Network management software and web interface implementation
  • Implementation of different security protocols and encryption algorithms
  • Audio streaming and push-to-talk voice application implementation
  • Analyzing and improving product security and robustness to meet certain regulatory requirements such as NIST FIPS 140-3 and NIAP Common Criteria
  • Implementation of testing software for product performance and reliability testing
  • Device driver and board support package development and maintenance for both ARM and RISC-V based systems
  • Linux system customization and scripting
  • Fulltime

Senior Embedded Software Engineer

Silvus is recruiting a Senior Embedded Software Engineer reporting to the Direct...
Location:
United States, Los Angeles
Salary:
135000.00 - 200000.00 USD / Year
Silvus Technologies (International)
Expiration Date
Until further notice
Requirements:
  • Bachelor of Science degree in Electrical Engineering, Computer Science, or related fields
  • Minimum 5 years of relevant embedded system software development experience
  • Expertise in C programming and experience in Linux kernel driver development
  • Must be a U.S. Citizen due to clients under U.S. government contracts
  • All employment is contingent upon the successful clearance of a background check
Job Responsibility:
  • Implementation of software portion of MAC (Medium Access Control) and mobile ad-hoc networking routing protocols
  • Network management software and web interface implementation
  • Implementation of different security protocols and encryption algorithms
  • Audio streaming and push-to-talk voice application implementation
  • Analyze and improve product security and robustness to meet certain regulatory requirements such as NIST FIPS 140-3 and NIAP Common Criteria
  • Implementation of testing software for product performance and reliability testing
  • Device driver and board support package development and maintenance for both ARM and RISC-V based systems
  • Linux system customization and scripting
  • Fulltime

Senior Embedded Software Engineer

Silvus is seeking a full-time Senior Embedded Software Engineer to join our Rese...
Location:
United States, Los Angeles
Salary:
140000.00 - 200000.00 USD / Year
Silvus Technologies (International)
Expiration Date
Until further notice
Requirements:
  • Minimum Bachelor of Science degree in Electrical, Computer, or Communications Engineering, Computer Science, or relevant engineering fields
  • Minimum 5 years of relevant embedded system software development experience
  • 3 years of relevant embedded system software development experience with an advanced STEM degree
  • Expertise in C programming and experience in Linux kernel driver development
Job Responsibility:
  • Implementation of software portion of MAC (Medium Access Control) and mobile ad-hoc networking routing protocols
  • Network management software and web interface implementation
  • Implementation of different security protocols and encryption algorithms
  • Audio streaming and push-to-talk voice application implementation
  • Analyze and improve product security and robustness to meet certain regulatory requirements such as NIST FIPS 140-3 and NIAP Common Criteria
  • Implementation of testing software for product performance and reliability testing
  • Device driver and board support package development and maintenance for both ARM and RISC-V based systems
  • Linux system customization and scripting
  • Fulltime

Kernel Driver Software Engineer

Etched is building the world’s first AI inference system purpose-built for trans...
Location:
United States, San Jose
Salary:
150000.00 - 275000.00 USD / Year
Etched
Expiration Date
Until further notice
Requirements:
  • Proficiency in C/C++
  • Strong understanding of kernel-mode driver development and debugging
  • Deep understanding of operating system internals (Linux preferred)
  • Experience with hardware/software interfacing and device drivers
  • Experience with memory management and synchronization in kernel environments
  • Strong understanding of PCIe and other hardware interfaces
  • Experience with device virtualization technologies, including SR-IOV and VFIO
  • Strong understanding of kernel memory mapping, page table configuration, and IOMMU
  • Familiarity with hardware-software co-design principles
  • Proven ability to analyze complex technical problems and provide effective solutions
Job Responsibility:
  • Kernel-Mode Driver Development: Design, develop, and maintain kernel-mode drivers ensuring high reliability, informative debug, and optimal performance
  • Performance Optimization: Analyze and optimize driver performance for demanding AI workloads, focusing on minimizing latency and maximizing throughput
  • Hardware Integration and Co-Design: Collaborate closely with hardware engineers throughout the ASIC design process
  • Virtualization Support: Implement driver support for device virtualization technologies, including SR-IOV, VFIO, and para-virtualization
  • Memory Management: Implement efficient memory management strategies considering kernel memory mapping, page tables configuration, NUMA awareness for device data caching, and IOMMU configuration
  • Security: Build kernel drivers fundamentally designed to support and maintain security across host processes, physical memory spaces, and device attestation
  • Debugging and Troubleshooting: Diagnose and resolve complex driver-related issues, using common kernel debugging tools and techniques (ftrace, dmesg, etc.) to identify and fix bugs
  • Synchronization and Concurrency: Design and implement synchronization mechanisms to handle concurrent access to multiple accelerators
  • System Validation and Testing: Develop and execute comprehensive test plans to validate driver functionality, stability, and performance in manufacturing and in general production environments
  • Collaboration and Troubleshooting: Collaborate with software and hardware teams to diagnose and resolve complex system-level issues
What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
  • Fulltime

Agentic AI Engineer

We are looking for highly skilled Agentic Engineers with strong hands-on experie...
Location:
India, Chennai City Corporation
Salary:
Not provided
OptiSol Business Solutions
Expiration Date
Until further notice
Requirements:
  • 1–2 years of hands-on experience in Generative AI, AI Engineering, or Agentic AI systems
  • Strong Python programming skills with practical backend engineering experience
  • Deep practical experience building with: Large Language Models (LLMs), AI Agents, Context Engineering, Workflow Orchestration Systems, Multi-Agent Architectures
  • Strong understanding of: Prompt Design, Context Management, Tool Calling, Memory Systems, Guardrails, Evaluations & Reliability Engineering
  • Experience with frameworks such as LangGraph, CrewAI, AutoGen, Semantic Kernel, or custom orchestration systems
  • Hands-on experience with SQL/NoSQL databases, vector databases, RAG systems, observability platforms, and AI evaluation frameworks
  • Strong product mindset with focus on measurable user value and business outcomes
  • Proven ability to ship quickly, iterate fast, and improve workflows based on real-world usage
  • Track record of improving reliability, reducing latency/cost, or increasing automation efficiency
  • Experience working with coding assistants such as GitHub Copilot, Cursor, Claude Code, and Azure Copilot Studio
Job Responsibility:
  • Build and maintain production-grade AI agent orchestration systems
  • Design intelligent workflows involving tool calling, memory systems, reasoning, and multi-agent collaboration
  • Develop robust prompt pipelines with structured outputs, retries, fallbacks, and evaluation systems
  • Manage orchestration stacks across multiple LLM providers/models with routing and optimization strategies
  • Implement governed AI agents with enterprise-grade guardrails and secure execution patterns
  • Build and optimize Retrieval-Augmented Generation (RAG) pipelines and context-aware AI systems
  • Implement observability, monitoring, tracing, and failure-handling mechanisms for AI workflows
  • Improve AI reliability through prompt engineering, structured outputs, evaluation systems, and workflow optimization
  • Optimize latency, token usage, infrastructure cost, and automation efficiency across AI systems
  • Build scalable and maintainable AI workflow architectures for production environments
What we offer:
  • Opportunity to work on cutting-edge AI agent orchestration and Generative AI platforms
  • Hands-on exposure to LLMs, RAG systems, multi-agent architectures, and enterprise AI workflows
  • Collaborative and innovation-driven engineering culture
  • Exposure to real-world AI transformation and automation projects
  • Fast-paced learning environment with strong career growth opportunities
  • Competitive compensation and performance-driven growth path
  • Fulltime

Software Engineer, Fleet Management

The Fleet team at OpenAI supports the computing environment that powers our cutt...
Location:
United States, San Francisco
Salary:
230000.00 - 490000.00 USD / Year
OpenAI
Expiration Date
Until further notice
Requirements:
  • Strong software engineering skills with experience in large-scale infrastructure environments
  • Broad knowledge of cluster-level systems (e.g., Kubernetes, CI/CD pipelines, Terraform, cloud providers)
  • Deep expertise in server-level systems (e.g., systems, containerization, Chef, Linux kernels, firmware management, host routing)
  • Passionate about optimizing the performance and reliability of large compute fleets
  • Thrive in dynamic environments and are eager to solve complex infrastructure challenges
  • Value automation, efficiency, and continuous improvement in everything you build
Job Responsibility:
  • Design and build systems to manage both cloud and bare-metal fleets at scale
  • Develop tools that integrate low-level hardware metrics with high-level job scheduling and cluster management algorithms
  • Leverage LLMs to coordinate vendor operations and optimize infrastructure workflows
  • Automate infrastructure processes, reducing repetitive toil and improving system reliability
  • Collaborate with hardware, infrastructure, and research teams to ensure seamless integration across the stack
  • Continuously improve tools, automation, processes, and documentation to enhance operational efficiency
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
  • Fulltime

Software Engineer (Rust)

At hyperexponential, we’re building the AI-powered platform that enables the wor...
Location:
United Kingdom, London
Salary:
Not provided
hyperexponential
Expiration Date
Until further notice
Requirements:
  • Built and shipped backend systems in a polyglot production environment, working across languages such as Rust, Python, or C/C++, and choosing the right tool for the problem
  • Either used Rust in production systems, or demonstrated strong systems experience in C/C++ or Python with clear evidence of learning and adopting new languages quickly
  • Delivered backend features where correctness, performance, and resource efficiency were critical, and improved systems based on real production behaviour
  • Applied solid systems thinking around execution, concurrency, and memory or resource management in real-world backend services
  • Designed and implemented effective testing and observability that made complex backend behaviour understandable, debuggable, and safe to evolve
  • Owned work end-to-end from design through rollout, monitoring, and iteration, collaborating closely with peers and raising risks early
Job Responsibility:
  • Designing and delivering backend features in the Kernel execution engine that improve correctness, performance, and reliability for customer model runs at scale
  • Strengthening a clean, modular Kernel architecture that is easy to understand, test, and safely evolve over time
  • Building automated tests and observability for complex execution scenarios, enabling faster detection, diagnosis, and resolution of production issues
  • Partnering with adjacent engineering teams to shape Kernel behaviour and interfaces, providing clear technical input and following through to delivery
  • Owning work from design through rollout and monitoring, balancing immediate product needs with long-term system sustainability
  • Demonstrating hx values by proactively identifying risks, collaborating constructively, and taking full accountability in a critical core-platform area
What we offer:
  • £5,000 training and conference budget for individual and group development
  • 25 days of holiday plus 8 bank holidays (33 days total)
  • Company pension scheme via Penfold
  • Mental health support and therapy via Spectrum.life
  • Individual wellbeing allowance via Juno
  • Private healthcare insurance through AXA
  • Income protection and Life Insurance
  • Cycle to Work Scheme
  • Top-spec equipment (laptop, screens, adjustable desks, etc.)
  • Regular remote and in-person hackathons, lunch and learns, socials, and game nights
  • Fulltime