Inference Compiler Engineer

Cerebras Systems

Location:
United Arab Emirates, Dubai

Contract Type:
Not provided

Salary:
Not provided

Job Description:

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach lets Cerebras deliver industry-leading training and inference speeds, and empowers machine learning users to effortlessly run large-scale ML applications without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of compute, transforming key workloads with ultra-high-speed inference.

Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

Would you like to help create the fastest generative-model inference in the world? Join the Cerebras Inference Team to help develop the unique combination of software and hardware that delivers the best inference characteristics on the market while running the largest models available. The Cerebras wafer-scale inference platform runs generative models at unprecedented speed thanks to a unique hardware architecture that provides the fastest access to local memory, an ultra-fast interconnect, and an enormous amount of available compute. You will be part of the team that works with the latest open and closed generative AI models, optimizing them for the Cerebras inference platform. Your responsibilities will include working on the model representation, optimization, and compilation stack to produce the best results on Cerebras' current and future platforms.

Job Responsibility:

  • Analysis of new models from the generative AI field and their impact on the compilation stack
  • Implementation of compiler and frontend features to support new models, improve inference characteristics, and improve the Cerebras user experience
  • Collaboration with other teams throughout feature implementation
  • Research into new methods of model optimization to improve Cerebras inference

Requirements:

  • Degree in Engineering, Computer Science, or equivalent experience and evidence of exceptional ability
  • Strong experience working with Python and C++
  • Experience working with PyTorch and the HuggingFace Transformers library
  • Knowledge of and experience working with Large Language Models (understanding Transformer architecture variations, the generation cycle, etc.; a minimal sketch of the generation cycle follows this list)
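
For illustration, here is a minimal sketch of the generation cycle the last requirement refers to: greedy autoregressive decoding with a KV cache, in plain PyTorch with the HuggingFace Transformers API. The gpt2 checkpoint and prompt are placeholders, not Cerebras code.

    # Minimal sketch of the autoregressive generation cycle: greedy decoding
    # with a KV cache. "gpt2" is a placeholder model, not Cerebras code.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("Wafer-scale inference is", return_tensors="pt").input_ids
    past = None  # KV cache: keys/values of past tokens, reused every step

    with torch.no_grad():
        for _ in range(16):
            # After the first step, only the newest token is fed in; the cache
            # supplies attention keys/values for all earlier tokens.
            out = model(ids if past is None else ids[:, -1:],
                        past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy
            ids = torch.cat([ids, next_id], dim=-1)

    print(tok.decode(ids[0]))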

Nice to have:

  • Knowledge of an MLIR-based compilation stack is a plus

What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026


Similar Jobs for Inference Compiler Engineer

Python / PyTorch Developer — Frontend Inference Compiler

Would you like to participate in creating the fastest Generative Models inferenc...
Location:
United Arab Emirates, Dubai
Salary:
Not provided
Cerebras Systems
Expiration Date:
Until further notice
Requirements:
  • Degree in Engineering, Computer Science, or equivalent experience and evidence of exceptional ability
  • Strong Python programming skills and in-depth experience with PyTorch internals (e.g., TorchScript, FX, or Dynamo)
  • Solid understanding of computational graphs, tensor operations, and model tracing
  • Experience building or extending compilers, interpreters, or ML graph optimization frameworks
  • Experience working with PyTorch and the HuggingFace Transformers library
  • Knowledge of and experience working with Large Language Models (understanding Transformer architecture variations, the generation cycle, etc.)
  • Strong C++ programming skills
  • Knowledge of an MLIR-based compilation stack
Job Responsibility:
  • Analysis of new models from the generative AI field and their impact on the compilation stack
  • Develop and maintain a model-definition framework of building blocks that represent large language models in PyTorch and Cerebras dialects, ready to deploy on Cerebras hardware
  • Develop and maintain the frontend compiler infrastructure that ingests PyTorch models and produces an intermediate representation (IR)
  • Extend and optimize PyTorch FX / TorchScript / TorchDynamo-based tooling for graph capture, transformation, and analysis (a toy FX sketch follows this list)
  • Collaboration with other teams throughout feature implementation
  • Research into new methods of model optimization to improve Cerebras inference
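
As a rough illustration of the graph-capture workflow named above, the sketch below traces a toy module with torch.fx and applies a trivial rewrite pass. The specific rewrite (swapping relu for gelu) is invented for the example and says nothing about Cerebras' actual passes.

    # Toy FX pipeline: capture a module as a graph IR, mutate it, recompile.
    # The relu -> gelu rewrite is purely illustrative.
    import torch
    import torch.fx as fx

    class Tiny(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.lin = torch.nn.Linear(8, 8)

        def forward(self, x):
            return torch.relu(self.lin(x))

    gm = fx.symbolic_trace(Tiny())  # capture: nn.Module -> fx.GraphModule

    for node in gm.graph.nodes:     # transform: walk the captured IR
        if node.op == "call_function" and node.target is torch.relu:
            node.target = torch.nn.functional.gelu
    gm.recompile()                  # regenerate forward() from the mutated graph

    print(gm.graph)                 # human-readable IR dump
    print(gm(torch.randn(2, 8)).shape)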
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

LLM Inference Performance & Evals Engineer

Join the inference model team dedicated to bring up the state-of-the-art models,...
Location:
Canada, Toronto
Salary:
Not provided
Cerebras Systems
Expiration Date:
Until further notice
Requirements:
  • 3+ years building high-performance ML or systems software
  • Solid grounding in Transformer math (attention scaling, KV cache, quantization), or clear evidence that you learn this material rapidly (see the sketch after this list)
  • Comfort navigating the full AI toolchain: Python modeling code, compiler IRs, performance profiling, etc.
  • Strong debugging skills across performance, numerical accuracy, and runtime integration
  • Prior experience in modeling, compilers, or crafting benchmarks or performance studies, not just black-box QA tests
  • Strong passion to leverage AI agents or workflow orchestration tools to boost personal productivity
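
As a quick illustration of the Transformer math mentioned above, here is single-head scaled dot-product attention with an incrementally grown KV cache, in plain PyTorch. Dimensions and weights are arbitrary; this is the math, not a real model.

    # Single-head decode-time attention with a growing KV cache.
    import math
    import torch

    d = 64
    W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
    K_cache = torch.empty(0, d)  # one row appended per decoded token
    V_cache = torch.empty(0, d)

    def decode_step(x):  # x: (d,) hidden state of the newest token
        global K_cache, V_cache
        q = x @ W_q
        K_cache = torch.cat([K_cache, (x @ W_k).unsqueeze(0)])
        V_cache = torch.cat([V_cache, (x @ W_v).unsqueeze(0)])
        scores = K_cache @ q / math.sqrt(d)  # the 1/sqrt(d) attention scaling
        return torch.softmax(scores, dim=0) @ V_cache

    for _ in range(5):
        out = decode_step(torch.randn(d))
    print(out.shape, K_cache.shape)  # torch.Size([64]) torch.Size([5, 64])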
Job Responsibility:
  • Prototype and benchmark cutting-edge ideas: new attentions, MoE, speculative decoding, and many more innovations as they emerge
  • Develop agent-driven automation that designs experiments, schedules runs, triages regressions, and drafts pull-requests
  • Work closely with compiler, runtime, and silicon teams: unique opportunity to experience the full stack of software/hardware innovation
  • Keep pace with the latest open- and closed-source models; run them first on wafer scale to expose new optimization opportunities
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

Performance Engineer - Inference

Engineers on the inference performance team operate at the intersection of hardw...
Location:
Canada, Toronto
Salary:
Not provided
Cerebras Systems
Expiration Date:
Until further notice
Requirements:
  • Bachelor's / Master's / PhD in Electrical Engineering or Computer Science
  • Strong background in computer architecture
  • Exposure to and understanding of low-level deep learning / LLM math
  • Strong analytical and problem-solving mindset
  • 3+ years of experience in a relevant domain (Computer Architecture, CPU/GPU Performance, Kernel Optimization, HPC)
  • Experience working on CPU/GPU simulators
  • Exposure to performance profiling and debugging on any system pipeline
  • Comfort with C++ and Python
Job Responsibility:
  • Build performance models (kernel-level, end-to-end) to estimate the performance of state-of-the-art and customer ML models (a toy roofline sketch follows this list)
  • Optimize and debug our kernel microcode and compiler algorithms to elevate ML model inference speed, throughput, and compute utilization on the Cerebras WSE
  • Debug and understand runtime performance on the system and cluster
  • Develop tools and infrastructure to help visualize performance data collected from the Wafer Scale Engine and our compute cluster
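
For a sense of what a kernel-level performance model looks like in its simplest form, here is a toy roofline estimate for a GEMM. The peak-FLOPs and bandwidth numbers are placeholders, not WSE specifications.

    # Toy roofline model for a GEMM kernel. Hardware numbers are placeholders,
    # not Cerebras WSE specifications.
    PEAK_FLOPS = 100e12  # assumed peak compute, FLOP/s
    PEAK_BW = 1e12       # assumed memory bandwidth, bytes/s

    def gemm_time(m, n, k, bytes_per_elem=2):
        flops = 2 * m * n * k                          # multiply-accumulates
        traffic = bytes_per_elem * (m*k + k*n + m*n)   # ideal: touch each operand once
        t_compute = flops / PEAK_FLOPS
        t_memory = traffic / PEAK_BW
        return max(t_compute, t_memory)                # bound by the slower resource

    # Decode-time GEMV (m=1) is bandwidth-bound; a large batch is compute-bound.
    print(f"m=1:    {gemm_time(1, 8192, 8192) * 1e6:.1f} us")
    print(f"m=4096: {gemm_time(4096, 8192, 8192) * 1e3:.2f} ms")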
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

Software Engineer, Systems ML - Compilers / Backend

We are seeking a software engineer to support the development of the compiler to...
Location:
United States, Sunnyvale
Salary:
217000.00 USD / Year
Meta
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 2+ years of experience developing compilers, toolchains, runtimes, or similar code-optimization software
  • Experience in software design and programming experience in Python and/or C/C++ for development, debugging, testing and performance analysis
  • Experience in AI framework development or accelerating models on hardware architectures (GPU, TPU, custom AI ASICs)
Job Responsibility:
  • Analyze and design effective compiler passes and optimizations (a toy pass is sketched after this list). Implement and/or enhance code generation targeting machine learning accelerators
  • Work with algorithm research teams to map ML graphs to hardware implementations, model data-flows, create cost-benefit analysis and estimate silicon power and performance
  • Work with hardware architects to co-design hardware features that maximize performance, power efficiency and programmability
  • Contribute to the development of machine-learning libraries, intermediate representations, export formats, and analysis tools
  • Analyze and improve the efficiency, scalability, and stability of our toolchains. Optimize and tune kernels and compiled code to achieve latency targets for ML inference
  • Conduct design and code reviews. Evaluate code performance, debug, diagnose and drive resolution of compiler and cross-disciplinary system issues
  • Interface with other compiler-focused teams to evaluate and incorporate their innovations and vice versa
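
To make the compiler-pass bullet concrete, below is dead-code elimination over a toy SSA-style op list. It is a generic textbook pass for illustration, not Meta's actual stack.

    # Dead-code elimination over a toy SSA-style op list.
    from dataclasses import dataclass

    @dataclass
    class Op:
        name: str    # SSA result name
        kind: str    # opcode
        args: tuple  # names of inputs

    def dce(ops, outputs):
        live = set(outputs)
        kept = []
        for op in reversed(ops):  # sweep backwards from the graph outputs
            if op.name in live:
                kept.append(op)
                live.update(op.args)
        return list(reversed(kept))

    prog = [
        Op("a", "load", ()),
        Op("b", "load", ()),
        Op("c", "mul", ("a", "b")),
        Op("d", "add", ("a", "a")),  # dead: never reaches an output
        Op("e", "relu", ("c",)),
    ]
    print([op.name for op in dce(prog, outputs={"e"})])  # ['a', 'b', 'c', 'e']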
What we offer:
  • Bonus
  • Equity
  • Benefits

Research Engineer, Scaling

As a Research Engineer, Scaling, you will design and build infrastructure to sup...
Location:
United States, Palo Alto
Salary:
180000.00 - 300000.00 USD / Year
1X Technologies
Expiration Date:
Until further notice
Requirements:
  • Strong programming experience in Python and/or C++
  • Deep intuitive understanding of what affects training or inference speed: from bottlenecks to scaling laws
  • A mindset aligned with extremely high scaling: belief that scale is foundational to enabling humanoid robotics
  • Degree in Computer Science or a related field
  • Hands‑on experience with distributed training frameworks (e.g., TorchTitan, DeepSpeed, FSDP/ZeRO), multi‑node debugging, experiment management
  • Proven skills optimizing inference performance: graph compilers, batching/scheduling, serving systems (e.g., using TensorRT or equivalents)
  • Familiarity with quantization strategies (PTQ, QAT, INT8/FP8) and tools like TensorRT, bitsandbytes, etc.
  • Experience writing or tuning CUDA or Triton kernels; understanding of hardware features like vectorization, tensor cores, and memory hierarchies
Job Responsibility:
  • Own and lead scaling of both distributed training and inference systems
  • Ensure compute resources are sufficient so that data, not hardware, is the limiter
  • Enable massive training at scale (1000+ GPUs) on robot data, handling fault tolerance, experiment tracking, distributed operations, and large datasets
  • Optimize inference throughput in datacenter contexts (e.g., for world models and diffusion engines)
  • Reduce latency and optimize performance for on‑device robot policies through techniques like quantization, scheduling, distillation, etc.
What we offer:
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
  • Fulltime

AI Research Engineer, Scaling

As a Research Engineer focused on Scaling, you will design and build robust infr...
Location:
United States, Palo Alto
Salary:
180000.00 - 300000.00 USD / Year
1X Technologies
Expiration Date:
Until further notice
Requirements:
  • Strong programming experience in Python and/or C++
  • Deep intuitive understanding of training and inference speed bottlenecks and scaling laws
  • A mindset aligned with extremely high scaling: belief that scale is foundational to enabling humanoid robotics
  • Degree in Computer Science or a related field
  • Experience with distributed training frameworks (e.g., TorchTitan, DeepSpeed, FSDP/ZeRO), multi-node debugging, and experiment management
  • Proven skills in optimizing inference performance using graph compilers, batching/scheduling, and serving systems like TensorRT or equivalents
  • Familiarity with quantization strategies (PTQ, QAT, INT8/FP8) and tools such as TensorRT and bitsandbytes (a minimal PTQ sketch follows this list)
  • Experience developing or tuning CUDA or Triton kernels with understanding of hardware-level optimization (vectorization, tensor cores, memory hierarchies)
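
As a minimal illustration of the PTQ strategy named above, here is symmetric per-tensor INT8 quantization in plain PyTorch; real toolchains such as TensorRT and bitsandbytes add calibration data, per-channel scales, and fused kernels.

    # Minimal PTQ sketch: symmetric per-tensor INT8 with a single scale.
    import torch

    def quantize_int8(w):
        scale = w.abs().max() / 127.0  # symmetric: zero-point is 0
        q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
        return q, scale

    def dequantize(q, scale):
        return q.float() * scale

    w = torch.randn(256, 256)
    q, s = quantize_int8(w)
    print(f"max abs error: {(dequantize(q, s) - w).abs().max():.5f}")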
Job Responsibility:
  • Own and lead scaling of distributed training and inference systems
  • Ensure compute resources are optimized to make data the primary constraint
  • Enable massive training runs (1000+ GPUs) using robot data, with robust fault tolerance, experiment tracking, and distributed operations
  • Optimize inference throughput for datacenter use cases such as world models and diffusion engines
  • Reduce latency and enhance performance for on-device robot policies using techniques such as quantization, scheduling, and distillation
What we offer:
  • Equity
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
  • Fulltime

Senior Research Engineer

We are seeking a highly skilled Senior Research Engineer to collaborate closely ...
Location:
United States
Salary:
210000.00 - 309000.00 USD / Year
Assembly
Expiration Date:
Until further notice
Requirements:
  • Strong expertise in the Python ecosystem and major ML frameworks (PyTorch, JAX)
  • Experience with lower-level programming (C++ or Rust preferred)
  • Deep understanding of GPU acceleration (CUDA, profiling, kernel-level optimization)
  • TPU experience is a strong plus
  • Proven ability to accelerate deep learning workloads using compiler frameworks, graph optimizations, and parallelization strategies
  • Solid understanding of the deep learning lifecycle: model design, large-scale training, data processing pipelines, and inference deployment
  • Strong debugging, profiling, and optimization skills in large-scale distributed environments
  • Excellent communication and collaboration skills, with the ability to clearly prioritize and articulate impact-driven technical solutions
Job Responsibility:
  • Investigate and mitigate performance bottlenecks in large-scale distributed training and inference systems
  • Develop and implement both low-level (operator/kernel) and high-level (system/architecture) optimization strategies
  • Translate research models and prototypes into highly optimized, production-ready inference systems
  • Explore and integrate inference compilers such as TensorRT, ONNX Runtime, AWS Neuron and Inferentia, or similar technologies (see the export sketch after this list)
  • Design, test, and deploy scalable solutions for parallel and distributed workloads on heterogeneous hardware
  • Facilitate knowledge transfer and bidirectional support between Research and Engineering teams, ensuring alignment of priorities and solutions
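
As one concrete instance of integrating an inference compiler, the sketch below exports a toy PyTorch model to ONNX and runs it under ONNX Runtime; TensorRT or Neuron integration follows the same export-then-execute shape. It assumes the onnx and onnxruntime packages are installed.

    # Hand a PyTorch model to an inference runtime via ONNX.
    import torch
    import onnxruntime as ort

    model = torch.nn.Sequential(
        torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4)
    ).eval()
    x = torch.randn(1, 8)

    torch.onnx.export(model, (x,), "model.onnx",
                      input_names=["x"], output_names=["y"])

    sess = ort.InferenceSession("model.onnx")  # runtime applies its own graph opts
    (y,) = sess.run(["y"], {"x": x.numpy()})
    print(y.shape, torch.allclose(torch.from_numpy(y), model(x), atol=1e-5))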
What we offer:
  • Competitive equity grants
  • 100% employer-paid benefits
  • Flexibility of being fully remote
  • Fulltime

Engineering Manager, GPU Kernel

As the Engineering Manager for the GPU Kernel team, you’ll lead the team respons...
Location:
United Kingdom, London
Salary:
Not provided
Wayve
Expiration Date:
Until further notice
Requirements:
  • Proven experience as an Engineering Manager delivering complex engineering projects
  • Experience developing GPU kernels and/or ML compilers (e.g., CUDA, OpenCL, TensorRT, MLIR, TVM, etc.)
  • Experience optimising systems to meet strict utilisation and latency requirements
  • Excellent interpersonal and communication skills
Job Responsibility:
  • Lead a multi-disciplinary team of ML GPU kernel engineers to enable efficient ML deployments across millions of customer vehicles
  • Set key foundational strategy on deployment frameworks, compilers, toolchains and SoCs
  • Set clear objectives and priorities, and allocate resources efficiently
  • Have opportunities to develop new skills, especially within end-to-end ML and inference optimisation
  • Fulltime