
Performance Engineer - Inference

Cerebras Systems

Location:
Toronto, Canada

Contract Type:
Not provided

Salary:
Not provided

Job Description:

Engineers on the inference performance team operate at the intersection of hardware and software, driving end-to-end model inference speed and throughput. Their work spans low-level kernel performance debugging and optimization, system-level performance analysis, performance modeling and estimation, and the development of tooling for performance projection and diagnostics.

Job Responsibility:

  • Build performance models (kernel-level, end-to-end) to estimate the performance of state-of-the-art and customer ML models
  • Optimize and debug our kernel microcode and compiler algorithms to elevate ML model inference speed, throughput, and compute utilization on the Cerebras WSE
  • Debug and understand runtime performance on the system and cluster
  • Develop tools and infrastructure to help visualize performance data collected from the Wafer Scale Engine and our compute cluster
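
The performance-modeling responsibility above can be illustrated with a minimal roofline-style estimate: a kernel can run no faster than its compute time or its memory-traffic time, whichever dominates. The kernel shape and hardware numbers below are illustrative assumptions only, not Cerebras WSE figures.

```python
# Minimal roofline-style performance model for a single dense kernel.
# All hardware numbers below are illustrative assumptions, not real
# figures for any particular accelerator.

def kernel_time_estimate(flops, bytes_moved, peak_flops, peak_bw):
    """Lower-bound execution time: the kernel is limited by either its
    compute time or its memory-traffic time, whichever is larger."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bw
    return max(compute_time, memory_time)

# Example: a 4096x4096x4096 fp16 matmul on a hypothetical accelerator.
M = N = K = 4096
flops = 2 * M * N * K                      # multiply-accumulate count
bytes_moved = 2 * (M * K + K * N + M * N)  # fp16 operands + output, one pass
t = kernel_time_estimate(flops, bytes_moved,
                         peak_flops=300e12,  # 300 TFLOP/s (assumed)
                         peak_bw=2e12)       # 2 TB/s (assumed)
print(f"estimated kernel time: {t * 1e3:.3f} ms")  # → estimated kernel time: 0.458 ms
```

At these assumed numbers the kernel is compute-bound; end-to-end models extend the same idea by summing per-kernel estimates along the model graph.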

Requirements:

  • Bachelor's / Master's / PhD in Electrical Engineering or Computer Science
  • Strong background in computer architecture
  • Exposure to and understanding of low-level deep learning / LLM math
  • Strong analytical and problem-solving mindset
  • 3+ years of experience in a relevant domain (Computer Architecture, CPU/GPU Performance, Kernel Optimization, HPC)
  • Experience working on CPU/GPU simulators
  • Exposure to performance profiling and debugging on any system pipeline
  • Comfort with C++ and Python

What we offer:

  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026


Similar Jobs for Performance Engineer - Inference

Head of Inference Kernels

As a core member of the team, you will play a pivotal role in leading a high-per...
Location: San Jose, United States
Salary: 200000.00 - 300000.00 USD / Year
Etched
Expiration Date: Until further notice
Requirements:
  • Experience designing and optimizing GPU kernels for deep learning using CUDA and assembly (ASM)
  • Experience with low-level programming to maximize performance for AI operations, leveraging tools like Composable Kernel (CK), CUTLASS, and Triton for multi-GPU and multi-platform performance
  • Deep fluency with transformer inference architecture, optimization levers, and full-stack systems (e.g., vLLM, custom runtimes)
  • History of delivering tangible perf wins on GPU hardware or custom AI accelerators
  • Solid understanding of roofline models of compute throughput, memory bandwidth and interconnect performance
  • Experienced in running large-scale workloads on heterogeneous compute clusters, optimizing for efficiency and scalability of AI workloads
  • Scopes projects crisply, sets aggressive but realistic milestones, and drives technical decision-making across the team
  • Anticipates blockers and shifts resources proactively
Job Responsibility:
  • Architect Best-in-Class Inference Performance on Sohu: Deliver continuous batching throughput exceeding B200 by ≥10x on priority workloads
  • Develop Best-in-Performance Inference Mega Kernels: Develop complex, fused kernels that increase chip utilization and reduce inference latency, and validate these optimizations through benchmarking and regression testing in production pipelines
  • Architect Model Mapping Strategies: Develop system-level optimizations using a mix of techniques such as tensor parallelism and expert parallelism for optimal performance
  • Hardware-Software Co-design of Inference-time Algorithmic Innovation: Develop and deploy production-ready inference-time algorithmic improvements (e.g., speculative decoding, prefill-decode disaggregation, KV cache offloading)
  • Build Scalable Team and Roadmap: Grow and retain a team of high-performing inference optimization engineers
  • Cross-Functional Performance Alignment: Ensure inference stack and performance goals are aligned with the software infrastructure teams, GTM and hardware teams for future generations of our hardware
What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
  • Significant equity package
  • Full-time

Research Engineer, Core ML

This is a research engineering role with direct production impact. You will tran...
Location: San Francisco, United States
Salary: 200000.00 - 280000.00 USD / Year
Together AI
Expiration Date: Until further notice
Requirements:
  • 3+ years of experience working on ML systems, large‑scale model training, inference, or adjacent areas (or equivalent experience via research / open source)
  • Advanced degree in Computer Science, EE, or a related field, or equivalent practical experience
  • Demonstrated experience owning complex technical projects end‑to‑end
  • Strong expertise in at least one of the following: Large‑scale inference systems (e.g., SGLang, vLLM, FasterTransformer, TensorRT, custom engines, or similar), GPU performance, distributed serving
  • RL / post‑training for LLMs or large models (e.g., GRPO, RLHF/RLAIF, DPO‑like methods, reward modeling)
  • Model architecture design for Transformers or other large neural nets
  • Distributed systems / high‑performance computing for ML
  • Strong coding ability in Python
  • Experience profiling and optimizing performance across GPU, networking, and memory layers
  • Track record of impactful work in ML systems, RL, or large‑scale model training (papers, open‑source projects, or production systems)
Job Responsibility:
  • Advance inference efficiency end‑to‑end
  • Design and prototype algorithms, architectures, and scheduling strategies for low‑latency, high‑throughput inference
  • Implement and maintain changes in high‑performance inference engines
  • Profile and optimize performance across GPU, networking, and memory layers
  • Unify inference with RL / post‑training
  • Design and operate RL and post‑training pipelines
  • Make RL and post‑training workloads more efficient with inference‑aware training loops
  • Co‑design algorithms and infrastructure
  • Run ablations and scale‑up experiments to understand trade‑offs
  • Own critical systems at production scale
What we offer:
  • Startup equity
  • Health insurance
  • Competitive benefits
  • Full-time

DC-GPU Performance Modeling Engineer

Architect, analyse and optimize high-performance GPU-centric SoCs for Machine Le...
Location: Bangalore, India
Salary: Not provided
AMD
Expiration Date: Until further notice
Requirements:
  • Strong understanding of computer architecture
  • Experience working with GPUs, SoCs, or ML accelerators would be a plus
  • Experience with performance analysis, workload characterization, and hardware/software co-design exploration
  • Familiarity with ML models and software stacks relevant to ML
  • Understanding of AI model distributed training and inference, model layers and ML ops, parallelization strategies
  • Experience in system-level modelling and simulation will be a plus
  • Strong programming skills, including experience with Python (or similar)
  • Ph.D. in Computer Science / Electronics Engineering, and 3+ years of experience as a Performance Engineer
  • M.S./M.Tech. in Computer Science / Electronics Engineering, and 5+ years of experience as a Performance Engineer
  • B.Tech. in Computer Science / Electronics Engineering, and 7+ years of experience as a Performance Engineer
Job Responsibility:
  • Define, build and maintain performance models for performance projections, analysis and architecture exploration
  • Develop and execute system-level modelling strategies for ML and GPU hardware and software co-design
  • Drive performance trade-off studies for new architectural features, algorithms, and system configurations, providing data-driven recommendations
  • Collaborate with architecture, design and software teams to integrate models, define workloads and analyse simulation results
  • Innovate and advance modelling methodologies, tools and infrastructure to improve accuracy, speed, and architectural insight

Inference Technical Lead

The Sora team is pioneering multimodal capabilities for OpenAI’s foundation mode...
Location: San Francisco, United States
Salary: 380000.00 USD / Year
OpenAI
Expiration Date: Until further notice
Requirements:
  • Deep expertise in model performance optimization, particularly at the inference layer
  • Strong background in kernel-level systems, data movement, and low-level performance tuning
  • Excited about scaling high-performing AI systems that serve real-world, multimodal workloads
  • Can navigate ambiguity, set technical direction, and drive complex initiatives to completion
Job Responsibility:
  • Perform engineering efforts focused on improving model serving, inference performance, and system efficiency
  • Drive optimizations from a kernel and data movement perspective to improve system throughput and reliability
  • Partner closely with research and product teams to ensure our models perform effectively at scale
  • Design, build, and improve critical serving infrastructure to support Sora’s growth and reliability needs
  • Contribute to improvements in model serving efficiency for Sora
  • Drive initiatives to optimize inference performance and scalability
  • Be engaged in model design to help our researchers develop inference-friendly models
What we offer:
  • Offers Equity
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Full-time

DC-GPU Performance Modeling Engineer

Architect, analyse and optimize high-performance GPU-centric SoCs for Machine Le...
Location: Bangalore, India
Salary: Not provided
AMD
Expiration Date: Until further notice
Requirements:
  • Ph.D. in Computer Science / Electronics Engineering, and 1+ years of experience as a Performance Engineer
  • M.S./M.Tech. in Computer Science / Electronics Engineering, and 3+ years of experience as a Performance Engineer
  • B.Tech. in Computer Science / Electronics Engineering, and 5+ years of experience as a Performance Engineer
  • Strong understanding of computer architecture
  • Experience working with GPUs, SoCs, or ML accelerators would be a plus
  • Exposure to performance analysis, workload characterization, and hardware/software co-design exploration
  • Familiarity with ML models and software stacks relevant to ML
  • Understanding of AI model distributed training and inference, model layers and ML ops, parallelization strategies
  • Understanding of system-level modelling and simulation will be a plus
  • Strong programming skills, including experience with Python (or similar)
Job Responsibility:
  • Define, build and maintain performance models for performance projections, analysis and architecture exploration
  • Develop and execute system-level modelling strategies for ML and GPU hardware and software co-design
  • Drive performance trade-off studies for new architectural features, algorithms, and system configurations, providing data-driven recommendations
  • Collaborate with architecture, design and software teams to integrate models, define workloads and analyse simulation results
  • Innovate and advance modelling methodologies, tools and infrastructure to improve accuracy, speed, and architectural insight

Engineering Manager - Inference

We are looking for an Inference Engineering Manager to lead our AI Inference tea...
Location: San Francisco, United States
Salary: 300000.00 - 385000.00 USD / Year
Perplexity
Expiration Date: Until further notice
Requirements:
  • 5+ years of engineering experience with 2+ years in a technical leadership or management role
  • Deep experience with ML systems and inference frameworks (PyTorch, TensorFlow, ONNX, TensorRT, vLLM)
  • Strong understanding of LLM architecture: Multi-Head Attention, Multi/Grouped-Query Attention, and common layers
  • Experience with inference optimizations: batching, quantization, kernel fusion, FlashAttention
  • Familiarity with GPU characteristics, roofline models, and performance analysis
  • Experience deploying reliable, distributed, real-time systems at scale
  • Track record of building and leading high-performing engineering teams
  • Experience with parallelism strategies: tensor parallelism, pipeline parallelism, expert parallelism
  • Strong technical communication and cross-functional collaboration skills
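
One of the parallelism strategies named above, tensor parallelism, can be sketched in a few lines: a layer's weight matrix is split column-wise across devices, each shard computes a slice of the output, and concatenation recovers the full result. Pure-Python lists stand in for device tensors here; this is a sketch of the idea, not any particular framework's API.

```python
# Toy column-parallel (tensor-parallel) linear layer.

def matmul(a, b):
    """Naive matmul for small lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_columns(w, parts):
    """Shard a weight matrix column-wise across `parts` devices."""
    n = len(w[0]) // parts
    return [[row[p * n:(p + 1) * n] for row in w] for p in range(parts)]

def column_parallel_forward(x, w, parts):
    """Each shard computes x @ w_shard; concatenating the output slices
    (an all-gather in a real implementation) reproduces x @ w."""
    outs = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((o[i] for o in outs), []) for i in range(len(x))]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
assert column_parallel_forward(x, w, parts=2) == matmul(x, w)
```

Expert parallelism applies the same sharding idea at the granularity of whole MoE experts rather than matrix columns.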
Job Responsibility:
  • Lead and grow a high-performing team of AI inference engineers
  • Develop APIs for AI inference used by both internal and external customers
  • Architect and scale our inference infrastructure for reliability and efficiency
  • Benchmark and eliminate bottlenecks throughout our inference stack
  • Drive large sparse/MoE model inference at rack scale, including sharding strategies for massive models
  • Push the frontier by building inference systems to support sparse attention, disaggregated prefill/decode serving, etc.
  • Improve the reliability and observability of our systems and lead incident response
  • Own technical decisions around batching, throughput, latency, and GPU utilization
  • Partner with ML research teams on model optimization and deployment
  • Recruit, mentor, and develop engineering talent
What we offer:
  • Equity
  • Health
  • Dental
  • Vision
  • Retirement
  • Fitness
  • Commuter and dependent care accounts
  • Full-time

Machine Learning Engineer - Inference

Together AI is seeking a Machine Learning Engineer to join our Inference Engine ...
Location: San Francisco, United States
Salary: 160000.00 - 230000.00 USD / Year
Together AI
Expiration Date: Until further notice
Requirements:
  • 3+ years of experience writing high-performance, well-tested, production-quality code
  • Proficiency with Python and PyTorch
  • Demonstrated experience in building high performance libraries and tooling
  • Excellent understanding of low-level operating systems concepts including multi-threading, memory management, networking, storage, performance, and scale
Job Responsibility:
  • Design and build the production systems that power the Together AI inference engine, enabling reliability and performance at scale
  • Develop and optimize runtime inference services for large-scale AI applications
  • Collaborate with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world
  • Conduct design and code reviews to ensure high standards of quality
  • Create services, tools, and developer documentation to support the inference engine
  • Implement robust and fault-tolerant systems for data ingestion and processing
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other competitive benefits
  • Full-time

LLM Inference Frameworks and Optimization Engineer

At Together.ai, we are building state-of-the-art infrastructure to enable effici...
Location: San Francisco, United States
Salary: 160000.00 - 230000.00 USD / Year
Together AI
Expiration Date: Until further notice
Requirements:
  • 3+ years of experience in deep learning inference frameworks, distributed systems, or high-performance computing
  • Familiarity with at least one LLM inference framework (e.g., TensorRT-LLM, vLLM, SGLang, TGI (Text Generation Inference))
  • Background knowledge and experience in at least one of the following: GPU programming (CUDA/Triton/TensorRT), compiler, model quantization, and GPU cluster scheduling
  • Deep understanding of KV cache systems like Mooncake, PagedAttention, or custom in-house variants
  • Proficient in Python and C++/CUDA for high-performance deep learning inference
  • Deep understanding of Transformer architectures and LLM/VLM/Diffusion model optimization
  • Knowledge of inference optimizations such as workload scheduling, CUDA graphs, compilation, and efficient kernels
  • Strong analytical problem-solving skills with a performance-driven mindset
  • Excellent collaboration and communication skills across teams
Job Responsibility:
  • Design and develop fault-tolerant, high-concurrency distributed inference engines for text, image, and multimodal generation models
  • Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, pipeline parallelism for high-performance serving
  • Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability
  • Collaborate with hardware teams on performance bottleneck analysis, co-optimize inference performance for GPUs, TPUs, or custom accelerators
  • Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize E2E model serving pipelines
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other competitive benefits
  • Full-time