Performance Engineer - Inference

Cerebras Systems

Location:
Canada, Toronto

Contract Type:
Not provided

Salary:

Not provided

Job Description:

Engineers on the inference performance team operate at the intersection of hardware and software, driving end-to-end model inference speed and throughput. Their work spans low-level kernel performance debugging and optimization, system-level performance analysis, performance modeling and estimation, and the development of tooling for performance projection and diagnostics.

Job Responsibility:

  • Build performance models (kernel-level and end-to-end) to estimate the performance of state-of-the-art and customer ML models
  • Optimize and debug our kernel microcode and compiler algorithms to improve ML model inference speed, throughput, and compute utilization on the Cerebras WSE
  • Debug and understand runtime performance on the system and cluster
  • Develop tools and infrastructure to help visualize performance data collected from the Wafer Scale Engine and our compute cluster

Requirements:

  • Bachelor's, Master's, or PhD in Electrical Engineering or Computer Science
  • Strong background in computer architecture
  • Exposure to and understanding of low-level deep learning / LLM math
  • Strong analytical and problem-solving mindset
  • 3+ years of experience in a relevant domain (Computer Architecture, CPU/GPU Performance, Kernel Optimization, HPC)
  • Experience working on CPU/GPU simulators
  • Exposure to performance profiling and debug on any system pipeline
  • Comfort with C++ and Python

What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026


Similar Jobs for Performance Engineer - Inference

Research Engineer AI

The role involves conducting high-quality research in AI and HPC, shaping future...
Location:
United Kingdom, Bristol
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements:
  • A good working knowledge of AI/ML frameworks (at least TensorFlow and PyTorch), as well as data preparation, handling, and lineage control, and model deployment, in particular in a distributed environment
  • At least a B.Sc. equivalent in a Science, Technology, Engineering or Mathematical discipline
  • Development experience in compiled languages such as C, C++ or Fortran and experience with interpreted environments such as Python
  • Parallel programming experience with relevant programming models such as OpenMP, MPI, CUDA, OpenACC, HIP, or PGAS languages is highly desirable
Job Responsibility:
  • Perform world-class research while also shaping products of the future
  • Enable high performance AI software stacks on supercomputers
  • Provide new environments/abstractions to support application developers to build, deploy, and run AI applications taking advantage of leading-edge hardware at scale
  • Manage modern data-intensive AI training and inference workloads
  • Port and optimize workloads of key research centers like the AI safety institute
  • Support onboarding and scaling of domain-specific applications
  • Foster collaboration with the UK and European research community
What we offer:
  • Health & Wellbeing benefits that support physical, financial and emotional wellbeing
  • Career development programs catered to achieving career goals
  • Unconditional inclusion in the workplace
  • Flexibility to manage work and personal needs
  • Fulltime

LLM Inference Performance & Evals Engineer

Join the inference model team dedicated to bringing up state-of-the-art models,...
Location:
Canada, Toronto
Salary:
Not provided
Cerebras Systems
Expiration Date
Until further notice
Requirements:
  • 3+ years building high-performance ML or systems software
  • Solid grounding in Transformer math (attention scaling, KV-cache, quantization), or clear evidence you learn this material rapidly
  • Comfort navigating the full AI toolchain: Python modeling code, compiler IRs, performance profiling, etc.
  • Strong debugging skills across performance, numerical accuracy, and runtime integration
  • Prior experience in modeling, compilers, or crafting benchmarks or performance studies, not just black-box QA tests
  • Strong passion to leverage AI agents or workflow orchestration tools to boost personal productivity
Job Responsibility:
  • Prototype and benchmark cutting-edge ideas: new attentions, MoE, speculative decoding, and many more innovations as they emerge
  • Develop agent-driven automation that designs experiments, schedules runs, triages regressions, and drafts pull-requests
  • Work closely with compiler, runtime, and silicon teams: unique opportunity to experience the full stack of software/hardware innovation
  • Keep pace with the latest open- and closed-source models and run them first on wafer scale to expose new optimization opportunities
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

Senior GPU Engineer

We are seeking an expert Senior GPU Engineer to join our AI Infrastructure team....
Location:
China, Beijing
Salary:
Not provided
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 4+ years of experience in systems programming, HPC, or GPU software development, featuring at least 5 years of hands-on CUDA/C++ kernel development
  • Expertise in the CUDA programming model and NVIDIA GPU architectures (specifically Ampere/Hopper)
  • Deep understanding of the memory hierarchy (Shared Memory, L2 cache, Registers), warp-level primitives, occupancy optimization, and bank conflict resolution
  • Familiarity with advanced hardware features: Tensor Cores, TMA (Tensor Memory Accelerator), and asynchronous copy
  • Proven ability to navigate and modify complex, large-scale codebases (e.g., PyTorch internals, Linux kernel)
  • Experience with build and binding ecosystems: CMake, pybind11, and CI/CD for GPU workloads
  • Mastery of NVIDIA Nsight Systems/Compute
  • Ability to mathematically reason about performance using the Roofline Model, memory bandwidth utilization, and compute throughput
Job Responsibility:
  • Custom Operator Development: Design and implement highly optimized GPU kernels (CUDA/Triton) for critical deep learning operations (e.g., FlashAttention, GEMM, LayerNorm) to outperform standard libraries
  • Inference Engine Architecture: Contribute to the development of our high-performance inference engine, focusing on graph optimizations, operator fusion, and dynamic memory management (e.g., KV Cache optimization)
  • Performance Optimization: Deeply analyze and profile model performance using tools like Nsight Systems/Compute. Identify bottlenecks in memory bandwidth, instruction throughput, and kernel launch overheads
  • Model Acceleration: Implement advanced acceleration techniques such as Quantization (INT8, FP8, AWQ), Kernel Fusion, and continuous batching
  • Distributed Computing: Optimize communication primitives (NCCL) to enable efficient multi-GPU and multi-node inference (Tensor Parallelism, Pipeline Parallelism)
  • Hardware Adaptation: Ensure the software stack fully utilizes modern GPU architecture features (e.g., NVIDIA Hopper/Ampere Tensor Cores, Asynchronous Copy)
  • Fulltime

Software Engineer II and Senior Software Engineer - Performance

The Artificial Intelligence Performance team at Microsoft develops AI software t...
Location:
United States, Mountain View
Salary:
100,600.00 - 199,000.00 USD / Year
Microsoft Corporation
Expiration Date
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, or Python OR equivalent experience
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Identify and drive improvements to end-to-end inference performance of OpenAI and other state-of-the-art LLMs
  • Measure and benchmark performance on Nvidia/AMD GPUs and first-party Microsoft silicon
  • Optimize and monitor performance of LLMs and build SW tooling to enable insights into performance opportunities ranging from the model level to the systems and silicon level to improve customer experience and reduce the footprint of the computing fleet
  • Enable fast time to market of LLMs/models and their deployments at scale by building SW tools that afford velocity in porting models on new Nvidia and AMD GPUs
  • Design, implement, and test functions or components for our AI/DNN/LLM frameworks and tools
  • Speed up and reduce the complexity of key components/pipelines to improve the performance and/or efficiency of our systems
  • Communicate and collaborate with our partners both internal and external
  • Embody Microsoft's Culture and Values
  • Fulltime

Member of Technical Staff, Performance Optimization

We're looking for a Software Engineer focused on Performance Optimization to hel...
Location:
United States, San Mateo
Salary:
175,000.00 - 220,000.00 USD / Year
Fireworks AI
Expiration Date
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent practical experience
  • 5+ years of experience working on performance optimization or high-performance computing systems
  • Proficiency in CUDA or ROCm and experience with GPU profiling tools (e.g., Nsight, nvprof, CUPTI)
  • Familiarity with PyTorch and performance-critical model execution
  • Experience with distributed system debugging and optimization in multi-GPU environments
  • Deep understanding of GPU architecture, parallel programming models, and compute kernels
Job Responsibility:
  • Optimize system and GPU performance for high-throughput AI workloads across training and inference
  • Analyze and improve latency, throughput, memory usage, and compute efficiency
  • Profile system performance to detect and resolve GPU- and kernel-level bottlenecks
  • Implement low-level optimizations using CUDA, Triton, and other performance tooling
  • Drive improvements in execution speed and resource utilization for large-scale model workloads (LLMs, VLMs, and video models)
  • Collaborate with ML researchers to co-design and tune model architectures for hardware efficiency
  • Improve support for mixed precision, quantization, and model graph optimization
  • Build and maintain performance benchmarking and monitoring infrastructure
  • Scale inference and training systems across multi-GPU, multi-node environments
  • Evaluate and integrate optimizations for emerging hardware accelerators and specialized runtimes
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package
  • Fulltime

Research Engineer, Scaling

As a Research Engineer, Scaling, you will design and build infrastructure to sup...
Location:
United States, Palo Alto
Salary:
180,000.00 - 300,000.00 USD / Year
1X Technologies
Expiration Date
Until further notice
Requirements:
  • Strong programming experience in Python and/or C++
  • Deep intuitive understanding of what affects training or inference speed: from bottlenecks to scaling laws
  • A mindset aligned with extremely high scaling: belief that scale is foundational to enabling humanoid robotics
  • Degree in Computer Science or a related field
  • Hands‑on experience with distributed training frameworks (e.g., TorchTitan, DeepSpeed, FSDP/ZeRO), multi‑node debugging, experiment management
  • Proven skills optimizing inference performance: graph compilers, batching/scheduling, serving systems (e.g., using TensorRT or equivalents)
  • Familiarity with quantization strategies (PTQ, QAT, INT8/FP8) and tools like TensorRT, bitsandbytes, etc.
  • Experience writing or tuning CUDA or Triton kernels, with an understanding of hardware features like vectorization, tensor cores, and memory hierarchies
Job Responsibility:
  • Own and lead scaling of both distributed training and inference systems
  • Ensure compute resources are sufficient so that data, not hardware, is the limiter
  • Enable massive training at scale (1000+ GPUs) on robot data, handling fault tolerance, experiment tracking, distributed operations, and large datasets
  • Optimize inference throughput in datacenter contexts (e.g., for world models and diffusion engines)
  • Reduce latency and optimize performance for on‑device robot policies through techniques like quantization, scheduling, distillation, etc.
What we offer:
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
  • Fulltime

AI Research Engineer, Scaling

As a Research Engineer focused on Scaling, you will design and build robust infr...
Location:
United States, Palo Alto
Salary:
180,000.00 - 300,000.00 USD / Year
1X Technologies
Expiration Date
Until further notice
Requirements:
  • Strong programming experience in Python and/or C++
  • Deep intuitive understanding of training and inference speed bottlenecks and scaling laws
  • A mindset aligned with extremely high scaling: belief that scale is foundational to enabling humanoid robotics
  • Degree in Computer Science or a related field
  • Experience with distributed training frameworks (e.g., TorchTitan, DeepSpeed, FSDP/ZeRO), multi-node debugging, and experiment management
  • Proven skills in optimizing inference performance using graph compilers, batching/scheduling, and serving systems like TensorRT or equivalents
  • Familiarity with quantization strategies (PTQ, QAT, INT8/FP8) and tools such as TensorRT and bitsandbytes
  • Experience developing or tuning CUDA or Triton kernels with understanding of hardware-level optimization (vectorization, tensor cores, memory hierarchies)
Job Responsibility:
  • Own and lead scaling of distributed training and inference systems
  • Ensure compute resources are optimized to make data the primary constraint
  • Enable massive training runs (1000+ GPUs) using robot data, with robust fault tolerance, experiment tracking, and distributed operations
  • Optimize inference throughput for datacenter use cases such as world models and diffusion engines
  • Reduce latency and enhance performance for on-device robot policies using techniques such as quantization, scheduling, and distillation
What we offer:
  • Equity
  • Health, dental, and vision insurance
  • 401(k) with company match
  • Paid time off and holidays
  • Fulltime

Data Scientist

Circle K is seeking a Data Scientist responsible for delivering advanced analyti...
Location:
India, Gurugram
Salary:
Not provided
Circle K
Expiration Date
Until further notice
Requirements:
  • A higher degree in an analytical discipline like Finance, Mathematics, Statistics, Engineering, or similar
  • Experience: 3-4 years for Data Scientist
  • Relevant working experience in a quantitative/applied analytics role
  • Experience with programming, and the ability to quickly pick up handling large data volumes with modern data processing tools, e.g. by using Spark / SQL / Python
  • Excellent communication skills in English, both verbal and written
  • Delivery Excellence
  • Business disposition
  • Social intelligence
  • Innovation and agility
  • Functional Analytics (Retail Analytics, Supply Chain Analytics, Marketing Analytics, Customer Analytics, etc.)
Job Responsibility:
  • Evaluate performance of categories and activities, using proven and advanced analytical methods
  • Support stakeholders with actionable insights based on transactional, financial or customer data on an ongoing basis
  • Oversee the design and measurement of experiments and pilots
  • Initiate and conduct advanced analytics projects such as clustering, forecasting, causal impact
  • Build highly impactful and intuitive dashboards that bring the underlying data to life through insights
  • Improve data quality by using and improving tools to automatically detect issues
  • Develop analytical solutions or dashboards using user-centric design techniques in alignment with ACT’s protocol
  • Study industry/organization benchmarks and design/develop analytical solutions to monitor or improve business performance across retail, marketing, and other business areas
  • Work with Peers, Functional Consultants, Data Engineers, and cross-functional teams to lead / support the complete lifecycle of analytical applications, from development of mock-ups and storyboards to complete production ready application
  • Provide regular updates to stakeholders to simplify and clarify complex concepts, and communicate the output of work to business
  • Fulltime