
Performance Engineer - Inference

Cerebras Systems

Location:
Toronto, Canada

Contract Type:
Not provided

Salary:
Not provided

Job Description:

Engineers on the inference performance team operate at the intersection of hardware and software, driving end-to-end model inference speed and throughput. Their work spans low-level kernel performance debugging and optimization, system-level performance analysis, performance modeling and estimation, and the development of tooling for performance projection and diagnostics.

Job Responsibility:

  • Build performance models (kernel-level, end-to-end) to estimate the performance of state-of-the-art and customer ML models
  • Optimize and debug our kernel microcode and compiler algorithms to elevate ML model inference speed, throughput, and compute utilization on the Cerebras WSE
  • Debug and understand runtime performance on the system and cluster
  • Develop tools and infrastructure to help visualize performance data collected from the Wafer Scale Engine and our compute cluster
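
The performance-modeling responsibility above can be illustrated with a minimal roofline-style estimate: a kernel can run no faster than its compute time or its memory-traffic time, whichever dominates. The kernel shape and hardware numbers below are illustrative assumptions only, not Cerebras WSE figures.

```python
# Minimal roofline-style performance model for a single dense kernel.
# All hardware numbers below are illustrative assumptions, not real
# figures for any particular accelerator.

def kernel_time_estimate(flops, bytes_moved, peak_flops, peak_bw):
    """Lower-bound execution time: the kernel is limited by either its
    compute time or its memory-traffic time, whichever is larger."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bw
    return max(compute_time, memory_time)

# Example: a 4096x4096x4096 fp16 matmul on a hypothetical accelerator.
M = N = K = 4096
flops = 2 * M * N * K                      # multiply-accumulate count
bytes_moved = 2 * (M * K + K * N + M * N)  # fp16 operands + output, one pass
t = kernel_time_estimate(flops, bytes_moved,
                         peak_flops=300e12,  # 300 TFLOP/s (assumed)
                         peak_bw=2e12)       # 2 TB/s (assumed)
print(f"estimated kernel time: {t * 1e3:.3f} ms")  # → estimated kernel time: 0.458 ms
```

At these assumed numbers the kernel is compute-bound; end-to-end models extend the same idea by summing per-kernel estimates along the model graph.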

Requirements:

  • Bachelor's / Master's / PhD in Electrical Engineering or Computer Science
  • Strong background in computer architecture
  • Exposure to and understanding of low-level deep learning / LLM math
  • Strong analytical and problem-solving mindset
  • 3+ years of experience in a relevant domain (Computer Architecture, CPU/GPU Performance, Kernel Optimization, HPC)
  • Experience working on CPU/GPU simulators
  • Exposure to performance profiling and debugging on any system pipeline
  • Comfort with C++ and Python

What we offer:

  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026


Similar Jobs for Performance Engineer - Inference

Head of Inference Kernels

As a core member of the team, you will play a pivotal role in leading a high-per...
Location: San Jose, United States
Salary: 200000.00 - 300000.00 USD / Year
Etched
Expiration Date: Until further notice
Requirements:
  • Experience designing and optimizing GPU kernels for deep learning using CUDA and assembly (ASM)
  • Experience with low-level programming to maximize performance for AI operations, leveraging tools like Composable Kernel (CK), CUTLASS, and Triton for multi-GPU and multi-platform performance
  • Deep fluency with transformer inference architecture, optimization levers, and full-stack systems (e.g., vLLM, custom runtimes)
  • History of delivering tangible perf wins on GPU hardware or custom AI accelerators
  • Solid understanding of roofline models of compute throughput, memory bandwidth and interconnect performance
  • Experienced in running large-scale workloads on heterogeneous compute clusters, optimizing for efficiency and scalability of AI workloads
  • Scopes projects crisply, sets aggressive but realistic milestones, and drives technical decision-making across the team
  • Anticipates blockers and shifts resources proactively
Job Responsibility:
  • Architect Best-in-Class Inference Performance on Sohu: Deliver continuous batching throughput exceeding B200 by ≥10x on priority workloads
  • Develop Best-in-Performance Inference Mega Kernels: Develop complex, fused kernels that increase chip utilization and reduce inference latency, and validate these optimizations through benchmarking and regression testing in production pipelines
  • Architect Model Mapping Strategies: Develop system-level optimizations using a mix of techniques such as tensor parallelism and expert parallelism for optimal performance
  • Hardware-Software Co-design of Inference-time Algorithmic Innovation: Develop and deploy production-ready inference-time algorithmic improvements (e.g., speculative decoding, prefill-decode disaggregation, KV cache offloading)
  • Build Scalable Team and Roadmap: Grow and retain a team of high-performing inference optimization engineers
  • Cross-Functional Performance Alignment: Ensure inference stack and performance goals are aligned with the software infrastructure teams, GTM and hardware teams for future generations of our hardware
What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
  • Significant equity package
  • Full-time

Research Engineer, Core ML

This is a research engineering role with direct production impact. You will tran...
Location: San Francisco, United States
Salary: 200000.00 - 280000.00 USD / Year
Together AI
Expiration Date: Until further notice
Requirements:
  • 3+ years of experience working on ML systems, large‑scale model training, inference, or adjacent areas (or equivalent experience via research / open source)
  • Advanced degree in Computer Science, EE, or a related field, or equivalent practical experience
  • Demonstrated experience owning complex technical projects end‑to‑end
  • Strong expertise in at least one of the following: Large‑scale inference systems (e.g., SGLang, vLLM, FasterTransformer, TensorRT, custom engines, or similar), GPU performance, distributed serving
  • RL / post‑training for LLMs or large models (e.g., GRPO, RLHF/RLAIF, DPO‑like methods, reward modeling)
  • Model architecture design for Transformers or other large neural nets
  • Distributed systems / high‑performance computing for ML
  • Strong coding ability in Python
  • Experience profiling and optimizing performance across GPU, networking, and memory layers
  • Track record of impactful work in ML systems, RL, or large‑scale model training (papers, open‑source projects, or production systems)
Job Responsibility:
  • Advance inference efficiency end‑to‑end
  • Design and prototype algorithms, architectures, and scheduling strategies for low‑latency, high‑throughput inference
  • Implement and maintain changes in high‑performance inference engines
  • Profile and optimize performance across GPU, networking, and memory layers
  • Unify inference with RL / post‑training
  • Design and operate RL and post‑training pipelines
  • Make RL and post‑training workloads more efficient with inference‑aware training loops
  • Co‑design algorithms and infrastructure
  • Run ablations and scale‑up experiments to understand trade‑offs
  • Own critical systems at production scale
What we offer:
  • Startup equity
  • Health insurance
  • Competitive benefits
  • Full-time

DC-GPU Performance Modeling Engineer

Architect, analyse and optimize high-performance GPU-centric SoCs for Machine Le...
Location: Bangalore, India
Salary: Not provided
AMD
Expiration Date: Until further notice
Requirements:
  • Strong understanding of computer architecture
  • Experience working with GPUs, SoCs, or ML accelerators would be a plus
  • Experience with performance analysis, workload characterization, and hardware/software co-design exploration
  • Familiarity with ML models and software stacks relevant to ML
  • Understanding of AI model distributed training and inference, model layers and ML ops, parallelization strategies
  • Experience in system-level modelling and simulation will be a plus
  • Strong programming skills, including experience with Python (or similar)
  • Ph.D. in Computer Science / Electronics Engineering, and 3+ years of experience as a Performance Engineer
  • M.S./M.Tech. in Computer Science / Electronics Engineering, and 5+ years of experience as a Performance Engineer
  • B.Tech. in Computer Science / Electronics Engineering, and 7+ years of experience as a Performance Engineer
Job Responsibility:
  • Define, build and maintain performance models for performance projections, analysis and architecture exploration
  • Develop and execute system-level modelling strategies for ML and GPU hardware and software co-design
  • Drive performance trade-off studies for new architectural features, algorithms, and system configurations, providing data-driven recommendations
  • Collaborate with architecture, design and software teams to integrate models, define workloads and analyse simulation results
  • Innovate and advance modelling methodologies, tools and infrastructure to improve accuracy, speed, and architectural insight

Inference Technical Lead

The Sora team is pioneering multimodal capabilities for OpenAI’s foundation mode...
Location: San Francisco, United States
Salary: 380000.00 USD / Year
OpenAI
Expiration Date: Until further notice
Requirements:
  • Deep expertise in model performance optimization, particularly at the inference layer
  • Strong background in kernel-level systems, data movement, and low-level performance tuning
  • Excited about scaling high-performing AI systems that serve real-world, multimodal workloads
  • Can navigate ambiguity, set technical direction, and drive complex initiatives to completion
Job Responsibility:
  • Perform engineering efforts focused on improving model serving, inference performance, and system efficiency
  • Drive optimizations from a kernel and data movement perspective to improve system throughput and reliability
  • Partner closely with research and product teams to ensure our models perform effectively at scale
  • Design, build, and improve critical serving infrastructure to support Sora’s growth and reliability needs
  • Contribute to improvements in model serving efficiency for Sora
  • Drive initiatives to optimize inference performance and scalability
  • Be engaged in model design to help our researchers develop inference-friendly models
What we offer:
  • Offers Equity
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Full-time

DC-GPU Performance Modeling Engineer

Architect, analyse and optimize high-performance GPU-centric SoCs for Machine Le...
Location: Bangalore, India
Salary: Not provided
AMD
Expiration Date: Until further notice
Requirements:
  • Ph.D. in Computer Science / Electronics Engineering, and 1+ years of experience as a Performance Engineer
  • M.S./M.Tech. in Computer Science / Electronics Engineering, and 3+ years of experience as a Performance Engineer
  • B.Tech. in Computer Science / Electronics Engineering, and 5+ years of experience as a Performance Engineer
  • Strong understanding of computer architecture
  • Experience working with GPUs, SoCs, or ML accelerators would be a plus
  • Exposure to performance analysis, workload characterization, and hardware/software co-design exploration
  • Familiarity with ML models and software stacks relevant to ML
  • Understanding of AI model distributed training and inference, model layers and ML ops, parallelization strategies
  • Understanding of system-level modelling and simulation will be a plus
  • Strong programming skills, including experience with Python (or similar)
Job Responsibility:
  • Define, build and maintain performance models for performance projections, analysis and architecture exploration
  • Develop and execute system-level modelling strategies for ML and GPU hardware and software co-design
  • Drive performance trade-off studies for new architectural features, algorithms, and system configurations, providing data-driven recommendations
  • Collaborate with architecture, design and software teams to integrate models, define workloads and analyse simulation results
  • Innovate and advance modelling methodologies, tools and infrastructure to improve accuracy, speed, and architectural insight

Engineering Manager - Inference

We are looking for an Inference Engineering Manager to lead our AI Inference tea...
Location: San Francisco, United States
Salary: 300000.00 - 385000.00 USD / Year
Perplexity
Expiration Date: Until further notice
Requirements:
  • 5+ years of engineering experience with 2+ years in a technical leadership or management role
  • Deep experience with ML systems and inference frameworks (PyTorch, TensorFlow, ONNX, TensorRT, vLLM)
  • Strong understanding of LLM architecture: Multi-Head Attention, Multi/Grouped-Query Attention, and common layers
  • Experience with inference optimizations: batching, quantization, kernel fusion, FlashAttention
  • Familiarity with GPU characteristics, roofline models, and performance analysis
  • Experience deploying reliable, distributed, real-time systems at scale
  • Track record of building and leading high-performing engineering teams
  • Experience with parallelism strategies: tensor parallelism, pipeline parallelism, expert parallelism
  • Strong technical communication and cross-functional collaboration skills
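
One of the parallelism strategies named above, tensor parallelism, can be sketched in a few lines: a layer's weight matrix is split column-wise across devices, each shard computes a slice of the output, and concatenation recovers the full result. Pure-Python lists stand in for device tensors here; this is a sketch of the idea, not any particular framework's API.

```python
# Toy column-parallel (tensor-parallel) linear layer.

def matmul(a, b):
    """Naive matmul for small lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_columns(w, parts):
    """Shard a weight matrix column-wise across `parts` devices."""
    n = len(w[0]) // parts
    return [[row[p * n:(p + 1) * n] for row in w] for p in range(parts)]

def column_parallel_forward(x, w, parts):
    """Each shard computes x @ w_shard; concatenating the output slices
    (an all-gather in a real implementation) reproduces x @ w."""
    outs = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((o[i] for o in outs), []) for i in range(len(x))]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
assert column_parallel_forward(x, w, parts=2) == matmul(x, w)
```

Expert parallelism applies the same sharding idea at the granularity of whole MoE experts rather than matrix columns.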
Job Responsibility:
  • Lead and grow a high-performing team of AI inference engineers
  • Develop APIs for AI inference used by both internal and external customers
  • Architect and scale our inference infrastructure for reliability and efficiency
  • Benchmark and eliminate bottlenecks throughout our inference stack
  • Drive large sparse/MoE model inference at rack scale, including sharding strategies for massive models
  • Push the frontier by building inference systems to support sparse attention, disaggregated prefill/decode serving, etc.
  • Improve the reliability and observability of our systems and lead incident response
  • Own technical decisions around batching, throughput, latency, and GPU utilization
  • Partner with ML research teams on model optimization and deployment
  • Recruit, mentor, and develop engineering talent
What we offer:
  • Equity
  • Health
  • Dental
  • Vision
  • Retirement
  • Fitness
  • Commuter and dependent care accounts
  • Full-time

Machine Learning Engineer - Inference

Together AI is seeking a Machine Learning Engineer to join our Inference Engine ...
Location: San Francisco, United States
Salary: 160000.00 - 230000.00 USD / Year
Together AI
Expiration Date: Until further notice
Requirements:
  • 3+ years of experience writing high-performance, well-tested, production-quality code
  • Proficiency with Python and PyTorch
  • Demonstrated experience in building high performance libraries and tooling
  • Excellent understanding of low-level operating systems concepts including multi-threading, memory management, networking, storage, performance, and scale
Job Responsibility:
  • Design and build the production systems that power the Together AI inference engine, enabling reliability and performance at scale
  • Develop and optimize runtime inference services for large-scale AI applications
  • Collaborate with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world
  • Conduct design and code reviews to ensure high standards of quality
  • Create services, tools, and developer documentation to support the inference engine
  • Implement robust and fault-tolerant systems for data ingestion and processing
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other competitive benefits
  • Full-time

LLM Inference Frameworks and Optimization Engineer

At Together.ai, we are building state-of-the-art infrastructure to enable effici...
Location: San Francisco, United States
Salary: 160000.00 - 230000.00 USD / Year
Together AI
Expiration Date: Until further notice
Requirements:
  • 3+ years of experience in deep learning inference frameworks, distributed systems, or high-performance computing
  • Familiarity with at least one LLM inference framework (e.g., TensorRT-LLM, vLLM, SGLang, TGI (Text Generation Inference))
  • Background knowledge and experience in at least one of the following: GPU programming (CUDA/Triton/TensorRT), compiler, model quantization, and GPU cluster scheduling
  • Deep understanding of KV cache systems like Mooncake, PagedAttention, or custom in-house variants
  • Proficient in Python and C++/CUDA for high-performance deep learning inference
  • Deep understanding of Transformer architectures and LLM/VLM/Diffusion model optimization
  • Knowledge of inference optimizations such as workload scheduling, CUDA graphs, compilation, and efficient kernels
  • Strong analytical problem-solving skills with a performance-driven mindset
  • Excellent collaboration and communication skills across teams
Job Responsibility:
  • Design and develop fault-tolerant, high-concurrency distributed inference engines for text, image, and multimodal generation models
  • Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, pipeline parallelism for high-performance serving
  • Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability
  • Collaborate with hardware teams on performance bottleneck analysis, co-optimize inference performance for GPUs, TPUs, or custom accelerators
  • Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize E2E model serving pipelines
What we offer:
  • Competitive compensation
  • Startup equity
  • Health insurance
  • Other competitive benefits
  • Full-time