Full-Stack Software Engineer, Inference Job at Cohere (Toronto)

Senior Software Engineer

The AI & Innovation team at Microsoft Suzhou is seeking a highly motivated Senio...

Location

China, Beijing

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science, Electrical Engineering, or related technical field AND 4+ years of technical engineering experience with coding in languages such as Python, C++, or C#, OR equivalent industry experience
7+ years of software engineering experience with a focus on AI/ML systems
Proven experience with one or more of the following: Developing or applying generative AI models
Building and optimizing inference pipelines for large AI models on cloud infrastructure
Integrating AI features into consumer-facing web or mobile applications at scale
Working with programmatic advertising ecosystems
Familiarity with cloud services (Azure preferred), microservices architecture, and DevOps practices
Hands-on experience in at least two of the three core areas: AI/ML Prototyping: Experience with deep learning frameworks (PyTorch, TensorFlow) and implementing/tuning models from recent literature
Video/Graphics Processing: Experience with video codecs (FFmpeg), computer graphics, GPU programming (CUDA), or real-time media pipelines

Job Responsibility

Rapid AI Prototyping: Design, build, and iterate on high-potential prototypes for AI-powered video generation, editing, and content understanding
System Integration & Productionization: Bridge the gap between research prototypes and production-ready systems
Integrate AI video generation capabilities with large-scale advertising platforms and consumer products
Full-Stack Development: Develop end-to-end solutions encompassing backend AI service APIs, model inference optimization, and frontend interfaces
Cross-Functional Collaboration: Work closely with Applied Scientists, Machine Learning Engineers, Product Managers, and Ads Platform teams
Technical Leadership: Drive architectural decisions for scalable, reliable, and cost-effective AI service deployment
Mentor junior engineers and promote engineering best practices
Live Site Ownership: Participate in on-call rotations and act as a Designated Responsible Individual (DRI) to ensure the health, performance, and reliability of services

Fulltime

Software Engineer, Research - Human Data

OpenAI’s mission is to ensure that artificial general intelligence (AGI) benefit...

Location

United States; United Kingdom, San Francisco; London

Salary:

230000.00 - 385000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Strong software engineering fundamentals
Experience building production systems at scale
Enjoy full-stack development with end-to-end ownership
Motivated by high-impact collaboration with research teams and solving novel, ambiguous problems
Excited to shape how AI systems learn from human preferences and reflect a broad range of human values
Care deeply about inclusive tooling and building systems that enhance model safety, reliability, and usefulness

Job Responsibility

Build and maintain robust full-stack systems for feedback collection, data labeling, and evaluation pipelines, while maintaining high levels of security
Translate experimental alignment research into scalable production infrastructure, including inference and model training stacks
Design and iterate on user-facing tools and backend services to support high-quality data workflows
Partner with researchers, engineers, and program leads to shape feedback loops and model interaction paradigms
Drive infrastructure improvements that enable faster iteration and scaling across OpenAI’s frontier models, from internal research tooling all the way to production ChatGPT

What we offer

Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth
Daily meals in our offices, and meal delivery credits as eligible

Fulltime

AI Product Performance Engineer

WHAT YOU DO AT AMD CHANGES EVERYTHING. At AMD, our mission is to build great pro...

Location

China, Shenzhen

Salary:

Not provided

AMD

Expiration Date

Until further notice

Requirements

Deep knowledge of Data Center AI workloads such as LLM, Generative AI, Recommendation, NLP, Video Analytics, and/or transformers
Hands-on experience with various AI models, end-to-end pipelines, industry frameworks / SDKs, and solutions
GPU Architecture Mastery
Kernel Programming Expertise: Strong proficiency in C++ and parallel computing, with extensive hands-on experience in NVIDIA CUDA or AMD HIP kernel programming
Performance Engineering: Demonstrated ability to debug and profile complex GPU workloads
Systems Knowledge: Familiarity with asynchronous execution, stream management, and host-device memory transfers
Python DSLs & Triton: Experience implementing kernels using OpenAI Triton or other Python-based DSLs
Inference Engine Experience: Hands-on experience integrating custom kernels into large-scale inference frameworks such as vLLM, SGLang, or TensorRT-LLM
Deep Learning Frameworks: Familiarity with writing custom extensions or operators for PyTorch (C++/CUDA extensions)
Hardware Agnosticism: Experience porting kernels between NVIDIA and AMD architectures or working with cross-platform HPC libraries

Job Responsibility

High-Performance Kernel Development: Design, implement, and optimize high-performance GPU kernels for AI/ML workloads to maximize hardware utilization
Performance Optimization: Analyze and optimize kernel execution for latency and throughput, addressing bottlenecks in memory bandwidth, instruction latency, and thread divergence
Workload Analysis: Evaluate the end-to-end performance impact of individual kernels on full-stack AI models, ensuring that micro-optimizations translate to application-level speedups
Profiling & Tuning: Utilize advanced GPU profiling tools (e.g., ROCm Profiler, PyTorch Profiler) to identify performance cliffs, stall pipelines, and memory hierarchy inefficiencies
Architecture Adaptation: Tailor implementation strategies to leverage specific features of modern GPU architectures (e.g., Matrix Cores, HBM characteristics)
Framework Integration: Collaborate with software stack teams to expose optimized kernels within high-level frameworks and inference engines

What we offer

AMD benefits at a glance

Fulltime

AI Solutions Architect / Field Application Engineer

We are looking for an AI enthusiast with strong technical fundamentals and custo...

Location

United States, Austin

Salary:

102320.00 - 153480.00 USD / Year

AMD

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Electrical Engineering, Computer Engineering, or a related field (or equivalent practical experience)
Strong interest in AI/ML technologies and a desire to work across hardware and software layers
Hands-on experience with Linux-based systems
Programming experience in one or more of the following: Python, C/C++, Bash
Familiarity with AI frameworks or tools (e.g., PyTorch, TensorFlow, ONNX, Hugging Face, or similar)
Strong communication skills with the ability to explain technical concepts clearly
Ability to work effectively in a team-oriented, cross-functional environment

Job Responsibility

Serve as a technical point of contact for customers, supporting AI and HPC workloads on AMD CPU and GPU platforms
Work directly with customers to understand their use cases, requirements, and constraints, and guide them through solution design and deployment
Deliver technical presentations, demos, and architecture walkthroughs to both technical and non-technical audiences
Program-manage customer opportunities as they grow in complexity, coordinating activities across internal and external stakeholders
Perform hands-on system bring-up including hardware installation, firmware configuration, OS installation, and driver setup
Deploy and validate open-source AI and HPC software stacks (e.g., Linux, ROCm, AI frameworks, containers)
Run functionality, performance, and scalability benchmarks on CPU and GPU workloads
Perform first-level profiling and analysis of applications to identify performance bottlenecks and optimization opportunities
Support AI workloads such as training, inference, and data preprocessing across CPU and GPU platforms
Develop working knowledge of AMD CPU and GPU architectures and how they impact real-world workloads

Fulltime

Senior AI Presales Consultant

We are seeking a high-impact, strategic AI Presales Consultant to join our elite...

Location

India, Mumbai

Salary:

Not provided

Eviden

Expiration Date

Until further notice

Requirements

7+ years in a customer-facing technical role (e.g., Presales, Solutions Architecture, AI Specialist, or Technical Consulting), with a proven track record of designing large-scale AI, ML, or HPC solutions
Deep, hands-on understanding of LLM architectures. Must be able to architect, explain, and build PoCs for RAG pipelines, including vector databases (e.g., Milvus, Pinecone, Chroma), embedding models, and data ingestion strategies
Direct experience in sizing AI infrastructure. Must be able to perform "napkin math" and detailed calculations for GPU, CPU, memory, and network requirements
Must be able to fluently discuss performance metrics (tokens/second, latency, throughput, TFLOPS) and their relationship to hardware choice (e.g., NVIDIA H100 vs. A100, memory bandwidth, interconnects like NVLink/InfiniBand)
Expertise in the AI software stack. Strong understanding of MLOps principles (Kubeflow, MLflow), Kubernetes (K8s) for AI workloads, and model serving platforms (NVIDIA Triton, KServe, or similar)
Strong, current knowledge of the AI model landscape (e.g., Llama family, Mistral, GPT-family, foundation models). Ability to discuss fine-tuning techniques, quantization, and pruning
Exceptional communication, whiteboarding, and presentation skills. Ability to translate executive-level business needs into detailed technical architecture and build a compelling C-level value proposition
Bachelor's or Master's degree in Computer Science, AI, Data Science, or a related engineering field

Job Responsibility

Strategic Client Advisory: Lead executive-level "Art of the Possible" workshops and technical discovery sessions to understand a client's business goals, data readiness, and AI maturity
Full-Stack Solution Architecture: Design holistic, end-to-end AI solutions that synergize our supercomputing hardware, AI software platform, and MLOps capabilities to meet specific client needs
Generative AI & LLM Expertise: Act as the subject matter expert on Generative AI. Architect and evangelize scalable data ingestion and preparation pipelines, specializing in Retrieval-Augmented Generation (RAG) frameworks
Infrastructure Sizing & Performance Modelling: Analyse customer workloads (data volume, model complexity, training frequency, inference throughput) to accurately size the required platform infrastructure, including Kubernetes clusters, data storage, and software licenses. This includes calculating compute, storage, and network requirements based on key performance metrics like model parameters, token performance (tokens/sec), desired latency, and concurrent user load
Model & Software Consultation: Advise clients on AI model selection, comparing the trade-offs of open-source vs. proprietary LLMs, fine-tuning vs. foundation models, and model quantization
Position and demonstrate our proprietary AI software platform, MLOps tools, and libraries, integrating them into the client's ecosystem
Inference Optimization: Design and architect robust, low-latency, and high-throughput inference solutions for complex AI models, including large-scale LLM serving
User Experience (UX) Advocacy: Collaborate with client teams to define the end-user experience, ensuring the solution delivers tangible business value and a seamless interface for data scientists, analysts, and application users
Sales Cycle Enablement: Own the technical narrative throughout the sales cycle. Build and deliver compelling presentations, custom demonstrations, and Proofs of Concept (PoCs). Lead the technical response to complex RFIs/RFPs

Fulltime

Head of Inference Kernels

As a core member of the team, you will play a pivotal role in leading a high-per...

Location

United States, San Jose

Salary:

200000.00 - 300000.00 USD / Year

Etched

Expiration Date

Until further notice

Requirements

Experience designing and optimizing GPU kernels for deep learning using CUDA and assembly (ASM)
Experience with low-level programming to maximize performance for AI operations, leveraging tools like Composable Kernel (CK), CUTLASS, and Triton for multi-GPU and multi-platform performance
Deep fluency with transformer inference architecture, optimization levers, and full-stack systems (e.g., vLLM, custom runtimes)
History of delivering tangible perf wins on GPU hardware or custom AI accelerators
Solid understanding of roofline models of compute throughput, memory bandwidth and interconnect performance
Experienced in running large-scale workloads on heterogeneous compute clusters, optimizing for efficiency and scalability of AI workloads
Scopes projects crisply, sets aggressive but realistic milestones, and drives technical decision-making across the team
Anticipates blockers and shifts resources proactively

Job Responsibility

Architect Best-in-Class Inference Performance on Sohu: Deliver continuous batching throughput exceeding B200 by ≥10x on priority workloads
Develop Best-in-Performance Inference Mega Kernels: Develop complex, fused kernels that increase chip utilization and reduce inference latency, and validate these optimizations through benchmarking and regression testing in production pipelines
Architect Model Mapping Strategies: Develop system-level optimizations using a mix of techniques such as tensor parallelism and expert parallelism for optimal performance
Hardware-Software Co-design of Inference-time Algorithmic Innovation: Develop and deploy production-ready inference-time algorithmic improvements (e.g., speculative decoding, prefill-decode disaggregation, KV cache offloading)
Build Scalable Team and Roadmap: Grow and retain a team of high-performing inference optimization engineers
Cross-Functional Performance Alignment: Ensure inference stack and performance goals are aligned with the software infrastructure teams, GTM and hardware teams for future generations of our hardware

What we offer

Medical, dental, and vision packages with generous premium coverage
$500 per month credit for waiving medical benefits
Housing subsidy of $2k per month for those living within walking distance of the office
Relocation support for those moving to San Jose (Santana Row)
Various wellness benefits covering fitness, mental health, and more
Daily lunch + dinner in our office
Significant equity package

Fulltime

AI Solutions Architect / Field Application Engineer

We are looking for an AI enthusiast with strong technical fundamentals and custo...

Location

United States, Austin

Salary:

128400.00 - 192600.00 USD / Year

AMD

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Electrical Engineering, Computer Engineering, or a related field (or equivalent practical experience)
Strong interest in AI/ML technologies and a desire to work across hardware and software layers
Hands-on experience with Linux-based systems
Programming experience in one or more of the following: Python, C/C++, Bash
Familiarity with AI frameworks or tools (e.g., PyTorch, TensorFlow, ONNX, Hugging Face, or similar)
Strong communication skills with the ability to explain technical concepts clearly
Ability to work effectively in a team-oriented, cross-functional environment

Job Responsibility

Serve as a technical point of contact for customers, supporting AI and HPC workloads on AMD CPU and GPU platforms
Work directly with customers to understand their use cases, requirements, and constraints, and guide them through solution design and deployment
Deliver technical presentations, demos, and architecture walkthroughs to both technical and non-technical audiences
Program-manage customer opportunities as they grow in complexity, coordinating activities across internal and external stakeholders
Perform hands-on system bring-up including hardware installation, firmware configuration, OS installation, and driver setup
Deploy and validate open-source AI and HPC software stacks (e.g., Linux, ROCm, AI frameworks, containers)
Run functionality, performance, and scalability benchmarks on CPU and GPU workloads
Perform first-level profiling and analysis of applications to identify performance bottlenecks and optimization opportunities
Support AI workloads such as training, inference, and data preprocessing across CPU and GPU platforms
Develop working knowledge of AMD CPU and GPU architectures and how they impact real-world workloads

Fulltime

AI Systems Engineer - Agentic Autonomy

We are seeking an AI Systems Engineer with deep expertise in large language mode...

Location

United States, Greater Boston

Salary:

140000.00 - 180000.00 USD / Year

HavocAI

Expiration Date

Until further notice

Requirements

Bachelor’s, Master’s, or PhD in Computer Science, Machine Learning, Robotics, or a related field
Deep hands-on experience building with LLMs and multi-agent/agentic AI frameworks
Strong software engineering background in modern ML frameworks, cloud orchestration, and API development
Experience integrating AI systems into larger software architectures or robotics/autonomy workflows
Understanding of RAG pipelines, tool-use frameworks, LLM function-calling, memory systems, and agent orchestration
Experience with safety evaluation, model alignment, or mission-critical AI system validation
Ability to lead system-level design discussions and coordinate across multiple engineering disciplines
Must be a U.S. Citizen and eligible to obtain a Secret Clearance

Job Responsibility

Lead the design and development of LLM-powered software modules for mission reasoning, planning, operator interaction, and autonomous decision support
Integrate LLMs and agentic systems into HavocAI’s autonomy architecture, including ROS/ROS2 systems, planning engines, and mission software
Build multi-agent, tool-using AI systems that interact with perception data, mission databases, simulation systems, and operator inputs
Develop APIs, wrappers, and orchestration layers enabling LLMs to interface safely with embedded, cloud, and edge compute environments
Optimize LLM inference pipelines for performance, latency, and reliability in field-deployed systems
Evaluate model behavior, perform safety testing, and develop guardrails for mission-critical use cases
Collaborate with autonomy, embedded, simulation, and full-stack teams to define requirements and ensure robust system-level integration
Guide strategic decisions on model selection, fine-tuning approaches, safety frameworks, and long-term AI architecture
Contribute to field testing, operator evaluations, and iterative deployment cycles for AI-augmented autonomy systems

What we offer

100% employer-paid Health, Dental and Vision Insurance for you and your family
Life Insurance (Employer Paid)
Ability to participate in the company's 401(k) program (Matching)
Unlimited PTO policy with an enforced 2-week minimum
Equity Package
Work / Home Office Stipend
Global Entry
16-Week Paid Parental Leave
Monthly Health and Wellness Stipend

Fulltime
