Member of technical staff (Inference)

H Company

Location:
Paris, France; United Kingdom

Contract Type:
Not provided

Salary:
Not provided

Job Description:

The Inference team develops and enhances the inference stack for serving the H-models that power our agent technology. The team focuses on optimizing hardware utilization to achieve high throughput, low latency, and cost efficiency, delivering a seamless user experience.

Job Responsibility:

  • Develop scalable, low-latency, and cost-effective inference pipelines
  • Optimize model performance across memory usage, throughput, and latency, using advanced techniques such as distributed computing, model compression, quantization, and caching mechanisms
  • Develop specialized GPU kernels for performance-critical tasks such as attention mechanisms and matrix multiplications (a minimal kernel sketch follows this list)
  • Collaborate with H research teams on model architectures to enhance efficiency during inference
  • Review state-of-the-art papers to improve memory usage, throughput, and latency (FlashAttention, PagedAttention, continuous batching, etc.)
  • Prioritize and implement state-of-the-art inference techniques
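
A minimal sketch of a custom GPU kernel in OpenAI Triton, in the spirit of the kernel work above; the fused scale-add operation and all names here are illustrative assumptions, not H Company code:

    # Hypothetical fused kernel computing out = x * scale + y in one pass.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def scale_add_kernel(x_ptr, y_ptr, out_ptr, n, scale, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n                       # guard the ragged final block
        x = tl.load(x_ptr + offs, mask=mask)
        y = tl.load(y_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x * scale + y, mask=mask)

    def scale_add(x, y, scale=2.0):
        out = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)        # one program instance per 1024 elements
        scale_add_kernel[grid](x, y, out, n, scale, BLOCK=1024)
        return out

Fusing the multiply and add into one kernel saves a round trip through GPU memory, the same bandwidth argument behind FlashAttention-style fusion.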

Requirements:

  • MS or PhD in Computer Science, Machine Learning or related fields
  • Proficient in at least one of the following programming languages: Python, Rust, or C/C++
  • Experience in GPU programming such as CUDA, OpenAI Triton, Metal, etc.
  • Experience in model compression and quantization techniques
  • Collaborative mindset, thriving in dynamic, multidisciplinary teams
  • Strong communication and presentation skills
  • Eager to explore new challenges

Nice to have:

  • Experience with LLM serving frameworks such as vLLM, TensorRT-LLM, SGLang, llama.cpp, etc. (a minimal vLLM sketch follows this list)
  • Experience with CUDA kernel programming and NCCL
  • Experience with deep learning inference frameworks (PyTorch/ExecuTorch, ONNX Runtime, GGML, etc.)
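
A minimal sketch of offline batched generation with vLLM, the first framework named above; the placeholder checkpoint and prompt are illustrative assumptions:

    from vllm import LLM, SamplingParams

    # Small public checkpoint as a stand-in; H-models are not publicly available.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.7, max_tokens=64)
    outputs = llm.generate(["Explain paged attention in one sentence."], params)
    print(outputs[0].outputs[0].text)

vLLM applies PagedAttention and continuous batching internally, which is why a one-call API like this can still keep the GPU saturated under many concurrent requests.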

What we offer:
  • Join the exciting journey of shaping the future of AI, and be part of the early days of one of the hottest AI startups
  • Collaborate with a fun, dynamic and multicultural team, working alongside world-class AI talent in a highly collaborative environment
  • Enjoy a competitive salary
  • Unlock opportunities for professional growth, continuous learning, and career development

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Member of technical staff (Inference)

Member of Technical Staff, Cloud Infrastructure

As a Software Engineer on our Cloud Infrastructure team, you'll be at the forefr...
Location:
United States: New York, NY; San Mateo, CA; Redwood City, CA
Salary:
USD 175,000 - 220,000 / year
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 5+ years of experience designing and building backend infrastructure in cloud environments (e.g., AWS, GCP, Azure)
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, TensorFlow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Strong software development skills in languages like Python or C++
  • Deep understanding of distributed systems fundamentals: scheduling, orchestration, storage, networking, and compute optimization
Job Responsibility:
  • Architect and build scalable, resilient, and high-performance backend infrastructure to support distributed training, inference, and data processing pipelines
  • Lead technical design discussions, mentor other engineers, and establish best practices for building and operating large-scale ML infrastructure
  • Design and implement core backend services (e.g., job schedulers, resource managers, autoscalers, model serving layers) with a focus on efficiency and low latency (a minimal scheduler sketch follows this list)
  • Drive infrastructure optimization initiatives, including compute cost reduction, storage lifecycle management, and network performance tuning
  • Collaborate cross-functionally with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions
  • Continuously evaluate and integrate cloud-native and open-source technologies (e.g., Kubernetes, Ray, Kubeflow, MLFlow) to enhance our platform’s capabilities and reliability
  • Own end-to-end systems from design to deployment and observability, with a strong emphasis on reliability, fault tolerance, and operational excellence
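
A minimal sketch of one of the core backend services named above, a priority job scheduler; the heap-based design and all names are illustrative assumptions, not Fireworks infrastructure:

    import heapq
    import itertools

    class JobScheduler:
        """Pop the lowest priority number first; FIFO among equal priorities."""

        def __init__(self):
            self._heap = []
            self._seq = itertools.count()  # tie-breaker preserves insertion order

        def submit(self, job, priority=0):
            heapq.heappush(self._heap, (priority, next(self._seq), job))

        def run_next(self):
            if self._heap:
                _, _, job = heapq.heappop(self._heap)
                job()

    sched = JobScheduler()
    sched.submit(lambda: print("latency-sensitive job"), priority=0)
    sched.submit(lambda: print("batch job"), priority=10)
    sched.run_next()  # prints "latency-sensitive job"
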
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package
Employment Type: Full-time

Member of Technical Staff, Performance Optimization

We're looking for a Software Engineer focused on Performance Optimization to hel...
Location:
United States: San Mateo
Salary:
USD 175,000 - 220,000 / year
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent practical experience
  • 5+ years of experience working on performance optimization or high-performance computing systems
  • Proficiency in CUDA or ROCm and experience with GPU profiling tools (e.g., Nsight, nvprof, CUPTI)
  • Familiarity with PyTorch and performance-critical model execution
  • Experience with distributed system debugging and optimization in multi-GPU environments
  • Deep understanding of GPU architecture, parallel programming models, and compute kernels
Job Responsibility:
  • Optimize system and GPU performance for high-throughput AI workloads across training and inference
  • Analyze and improve latency, throughput, memory usage, and compute efficiency
  • Profile system performance to detect and resolve GPU- and kernel-level bottlenecks (a minimal profiling sketch follows this list)
  • Implement low-level optimizations using CUDA, Triton, and other performance tooling
  • Drive improvements in execution speed and resource utilization for large-scale model workloads (LLMs, VLMs, and video models)
  • Collaborate with ML researchers to co-design and tune model architectures for hardware efficiency
  • Improve support for mixed precision, quantization, and model graph optimization
  • Build and maintain performance benchmarking and monitoring infrastructure
  • Scale inference and training systems across multi-GPU, multi-node environments
  • Evaluate and integrate optimizations for emerging hardware accelerators and specialized runtimes
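
A minimal sketch of the kernel-level profiling described above, using torch.profiler; the toy model and shapes are illustrative assumptions:

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Toy workload standing in for a real model.
    model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU()).cuda()
    x = torch.randn(64, 4096, device="cuda")

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with torch.no_grad():
            for _ in range(10):
                model(x)
        torch.cuda.synchronize()  # flush queued kernels before the trace closes

    # Slowest CUDA kernels first, i.e. the optimization candidates.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
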
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package
Employment Type: Full-time

Member of Technical Staff – Backend

As a backend engineer at Inflection, you will own the platforms, systems, and se...
Location:
United States: Palo Alto
Salary:
USD 175,000 - 350,000 / year
Inflection AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience building and scaling backend systems for high-throughput applications
  • Fluent in building distributed systems with Python, Go, Rust, or similar languages
  • Comfortable with cloud-native architectures (e.g., Kubernetes, gRPC, Postgres, Redis, Kafka)
  • Owned backend services end-to-end—from design and implementation to deployment, monitoring, and debugging
  • Thrive in fast-paced environments where you can move quickly without sacrificing engineering rigor
  • Proactively improve tooling and infrastructure to support teammates’ workflows and reliability goals
  • Communicate clearly across disciplines and take pride in solving user-facing problems with clean backend solutions
  • Have a bachelor’s degree, or equivalent experience, in a field related to the position
Job Responsibility:
  • Design and implement scalable backend systems and APIs that power production LLM experiences, including agentic workflows, memory systems, and tool integrations
  • Build and operate high-availability infrastructure to support real-time inference, retrieval, and conversation pipelines
  • Develop internal platforms to improve engineering productivity—CI/CD pipelines, service templates, observability frameworks, and rollout tooling
  • Collaborate closely with applied research and frontend teams to rapidly prototype, ship, and iterate on end-user features
  • Ensure systems meet our high bar for security, uptime, and latency through incident response, load testing, monitoring, and automation (a minimal load-test sketch follows this list)
  • Participate in on-call rotations to maintain the reliability of the services you build
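
A minimal sketch of the load testing mentioned above, using asyncio and httpx; the endpoint URL and request count are illustrative assumptions:

    import asyncio
    import time

    import httpx

    async def load_test(url: str, n: int = 100) -> None:
        async with httpx.AsyncClient() as client:
            t0 = time.perf_counter()
            responses = await asyncio.gather(*(client.get(url) for _ in range(n)))
            dt = time.perf_counter() - t0
        ok = sum(r.status_code == 200 for r in responses)
        print(f"{n} requests in {dt:.2f}s ({n / dt:.1f} req/s), {ok} OK")

    # Hypothetical local health endpoint:
    asyncio.run(load_test("http://localhost:8000/healthz"))
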
What we offer:
  • Diverse medical, dental and vision options
  • 401k matching program
  • Unlimited paid time off
  • Parental leave and flexibility for all parents and caregivers
  • Support of country-specific visa needs for international employees living in the Bay Area
  • Competitive stock options

Member of Technical Staff - Platform Engineer

Platform Engineer to join our team building backend infrastructure for new ML-po...
Location:
United States: Palo Alto
Salary:
USD 175,000 - 350,000 / year
Inflection AI
Expiration Date:
Until further notice
Requirements:
  • Backend engineering experience with Python, TypeScript, or Node.js
  • Hands-on experience working with production PyTorch models, model checkpoints, and inference logic
  • Strong knowledge of building APIs and services that are scalable, stable, and secure
  • Passion for bridging backend engineering and ML systems, especially at the infrastructure layer
  • Familiarity with tools such as FastAPI, Postgres, Redis, Kubernetes, and React
  • Desire to be hands-on and contribute to shaping the foundation of a new enterprise ML product
  • Have a bachelor’s degree, or equivalent experience, in a field related to the position
Job Responsibility:
  • Build and maintain backend services to support LLM integration, inference orchestration, and data flow (a minimal FastAPI sketch follows this list)
  • Write clean, reliable Python code for experimentation, model integration, and production systems
  • Collaborate closely with ML researchers to rapidly iterate on product ideas and deploy features
  • Design and implement infrastructure to handle scalable inference workloads and enterprise-level use cases
  • Own system components and ensure reliability, observability, and maintainability from day one
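
A minimal sketch of a backend service of the sort described above, using FastAPI (one of the tools listed in the requirements); the /generate route and echo stub are illustrative assumptions, not Inflection's API:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class GenerateRequest(BaseModel):
        prompt: str
        max_tokens: int = 64

    @app.post("/generate")
    def generate(req: GenerateRequest) -> dict:
        # Stub: a real service would dispatch to the model runtime here.
        return {"completion": f"echo: {req.prompt[: req.max_tokens]}"}

    # Run with: uvicorn service:app --host 0.0.0.0 --port 8000
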
What we offer:
  • Diverse medical, dental and vision options
  • 401k matching program
  • Unlimited paid time off
  • Parental leave and flexibility for all parents and caregivers
  • Support of country-specific visa needs for international employees living in the Bay Area
  • Competitive stock options

Member of Technical Staff – Model Training

At Inflection AI, our public benefit mission is to harness the power of AI to im...
Location:
United States: Palo Alto
Salary:
USD 175,000 - 350,000 / year
Inflection AI
Expiration Date:
Until further notice
Requirements:
  • Have hands-on experience training and fine-tuning large transformer models on multi-GPU / multi-node clusters
  • Are fluent in PyTorch and its ecosystem tools (Torchtune, FSDP, DeepSpeed) and enjoy digging into distributed-training internals, mixed precision, and memory-efficiency tricks
  • Have shipped or published work in RLHF, DPO, GRPO, or RLAIF and understand their practical trade-offs
  • Care deeply about training tools, pipelines, and reproducibility—you automate the boring parts so you can iterate on the fun parts
  • Balance research curiosity with product pragmatism—you know when to run an ablation and when to ship
  • Communicate crisply with both technical and non-technical teammates
  • Have a bachelor’s degree, or equivalent experience, in a field related to the position
Job Responsibility:
  • Contribute to end-to-end post-training workflows (dataset curation, hyper-parameter search, evaluation, and rollout) using PyTorch, Torchtune, FSDP/DeepSpeed, and our internal orchestration stack (a minimal FSDP sketch follows this list)
  • Prototype and compare alignment techniques (e.g., curriculum RL, multi-objective reward modeling, tool-use fine-tuning) and push the best ideas into production
  • Automate training at scale: build robust pipeline components, tools, scripts, and dashboards so experiments are reproducible and easy to trace
  • Define the metrics that matter, run A/B tests, and iterate quickly to meet aggressive quality targets
  • Collaborate with inference, safety, and product teams to land improvements in customer-facing systems
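
A minimal sketch of the FSDP setup named above; the toy module and launch assumptions (torchrun, NCCL backend) are illustrative, not Inflection's training stack:

    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    # Launch with: torchrun --nproc_per_node=<num_gpus> train.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Toy module standing in for a transformer block.
    model = torch.nn.Linear(4096, 4096).cuda()
    model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks
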
What we offer:
  • Diverse medical, dental and vision options
  • 401k matching program
  • Unlimited paid time off
  • Parental leave and flexibility for all parents and caregivers
  • Support of country-specific visa needs for international employees living in the Bay Area
  • Competitive stock options

Member of Technical Staff, Inference

We're looking for an ML infrastructure engineer to bridge the gap between resear...
Location:
United States
Salary:
USD 240,000 - 290,000 / year
Runway
Expiration Date:
Until further notice
Requirements:
  • 4+ years of experience running ML model inference at scale in production environments
  • Strong experience with PyTorch and multi-GPU inference for large models
  • Experience with Kubernetes for ML workloads—deploying, scaling, and debugging GPU-based services
  • Comfortable working across multiple cloud providers and managing GPU driver compatibility
  • Experience with monitoring and observability for ML systems (errors, throughput, GPU utilization)
  • Self-starter who can work embedded with research teams and move fast
  • Strong systems thinking and pragmatic approach to production reliability
  • Humility and open-mindedness
Job Responsibility:
  • Productionize model checkpoints end-to-end: from research completion to internal testing to production deployment to post-release support
  • Build and optimize inference systems for large-scale generative models running on multi-GPU environments
  • Design and implement model serving infrastructure specialized for diffusion models and real-time diffusion workflows
  • Add monitoring and observability for new model releases: track errors, throughput, GPU utilization, and latency (a minimal NVML sketch follows this list)
  • Embed with research teams to gather training data, run preprocessing scripts, and support the model development process
  • Explore and integrate with GPU inference providers (Modal, E2E, Baseten, etc.)
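
A minimal sketch of the GPU-utilization tracking mentioned above, via NVML (pynvml); the single-device assumption and output format are illustrative:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes GPU 0
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"gpu={util.gpu}%  mem={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
    pynvml.nvmlShutdown()
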
Employment Type: Full-time

Member of Technical Staff - Inference

Prime Intellect is building the open superintelligence stack - from frontier age...
Location:
United States: San Francisco
Salary:
Not provided
Prime Intellect
Expiration Date:
Until further notice
Requirements:
  • 3+ years building and running large‑scale ML/LLM services with clear latency/availability SLOs
  • Hands‑on with at least one of vLLM, SGLang, TensorRT‑LLM
  • Familiarity with distributed and disaggregated serving infrastructure such as NVIDIA Dynamo
  • Deep understanding of prefill vs. decode, KV‑cache behavior, batching, sampling, speculative decoding, parallelism strategies
  • Comfortable debugging CUDA/NCCL, drivers/kernels, containers, service mesh/networking, and storage, owning incidents end‑to‑end
  • Python: Systems tooling and backend services
  • PyTorch: LLM Inference engine development and integration, deployment readiness
  • AWS/GCP service experience, cloud deployment patterns
  • Running infrastructure at scale with containers on Kubernetes
  • GPU architecture, CUDA runtime, NCCL, InfiniBand
Job Responsibility:
  • Build a multi-tenant LLM serving platform that operates across our cloud GPU fleets
  • Design placement and scheduling algorithms for heterogeneous accelerators
  • Implement multi‑region/zone failover and traffic shifting for resilience and cost control
  • Build autoscaling, routing, and load balancing to meet throughput/latency SLOs
  • Optimize model distribution and cold-start times across clusters
  • Integrate and contribute to LLM inference frameworks such as vLLM, SGLang, TensorRT‑LLM
  • Optimize configurations for tensor/pipeline/expert parallelism, prefix caching, memory management and other axes for maximum performance
  • Profile kernels, memory bandwidth, and transport; apply techniques such as quantization and speculative decoding
  • Develop reproducible performance suites (latency, throughput, context length, batch size, precision); a minimal harness sketch follows this list
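
A minimal sketch of a reproducible latency/throughput harness per the last bullet; the warmup and iteration counts, and the fn callable, are illustrative assumptions:

    import statistics
    import time

    def benchmark(fn, warmup: int = 3, iters: int = 20) -> None:
        for _ in range(warmup):
            fn()  # exclude cold-start effects from the measurement
        latencies = []
        for _ in range(iters):
            t0 = time.perf_counter()
            fn()
            latencies.append(time.perf_counter() - t0)
        p50 = statistics.median(latencies)
        p99 = statistics.quantiles(latencies, n=100)[98]
        print(f"p50={p50 * 1e3:.2f} ms  p99={p99 * 1e3:.2f} ms  "
              f"throughput={1.0 / p50:.1f} req/s")

    benchmark(lambda: sum(range(1_000_000)))  # stand-in for an inference call
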
What we offer:
  • Competitive compensation with significant equity incentives
  • Flexible work arrangement (remote or San Francisco office)
  • Full visa sponsorship and relocation support
  • Professional development budget
  • Regular team off-sites and conference attendance
  • Opportunity to shape decentralized AI and RL at Prime Intellect
Employment Type: Full-time

Member of Technical Staff - Edge Inference Engineer

Our Edge Inference team compiles Liquid Foundation Models into optimized machine...
Location:
United States: San Francisco; Boston
Salary:
Not provided
Liquid AI
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in systems programming with strong C++ proficiency
  • Embedded software engineering experience or work on resource-constrained systems
  • Understanding of ML fundamentals at the linear algebra level (how matrix operations, attention, and quantization work)
  • Experience with hardware architecture concepts: cache hierarchies, memory bandwidth, SIMD/vectorization
Job Responsibility:
  • Implement and optimize inference kernels for CPU, NPU, and GPU architectures across diverse edge hardware
  • Develop quantization strategies (INT4, INT8, FP8) that maximize compression while preserving model quality under strict memory budgets (a minimal INT8 sketch follows this list)
  • Contribute to llama.cpp and other open-source inference frameworks, including new model architectures (audio, vision)
  • Profile and optimize end-to-end inference pipelines to achieve sub-100ms time-to-first-token on target devices
  • Collaborate with ML researchers to understand model architectures and identify optimization opportunities specific to Liquid Foundation Models
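
A minimal sketch of symmetric per-tensor INT8 weight quantization, the simplest instance of the quantization work above; production INT4/FP8 schemes with per-channel or per-group scales are more involved:

    import torch

    def quantize_int8(w: torch.Tensor):
        scale = w.abs().max() / 127.0  # one scale shared by the whole tensor
        q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.to(torch.float32) * scale

    w = torch.randn(4096, 4096)
    q, s = quantize_int8(w)  # 4x smaller than float32 at rest
    print(f"max abs error: {(dequantize(q, s) - w).abs().max():.4f}")
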
What we offer:
  • Competitive base salary with equity in a unicorn-stage company
  • 100% of medical, dental, and vision premiums for employees and dependents
  • 401(k) matching up to 4% of base pay
  • Unlimited PTO plus company-wide Refill Days throughout the year
Employment Type: Full-time