We are seeking a Senior / Staff LLM Systems Engineer to lead the development, optimization, and deployment of large language model inference pipelines. The role bridges ML research and platform engineering, with a focus on high-throughput, low-latency serving. This is not a training-focused position: the emphasis is on serving models at scale, optimizing the systems around them, and keeping production ML reliable.
Job Responsibilities:
Design, implement, and optimize inference pipelines for large language models
Improve throughput and latency of model serving in production environments (a minimal sketch follows this list)
Collaborate closely with infrastructure, platform, and ML research teams to ensure smooth deployment
Build monitoring, observability, and alerting systems for inference performance and reliability (see the sketch at the end of this posting)
Identify and solve scaling challenges across GPUs, TPUs, or distributed environments
Evaluate and adopt new technologies, frameworks, and architectures to improve inference efficiency
Mentor other engineers and contribute to technical strategy for production ML systems
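To make the throughput point above concrete, here is a minimal sketch of batched generation with vLLM, one of the serving frameworks named in the requirements below. The model name, prompts, and sampling settings are illustrative assumptions, not a prescribed stack.

```python
# Minimal sketch: throughput-oriented batched generation with vLLM.
# Model choice and sampling settings below are hypothetical examples.
from vllm import LLM, SamplingParams

# vLLM applies continuous batching internally, so passing many prompts
# in one call is the throughput-friendly pattern (versus one request
# per prompt).
prompts = [f"Summarize ticket #{i}: ..." for i in range(64)]
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # hypothetical model
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.outputs[0].text[:80])
```

Passing many prompts through a single generate() call lets vLLM's continuous batching keep the accelerator saturated, which is typically the first lever to pull when improving serving throughput.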
Requirements:
5+ years of software engineering experience, including hands-on ML systems experience
Strong background in distributed systems, performance tuning, and low-latency architectures
Experience with model serving frameworks (e.g., Triton, vLLM, Ray, TorchServe)
Familiarity with GPU/TPU infrastructure, multi-node deployment, and system-level optimization
Understanding of ML workloads and trade-offs between accuracy, latency, and cost
Proven ability to deliver production-grade ML systems at scale
Excellent collaboration and problem-solving skills
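As a concrete illustration of the monitoring and observability work this role involves, here is a minimal sketch that exposes a per-request latency histogram with the Prometheus Python client. The metric name, bucket boundaries, and the run_inference() helper are hypothetical placeholders, not a prescribed design.

```python
# Minimal sketch: per-request inference latency exported as a
# Prometheus histogram. Metric name, buckets, and run_inference()
# are hypothetical placeholders.
import time
from prometheus_client import Histogram, start_http_server

INFERENCE_LATENCY = Histogram(
    "llm_inference_latency_seconds",  # hypothetical metric name
    "End-to-end latency of one inference request",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def run_inference(prompt: str) -> str:
    time.sleep(0.1)  # stand-in for the real model call
    return "..."

def handle_request(prompt: str) -> str:
    # Record latency even if the model call raises, so error paths
    # still show up in the distribution.
    start = time.perf_counter()
    try:
        return run_inference(prompt)
    finally:
        INFERENCE_LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(9100)  # metrics scraped at :9100/metrics
    handle_request("hello")
```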