As a Senior Machine Learning Systems Engineer at Abridge, you’ll play a pivotal role in building and optimizing the core infrastructure that powers our machine learning models. Your work will be instrumental in enhancing the scalability, efficiency, and performance of our AI-driven solutions. You will work with our Infrastructure and Research teams to build, deploy, and optimize our AI models and to orchestrate across them.
Job Responsibilities:
Design, deploy and maintain scalable Kubernetes clusters for AI model inference and training
Develop, optimize, and maintain ML model serving and training infrastructure, ensuring high performance and low latency
Collaborate with ML and product teams to scale backend infrastructure for AI-driven products, focusing on model deployment, throughput optimization, and compute efficiency
Optimize compute-heavy workflows and enhance GPU utilization for ML workloads
Build a robust model API orchestration system
Collaborate with leadership to define and implement strategies for scaling infrastructure as the company grows, ensuring long-term efficiency and performance
Requirements:
Strong experience in building and deploying machine learning models in production environments
Deep understanding of container orchestration and distributed systems architecture
Expertise in Kubernetes administration, including custom resource definitions, operators, and cluster management
Experience developing APIs and managing distributed systems for both batch and real-time workloads
Excellent communication skills, with the ability to interface between research and product engineering
Nice to have:
Expertise with model serving frameworks such as NVIDIA Triton Inference Server, vLLM, and TRT-LLM
Expertise with ML toolchains such as PyTorch, TensorFlow, or distributed training and inference libraries
Familiarity with GPU cluster management and CUDA optimization
Knowledge of infrastructure as code (Terraform, Ansible) and GitOps practices
Experience with container registries, image optimization, and multi-stage builds for ML workloads
Experience orchestrating across ASR models or LLMs to build GenAI applications
What we offer:
Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
Paid Parental Leave: Generous paid parental leave for all full-time employees
Family Forming Benefits: Resources and financial support to help you build your family
401(k) Matching: Contribution matching to help invest in your future
Personal Device Allowance: Tax-free funds for personal device usage
Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals
Sabbatical Leave: Paid Sabbatical Leave after 5 years of employment
Compensation and Equity: Competitive compensation and equity grants for full-time employees