The Staff Software Engineer for the Model LifeCycle team will play a key role in building a comprehensive managed platform for the entire application development lifecycle, with a specific focus on leveraging Machine Learning models, including Large Language Models (LLMs). This role offers significant scope for ownership: you'll implement core systems and contribute to their design.
Job Responsibilities:
Contribute to fine-tuning systems for large foundation models (SFT, PEFT, LoRA, adapters), including multi-node orchestration, checkpointing, failure recovery, and cost-efficient scaling
Implement and maintain end-to-end training pipelines for Large Language Models
Contribute to distillation and reinforcement learning pipelines (e.g., preference optimization, policy optimization, reward modeling)
Develop and maintain agent execution infrastructure
Implement features for dataset, model, and experiment management: versioning, lineage, evaluation, and reproducible fine-tuning at scale
Work closely with Principal Engineers, product, business, and platform teams to implement the core abstractions and APIs of the system
Contribute to architectural decisions around training runtimes, scheduling, storage, and model lifecycle management
Engage with the open-source LLM ecosystem
Requirements:
Bachelor's or Master's degree in Computer Science, Engineering, or a related field
8-10+ years of industry experience, with a demonstrated history of consistent success leading a varied portfolio of initiatives across your function
Proven track record of delivering production features on time
Experience using cloud-based services such as elastic compute, object storage, virtual private networks, and managed databases
Experience with Generative AI (Large Language Models, Multimodal)
Experience with AI infrastructure, including training and inference
Nice to have:
Proficiency in Golang or Python for large-scale, production-level services
Experience contributing to open-source AI projects
Experience with performance optimizations on GPU systems and inference frameworks
Experience working with PyTorch
Experience with training and fine-tuning LLMs
Proactive and collaborative approach with the ability to work independently
Strong communication and interpersonal skills
Passion for building cutting-edge AI products and solving challenging technical problems
What we offer:
Restricted Stock Units in a fast-growing, well-funded technology company
Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
Employer contributions to HSAs
Paid Parental Leave
Paid life insurance and short-term and long-term disability insurance