You’ll take on challenging engineering tasks crucial to the development of tabular foundation models. You’ll build and maintain best-in-class training infrastructure, along with our developer productivity tooling and open source projects. You’ll work closely with researchers to ensure that we can iterate quickly and scale our models.
Job Responsibilities:
Training & research compute infrastructure: Own our cloud GPU cluster (operations, reliability, and cost/performance) currently based on Slurm. Design and implement future versions as our compute needs scale and we expand across multiple cloud/HPC providers
Training & inference performance: Work closely with researchers to identify and resolve performance bottlenecks in distributed training and inference. Support high hardware utilization and efficient memory usage through systems-level debugging, profiling, and infrastructure improvements
Developer productivity: Manage our internal repositories on GitHub and keep their CI and other pipelines fast. Ensure our experiment tracking, model registry, and data processing pipelines run smoothly
Try out your own ideas! We operate an open environment. If you’ve got the next SOTA tabular architecture up your sleeve, go ahead and train it
Requirements:
Exceptional software engineering fundamentals and expert-level Python proficiency, with 5+ years of hands-on industry experience building and operating production systems
Proven track record of designing and building complex, scalable software, preferably for data processing or distributed systems
Deep, practical knowledge of the modern ML ecosystem (PyTorch, scikit-learn, etc.) and a genuine interest in applying systems thinking to solve hard problems in AI
Core MLOps Concepts: Strong understanding of the entire machine learning lifecycle (MLLC) from data ingestion and preparation to model deployment, monitoring, and retraining. Familiarity with MLOps principles and best practices (e.g., reproducibility, versioning, automation, continuous integration/delivery for ML)
What we offer:
Competitive compensation package with meaningful equity
30 days of paid vacation + public holidays
Comprehensive benefits including healthcare, transportation, and fitness
Work with state-of-the-art ML architecture and substantial compute resources, alongside a world-class team