CrawlJobs Logo

ML Engineer, Training Infrastructure

priorlabs.ai Logo

Prior Labs

Location Icon

Location:
Germany; United States , Berlin

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

Not provided

Job Description:

You’ll take on challenging engineering tasks crucial to the development of tabular foundation models. You’ll work on building and maintaining best-in-class training infrastructure, while maintaining our developer productivity tooling and open source projects. You’ll work closely with researchers to ensure that we can iterate quickly and scale our models.

Job Responsibility:

  • Training & research compute infrastructure: Own our cloud GPU cluster (operations, reliability, and cost/performance) currently based on Slurm. Design and implement future versions as our compute needs scale and we expand across multiple cloud/HPC providers
  • Training & inference performance: Work closely with researchers to identify and resolve performance bottlenecks in distributed training and inference. Support high hardware utilization and efficient memory usage through systems-level debugging, profiling, and infrastructure improvements
  • Developer productivity: Manage our internal repositories on GitHub and keep their CI and other pipelines speedy. Ensure our experiment tracking, model registry, data processing pipelines are working smoothly
  • Try out your own ideas! We operate an open environment. If you’ve got the next SOTA tabular architecture up your sleeve, go ahead and train it

Requirements:

  • Exceptional software engineering fundamentals and expert-level Python proficiency, with 5+ years of hands-on industry experience building and operating production systems
  • Proven track record of designing and building complex, scalable software, preferably for data processing or distributed systems
  • Deep, practical knowledge of the modern ML ecosystem (PyTorch, scikit-learn, etc.) and a genuine interest in applying systems thinking to solve hard problems in AI
  • Core MLOps Concepts: Strong understanding of the entire machine learning lifecycle (MLLC) from data ingestion and preparation to model deployment, monitoring, and retraining. Familiarity with MLOps principles and best practices (e.g., reproducibility, versioning, automation, continuous integration/delivery for ML)
What we offer:
  • Competitive compensation package with meaningful equity
  • 30 days of paid vacation + public holidays
  • Comprehensive benefits including healthcare, transportation, and fitness
  • Work with state-of-the-art ML architecture, substantial compute resources and with a world-class team

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Fulltime
Work Type:
On-site work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for ML Engineer, Training Infrastructure

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location
Location
United States , San Francisco
Salary
Salary:
180000.00 - 270000.00 USD / Year
plaid.com Logo
Plaid
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility
Job Responsibility
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
What we offer
What we offer
  • medical, dental, vision, and 401(k)
  • Fulltime
Read More
Arrow Right

Data Infrastructure Engineer

A venture-backed startup at the intersection of AI and national security is buil...
Location
Location
United States , New York City Metropolitan Area
Salary
Salary:
Not provided
weareorbis.com Logo
Orbis Consultants
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Strong engineering experience in Python, Go, or C
  • Experience building and scaling production data systems
  • Hands-on expertise with model deployment and ML Ops practices
  • Knowledge of database design, performance tuning, and operations
  • Someone who thrives in early-stage, fast-paced environments and enjoys tackling complex challenges
Job Responsibility
Job Responsibility
  • Build and maintain the data pipelines and infrastructure that power ML applications
  • Deploy and manage models at scale, from training through production
  • Design APIs and services that integrate smoothly into mission-critical workflows
  • Ensure data is handled and secured properly across large, distributed environments
  • Collaborate closely with a small, fast-moving team to solve hard technical problems in real-world settings
What we offer
What we offer
  • Significant equity
  • Strong health & wellness benefits
  • Fulltime
Read More
Arrow Right

Engineering Manager - Machine Learning Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location
Location
United States , San Francisco
Salary
Salary:
241200.00 - 400000.00 USD / Year
plaid.com Logo
Plaid
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 8–10 years of experience in ML infrastructure, including direct hands-on expertise as an engineer, IC/TL
  • 2+ years of experience managing infrastructure or ML platform engineers
  • Proven experience delivering and operating ML or AI infrastructure at scale
  • Solid technical depth across ML/AI infrastructure domains (e.g., feature stores, pipelines, deployment, inference, observability)
  • Demonstrated ability to drive execution on complex technical projects with cross-team stakeholders
  • Strong communication and stakeholder management skills
Job Responsibility
Job Responsibility
  • Lead and support the ML Infra team, driving project execution and ensuring delivery on key commitments
  • Build and launch Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Define and drive adoption of an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines, deployment tooling, and inference systems
  • Partner with ML product teams to understand requirements and deliver solutions that accelerate model development and iteration
  • Recruit, mentor, and develop engineers, fostering a collaborative and high-performing team culture
What we offer
What we offer
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
  • Fulltime
Read More
Arrow Right

Data Infrastructure Engineer

This young, early-stage start-up challenger are currently looking for a hands-on...
Location
Location
United States , New York or DC
Salary
Salary:
Not provided
weareorbis.com Logo
Orbis Consultants
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Startup Energy: You thrive in fast-paced environments, manage ambiguity well, and focus on what moves the needle
  • Designing and deploying intuitive, user-friendly APIs
  • Demonstrated ability to train and deploy models at scale
  • Successfully launching machine learning services, particularly those leveraging LLMs, embeddings, and inference, into production environments
  • Handling and securing large-scale production data
  • Demonstrated proficiency in Python, Go, or C
  • A proactive approach to tackling complex challenges in a fast-paced, early-stage environment
  • A passion for innovation and a collaborative spirit
Job Responsibility
Job Responsibility
  • Developing secure data sharing middleware
  • Integrating software seamlessly into the workflows of specialized professionals, ensuring secure and efficient data access throughout the asset recruitment process
  • Building, shipping and supporting mission critical services in support of the services that make up the Data platform
  • Providing solutions for the full data stack – from the data management, software development and model and deployment lifecycles
What we offer
What we offer
  • Competitive Salary + Equity
  • Fulltime
Read More
Arrow Right

Data Infrastructure Engineer

This young, early-stage start-up challenger is currently looking for a hands-on ...
Location
Location
United States , New York or DC
Salary
Salary:
Not provided
weareorbis.com Logo
Orbis Consultants
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Startup Energy: You thrive in fast-paced environments, manage ambiguity well, and focus on what moves the needle
  • Designing and deploying intuitive, user-friendly APIs
  • Demonstrated ability to train and deploy models at scale
  • Successfully launching machine learning services, particularly those leveraging LLMs, embeddings, and inference, into production environments
  • Handling and securing large-scale production data
  • Demonstrated proficiency in Python, Go, or C
  • A proactive approach to tackling complex challenges in a fast-paced, early-stage environment
  • A passion for innovation and a collaborative spirit
Job Responsibility
Job Responsibility
  • Developing secure data sharing middleware
  • Integrating software seamlessly into the workflows of specialised professionals, ensuring secure and efficient data access throughout the asset recruitment process
  • Providing solutions for the full data stack – from the data management, software development and model and deployment lifecycles
What we offer
What we offer
  • Equity
  • Opportunity to work with an Ambitious, Rapidly-Growing Start-Up
  • Fulltime
Read More
Arrow Right

Senior Software Engineer - Network Enablement (Applied ML)

We build simple yet innovative consumer products and developer APIs that shape h...
Location
Location
United States , San Francisco
Salary
Salary:
180000.00 - 270000.00 USD / Year
plaid.com Logo
Plaid
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Strong software engineering skills including systems design, APIs, and building reliable backend services (Go or Python preferred)
  • Production experience with batch and streaming data pipelines and orchestration tools such as Airflow or Spark
  • Experience building or operating real-time scoring and online feature-serving systems, including feature stores and low-latency model inference
  • Experience integrating model outputs into product flows (APIs, feature flags) and measuring impact through experiments and product metrics
  • Experience with model lifecycle and operations: model registries, CI/CD for models, reproducible training, offline & online parity, monitoring and incident response
Job Responsibility
Job Responsibility
  • Embed model inference into Network Enablement product flows and decision logic (APIs, feature flags, backend flows)
  • Define and instrument product + ML success metrics (fraud reduction, retention lift, false positives, downstream impact)
  • Design and run experiments and rollout plans (backtesting, shadow scoring, A/B tests, feature-flagged releases) to validate product hypotheses
  • Build and operate offline training pipelines and production batch scoring for bank intelligence products
  • Ship and maintain online feature serving and low-latency model inference endpoints for real-time partner/bank scoring
  • Implement model CI/CD, model/version registry, and safe rollout/rollback strategies
  • Monitor model/data health: drift/regression detection, model-quality dashboards, alerts, and SLOs targeted to partner product needs
  • Ensure offline and online parity, data lineage, and automated validation / data contracts to reduce regressions
  • Optimize inference performance and cost for real-time scoring (batching, caching, runtime selection)
  • Ensure fairness, explainability and PII-aware handling for partner-facing ML features
What we offer
What we offer
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
  • Fulltime
Read More
Arrow Right

Senior ML Data Engineer

As a Senior Data Engineer, you will play a pivotal role in our AI/ML workstream,...
Location
Location
Salary
Salary:
Not provided
awin.com Logo
Awin Global
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor or Master’s degree in data science, data engineering, Computer Science with focus on math and statistics / Master’s degree is preferred
  • At least 5 years experience as AI/ML data engineer undertaking above task and accountabilities
  • Strong foundation in computer science principes and statistical methods
  • Strong experience with cloud technology (AWS or Azure)
  • Strong experience with creation of data ingestion pipeline and ET process
  • Strong knowledge of big data tool such as Spark, Databricks and Python
  • Strong understanding of common machine learning techniques and frameworks (e.g. mlflow)
  • Strong knowledge of Natural language processing (NPL) concepts
  • Strong knowledge of scrum practices and agile mindset
Job Responsibility
Job Responsibility
  • Design and maintain scalable data pipelines and storage systems for both agentic and traditional ML workloads
  • Productionise LLM- and agent-based workflows, ensuring reliability, observability, and performance
  • Build and maintain feature stores, vector/embedding stores, and core data assets for ML
  • Develop and manage end-to-end traditional ML pipelines: data prep, training, validation, deployment, and monitoring
  • Implement data quality checks, drift detection, and automated retraining processes
  • Optimise cost, latency, and performance across all AI/ML infrastructure
  • Collaborate with data scientists and engineers to deliver production-ready ML and AI systems
  • Ensure AI/ML systems meet governance, security, and compliance requirements
  • Mentor teams and drive innovation across both agentic and classical ML engineering practices
  • Participate in team meetings and contribute to project planning and strategy discussions
What we offer
What we offer
  • Flexi-Week and Work-Life Balance: We prioritise your mental health and well-being, offering you a flexible four-day Flexi-Week at full pay and with no reduction to your annual holiday allowance. We also offer a variety of different paid special leaves as well as volunteer days
  • Remote Working Allowance: You will receive a monthly allowance to cover part of your running costs. In addition, we will support you in setting up your remote workspace appropriately
  • Pension: Awin offers access to an additional pension insurance to all employees in Germany
  • Flexi-Office: We offer an international culture and flexibility through our Flexi-Office and hybrid/remote work possibilities to work across Awin regions
  • Development: We’ve built our extensive training suite Awin Academy to cover a wide range of skills that nurture you professionally and personally, with trainings conveniently packaged together to support your overall development
  • Appreciation: Thank and reward colleagues by sending them a voucher through our peer-to-peer program
Read More
Arrow Right

Senior Software Engineer - Data Infrastructure

We build the data and machine learning infrastructure to enable Plaid engineers ...
Location
Location
United States , San Francisco
Salary
Salary:
180000.00 - 270000.00 USD / Year
plaid.com Logo
Plaid
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of software engineering experience
  • Extensive hands-on software engineering experience, with a strong track record of delivering successful projects within the Data Infrastructure or Platform domain at similar or larger companies
  • Deep understanding of one of: ML Infrastructure systems, including Feature Stores, Training Infrastructure, Serving Infrastructure, and Model Monitoring OR Data Infrastructure systems, including Data Warehouses, Data Lakehouses, Apache Spark, Streaming Infrastructure, Workflow Orchestration
  • Strong cross-functional collaboration, communication, and project management skills, with proven ability to coordinate effectively
  • Proficiency in coding, testing, and system design, ensuring reliable and scalable solutions
  • Demonstrated leadership abilities, including experience mentoring and guiding junior engineers
Job Responsibility
Job Responsibility
  • Contribute towards the long-term technical roadmap for data-driven and machine learning iteration at Plaid
  • Leading key data infrastructure projects such as improving ML development golden paths, implementing offline streaming solutions for data freshness, building net new ETL pipeline infrastructure, and evolving data warehouse or data lakehouse capabilities
  • Working with stakeholders in other teams and functions to define technical roadmaps for key backend systems and abstractions across Plaid
  • Debugging, troubleshooting, and reducing operational burden for our Data Platform
  • Growing the team via mentorship and leadership, reviewing technical documents and code changes
What we offer
What we offer
  • medical, dental, vision, and 401(k)
  • equity and/or commission
  • Fulltime
Read More
Arrow Right