You will be a core contributor on Zyphra’s Vision Team, building the next generation of vision-language models that can understand natural scenes, with a focus on web, desktop, and mobile UIs. You will be deeply involved in the entire model training process, from data gathering and processing to designing novel architectures and training methodologies.
Job Responsibilities:
Build the next generation of vision-language models that can understand natural scenes, with a focus on web, desktop, and mobile UIs
Be deeply involved in the entire model training process, from data gathering and processing to designing novel architectures and training methodologies
Work across:
Large-scale vision encoder and vision-language training runs
Performance optimization of our training stack
Image and video dataset collection, processing, and evaluation
Architecture and training methodology ablations and improvements
Requirements:
Strong research taste and intuition
Strong implementation and prototyping ability
The ability to work well and cooperate with others in a fast-paced research setting
Willingness to work in person at our office in Palo Alto
Authorization to work in the US
Nice to have:
Experience with training and evaluating vision-language models
Experience with creating and collecting large-scale machine learning datasets, especially in the visual modality
Experience with training vision encoders using contrastive learning or other methods
Experience with supervised finetuning, preference learning, and reinforcement learning methods
A good intuitive ability to understand model behaviours and correct them through iterative finetuning
Interest in grappling with data in detail and spending significant time on data engineering and synthetic data generation