Reality Labs is building the future of connection through world-class AR/VR hardware and software. The XR Tech AIX (AI Experiences) team is developing cutting-edge real-time AI systems that power next-generation communication experiences. We are creating intelligent agents that seamlessly interface with fine-tuned foundation models to enable rich, real-time interactions in video calling and telepresence scenarios.

We are seeking an exceptional Research Scientist Intern to join our team and contribute to the development of real-time multimodal AI systems. This role focuses on fine-tuning and optimizing large foundation models, particularly vision-language models, for real-time agent-based applications. You will work at the intersection of multimodal learning, real-time systems, and agentic AI.

Our internships are twelve (12) to twenty-four (24) weeks long with a flexible summer start date.
Job Responsibilities:
Research and develop novel approaches for fine-tuning large multimodal foundation models (vision-language, audio-visual) for real-time applications
Design and implement efficient inference pipelines for deploying fine-tuned models in real-time communication scenarios
Explore agentic architectures that leverage fine-tuned models as tools within larger AI systems
Collaborate with cross-functional teams to integrate models into prototype experiences
Document and present research progress with the goal of publishing findings at top-tier ML/CV conferences
Contribute to building working prototypes that demonstrate the capabilities of fine-tuned multimodal models
Requirements:
Currently has, or is in the process of obtaining, a PhD degree in Computer Science, Machine Learning, Electrical Engineering, or a related field
2+ years of research experience in one or more of the following areas: multimodal learning, vision-language models, large language models, or foundation model fine-tuning
Hands-on experience fine-tuning large foundation models (e.g., LLaVA, InternVL, Qwen-VL, LLaMA, or similar)
Strong programming skills in Python
Experience with deep learning frameworks such as PyTorch
Excellent communication skills and ability to work independently
Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment
Nice to have:
Proven track record of achieving significant results as demonstrated by first-authored publications at leading conferences such as NeurIPS, ICML, ICLR, CVPR, ICCV, ECCV, ICASSP, Interspeech, ACL, EMNLP, or similar
Experience with speech-to-speech LLMs or audio-visual foundation models
Familiarity with real-time communication systems (e.g., LiveKit, WebRTC) or low-latency inference optimization
Experience with cloud infrastructure (AWS) and containerization (Docker)
Experience with parameter-efficient fine-tuning techniques (LoRA, QLoRA, adapters, etc.)
Experience with agentic AI systems, tool-use, or function-calling in LLMs
Demonstrated software engineering experience via internships, work experience, or contributions to open source repositories (e.g., GitHub)
Intent to return to degree program after completion of the internship