ViLL-E: Video LLM Embeddings for Retrieval

Rohit Gupta; Jayakrishnan Unnikrishnan; Fan Fei; Sheng Liu; Son Tran; Mubarak Shah

Abstract

Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage methodology combining generative and contrastive learning: initial large-scale pre-training on video-caption pairs, followed by continual training on a smaller, detailed-caption dataset, and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (by 7% on average over other VideoLLMs) and video retrieval (by up to 4% over dual encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).
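
To make the idea of joint contrastive-generative training concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the loss functions, so this assumes a symmetric InfoNCE loss over paired video and text embeddings plus an averaged next-token negative log-likelihood; the function names, the `temperature` value, and the `contrastive_weight` parameter are all hypothetical.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    # Numerically stable log-softmax over a list of scores.
    top = max(row)
    log_z = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    # Assumed symmetric InfoNCE: the i-th video and i-th text form the
    # positive pair; every other item in the batch acts as a negative.
    v = [_normalize(e) for e in video_embs]
    t = [_normalize(e) for e in text_embs]
    n = len(v)
    sims = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
            for vi in v]
    video_to_text = -sum(_log_softmax(sims[i])[i] for i in range(n)) / n
    columns = [[sims[i][j] for i in range(n)] for j in range(n)]
    text_to_video = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (video_to_text + text_to_video)


def generative_loss(token_log_probs):
    # Average negative log-likelihood of the target caption or answer tokens.
    return -sum(token_log_probs) / len(token_log_probs)


def joint_loss(video_embs, text_embs, token_log_probs, contrastive_weight=1.0):
    # Hypothetical weighting: the abstract does not say how the two
    # objectives are balanced.
    return (generative_loss(token_log_probs)
            + contrastive_weight * contrastive_loss(video_embs, text_embs))
```

Under these assumptions, a batch whose video and text embeddings are correctly paired yields a smaller contrastive term than the same batch with the pairs shuffled, which is the signal that would push the embeddings toward being useful for retrieval.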

Anthology ID: 2026.acl-long.2003
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 43239–43258
URL: https://aclanthology.org/2026.acl-long.2003/
Cite (ACL): Rohit Gupta, Jayakrishnan Unnikrishnan, Fan Fei, Sheng Liu, Son Tran, and Mubarak Shah. 2026. ViLL-E: Video LLM Embeddings for Retrieval. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 43239–43258, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): ViLL-E: Video LLM Embeddings for Retrieval (Gupta et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.2003.pdf
Checklist: 2026.acl-long.2003.checklist.pdf
