Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, Lifu Huang


Abstract
Despite their impressive performance in coarse-grained video understanding, Video Large Language Models (Video-LLMs) still face challenges in fine-grained temporal grounding, including ineffective temporal modeling and inadequate timestamp representations. In this work, we introduce Grounded-VideoLLM, a novel Video-LLM designed to perceive and reason over specific video moments with fine-grained temporal precision. Our model features (1) a two-stream encoder that explicitly captures inter-frame relationships while preserving intra-frame visual details and (2) discrete temporal tokens enriched with structured time knowledge for timestamp representation. In addition, we propose a multi-stage training strategy tailored to this grounding-specific architecture: the model is initially trained on simple video-caption tasks and progressively introduced to complex video temporal grounding tasks, ensuring a smooth learning curve and temporal alignment. We further strengthen Grounded-VideoLLM’s temporal reasoning by constructing a VideoQA dataset with grounded information using an automated annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only surpasses existing models in fine-grained grounding tasks but also exhibits strong potential as a general video understanding assistant.
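To make the discrete-temporal-token idea concrete, the following is a minimal Python sketch, not the authors' implementation: a continuous timestamp is quantized into one of a fixed number of time bins, each bin corresponding to a special token in the vocabulary. The token naming scheme ("<T_i>"), the bin count, and the helper functions are illustrative assumptions; the paper's actual tokenization may differ.

# Minimal sketch: representing timestamps as discrete temporal tokens.
# ASSUMPTIONS: the token vocabulary "<T_0>" .. "<T_{num_bins-1}>" and the
# bin count are illustrative, not the paper's exact scheme.

def timestamp_to_token(t_sec: float, duration_sec: float, num_bins: int = 100) -> str:
    """Quantize an absolute timestamp into one of num_bins discrete tokens."""
    frac = min(max(t_sec / duration_sec, 0.0), 1.0)  # normalize and clamp to [0, 1]
    idx = min(int(frac * num_bins), num_bins - 1)    # bin index in [0, num_bins)
    return f"<T_{idx}>"

def token_to_timestamp(token: str, duration_sec: float, num_bins: int = 100) -> float:
    """Map a temporal token back to the center of its time bin (in seconds)."""
    idx = int(token.strip("<>").split("_")[1])
    return (idx + 0.5) / num_bins * duration_sec

# Example: in a 120 s video, the moment at 45.2 s falls into bin 37.
print(timestamp_to_token(45.2, 120.0))      # -> "<T_37>"
print(token_to_timestamp("<T_37>", 120.0))  # -> 45.0

Normalizing by the video duration keeps the same fixed token vocabulary usable for videos of any length, at the cost of a quantization error bounded by half a bin width.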
Anthology ID:
2025.findings-emnlp.50
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
959–975
URL:
https://aclanthology.org/2025.findings-emnlp.50/
Cite (ACL):
Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, and Lifu Huang. 2025. Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 959–975, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models (Wang et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.50.pdf
Checklist:
2025.findings-emnlp.50.checklist.pdf