TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos

Fawad Javed Fateh; Umer Ahmed; Hamza Khan; Zeeshan Zia; Quoc-Huy Tran

Abstract

We introduce TemporalVLM, a video large language model (video LLM) for temporal reasoning and fine-grained understanding in long videos. Our approach includes a visual encoder that maps a long video into time-aware features containing both local and global cues. The encoder first divides the input video into short-term clips, which are jointly encoded with their timestamps and fused across overlapping temporal windows into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. To facilitate the evaluation of TemporalVLM, we also present IndustryASM, a large-scale long video dataset of industrial assembly processes, consisting of videos recorded on factory floors with actions and timestamps annotated by industrial engineers for time and motion studies and temporal action segmentation evaluation. Finally, extensive experiments show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, namely dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation.
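
As a rough illustration of the first encoder stage described in the abstract (dividing a video into short clips with timestamps and grouping them into overlapping temporal windows), the following is a minimal sketch and not the authors' implementation. The abstract does not give the clip length, window size, or stride, so those parameters are assumptions, as are the hypothetical names `Clip`, `split_into_clips`, and `overlapping_windows`. Feature encoding, fusion, and the BiLSTM stage are deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    index: int      # position of the clip in the video
    start_s: float  # clip start timestamp, in seconds
    end_s: float    # clip end timestamp, in seconds


def split_into_clips(duration_s: float, clip_len_s: float) -> List[Clip]:
    """Divide a video into consecutive short clips (clip length is an assumption)."""
    if duration_s <= 0 or clip_len_s <= 0:
        raise ValueError("duration_s and clip_len_s must be positive")
    clips: List[Clip] = []
    i = 0
    while i * clip_len_s < duration_s:
        start = i * clip_len_s
        end = min(start + clip_len_s, duration_s)  # last clip may be shorter
        clips.append(Clip(i, start, end))
        i += 1
    return clips


def overlapping_windows(clips: List[Clip], window: int, stride: int) -> List[List[Clip]]:
    """Group clips into windows of `window` clips; stride < window yields overlap."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(clips) <= window:
        return [clips]
    last_start = len(clips) - window
    starts = list(range(0, last_start + 1, stride))
    # Add a final window if the stride would otherwise skip the end of the video.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [clips[s:s + window] for s in starts]


if __name__ == "__main__":
    clips = split_into_clips(duration_s=10.0, clip_len_s=2.0)
    for w in overlapping_windows(clips, window=3, stride=2):
        print([(c.start_s, c.end_s) for c in w])
```

In a full pipeline, each window of clips and timestamps would then be encoded into local features and the resulting sequence passed to a sequence model such as a BiLSTM; that part depends on details the abstract does not specify.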

Anthology ID: 2026.findings-acl.70
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1427–1447
URL: https://aclanthology.org/2026.findings-acl.70/
Cite (ACL): Fawad Javed Fateh, Umer Ahmed, Hamza Khan, Zeeshan Zia, and Quoc-Huy Tran. 2026. TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos. In Findings of the Association for Computational Linguistics: ACL 2026, pages 1427–1447, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos (Fateh et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.70.pdf
Checklist: 2026.findings-acl.70.checklist.pdf
