@inproceedings{zhou-etal-2025-glimpse,
title = "{GLIMPSE}: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?",
author = "Zhou, Yiyang and
Li, Linjie and
Qiu, Shi and
Yang, Zhengyuan and
Zhao, Yuyang and
Han, Siwei and
He, Yangfan and
Li, Kangqi and
Ji, Haonian and
Zhao, Zihao and
Tong, Haibo and
Wang, Lijuan and
Yao, Huaxiu",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1415/",
pages = "27830--27844",
ISBN = "979-8-89176-332-6",
abstract = "Existing video benchmarks often resemble image-based benchmarks, with question types like ``What actions does the person perform throughout the video?'' or ``What color is the woman{'}s dress in the video?'' For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce , a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over full video context{---}this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, achieves 94.82{\%} accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43{\%}, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos. We publicly release our benchmark and code at https://github.com/aiming-lab/GLIMPSE."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="zhou-etal-2025-glimpse">
<titleInfo>
<title>GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yiyang</namePart>
<namePart type="family">Zhou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Linjie</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shi</namePart>
<namePart type="family">Qiu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhengyuan</namePart>
<namePart type="family">Yang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yuyang</namePart>
<namePart type="family">Zhao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Siwei</namePart>
<namePart type="family">Han</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yangfan</namePart>
<namePart type="family">He</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kangqi</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Haonian</namePart>
<namePart type="family">Ji</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zihao</namePart>
<namePart type="family">Zhao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Haibo</namePart>
<namePart type="family">Tong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lijuan</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Huaxiu</namePart>
<namePart type="family">Yao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Christos</namePart>
<namePart type="family">Christodoulopoulos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Carolyn</namePart>
<namePart type="family">Rose</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Violet</namePart>
<namePart type="family">Peng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-332-6</identifier>
</relatedItem>
<abstract>Existing video benchmarks often resemble image-based benchmarks, with question types like “What actions does the person perform throughout the video?” or “What color is the woman’s dress in the video?” For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce , a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over full video context—this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, achieves 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos. We publicly release our benchmark and code at https://github.com/aiming-lab/GLIMPSE.</abstract>
<identifier type="citekey">zhou-etal-2025-glimpse</identifier>
<location>
<url>https://aclanthology.org/2025.emnlp-main.1415/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>27830</start>
<end>27844</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
%A Zhou, Yiyang
%A Li, Linjie
%A Qiu, Shi
%A Yang, Zhengyuan
%A Zhao, Yuyang
%A Han, Siwei
%A He, Yangfan
%A Li, Kangqi
%A Ji, Haonian
%A Zhao, Zihao
%A Tong, Haibo
%A Wang, Lijuan
%A Yao, Huaxiu
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-332-6
%F zhou-etal-2025-glimpse
%X Existing video benchmarks often resemble image-based benchmarks, with question types like “What actions does the person perform throughout the video?” or “What color is the woman’s dress in the video?” For these, models can often answer by scanning just a few key frames, without deep temporal reasoning. This limits our ability to assess whether large vision-language models (LVLMs) can truly think with videos rather than perform superficial frame-level analysis. To address this, we introduce , a benchmark specifically designed to evaluate whether LVLMs can genuinely think with videos. Unlike prior benchmarks, emphasizes comprehensive video understanding beyond static image cues. It consists of 3,269 videos and over 4,342 highly visual-centric questions across 11 categories, including Trajectory Analysis, Temporal Reasoning, and Forensics Detection. All questions are carefully crafted by human annotators and require watching the entire video and reasoning over full video context—this is what we mean by thinking with video. These questions cannot be answered by scanning selected frames or relying on text alone. In human evaluations, achieves 94.82% accuracy, but current LVLMs face significant challenges. Even the best-performing model, GPT-o3, reaches only 66.43%, highlighting that LVLMs still struggle to move beyond surface-level reasoning to truly think with videos. We publicly release our benchmark and code at https://github.com/aiming-lab/GLIMPSE.
%U https://aclanthology.org/2025.emnlp-main.1415/
%P 27830-27844
Markdown (Informal)
[GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?](https://aclanthology.org/2025.emnlp-main.1415/) (Zhou et al., EMNLP 2025)
ACL
- Yiyang Zhou, Linjie Li, Shi Qiu, Zhengyuan Yang, Yuyang Zhao, Siwei Han, Yangfan He, Kangqi Li, Haonian Ji, Zihao Zhao, Haibo Tong, Lijuan Wang, and Huaxiu Yao. 2025. GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 27830–27844, Suzhou, China. Association for Computational Linguistics.