Beyond Single Frames: Can LMMs Comprehend Implicit Narratives in Comic Strips?
Xiaochen Wang | Heming Xia | Jialin Song | Longyu Guan | Qingxiu Dong | Rui Li | Yixin Yang | Yifan Pu | Weiyao Luo | Yiru Wang | Xiangdi Meng | Wenjie Li | Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Multimodal Models (LMMs) have demonstrated strong performance on vision-language benchmarks, yet current evaluations predominantly focus on single-image reasoning. In contrast, real-world scenarios often involve understanding sequences of images. A typical case is comic strip understanding, which requires models to perform nuanced visual reasoning beyond surface-level recognition. To address this gap, we introduce STRIPCIPHER, a benchmark designed to evaluate models' ability to understand implicit narratives in silent comics. STRIPCIPHER is a high-quality, human-annotated dataset featuring fine-grained annotations and comprehensive coverage of varying difficulty levels. It comprises three tasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Notably, evaluation results on STRIPCIPHER reveal a significant gap between current LMMs and human performance; for example, GPT-4o achieves only 23.93% accuracy on the reordering task, 56.07% below human levels. These findings underscore the limitations of current LMMs in implicit visual narrative understanding and highlight opportunities for advancing sequential multimodal reasoning.