Beyond Single Frames: Can LMMs Comprehend Implicit Narratives in Comic Strip?

Xiaochen Wang; Heming Xia; Jialin Song; Longyu Guan; Qingxiu Dong; Rui Li; Yixin Yang; Yifan Pu; Weiyao Luo; Yiru Wang; Xiangdi Meng; Wenjie Li; Zhifang Sui

doi:10.18653/v1/2025.findings-emnlp.342

Abstract

Large Multimodal Models (LMMs) have demonstrated strong performance on vision-language benchmarks, yet current evaluations predominantly focus on single-image reasoning. In contrast, real-world scenarios often involve understanding sequences of images. A typical example is comic strip understanding, which requires models to perform nuanced visual reasoning beyond surface-level recognition. To address this gap, we introduce STRIPCIPHER, a benchmark designed to evaluate a model's ability to understand implicit narratives in silent comics. STRIPCIPHER is a high-quality, human-annotated dataset featuring fine-grained annotations and comprehensive coverage of varying difficulty levels. It comprises three tasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Notably, evaluation results on STRIPCIPHER reveal a significant gap between current LMMs and human performance: for example, GPT-4o achieves only 23.93% accuracy in the reordering task, 56.07% below human levels. These findings underscore the limitations of current LMMs in implicit visual narrative understanding and highlight opportunities for advancing sequential multimodal reasoning.

Anthology ID: 2025.findings-emnlp.342
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6436–6452
URL: https://aclanthology.org/2025.findings-emnlp.342/
DOI: 10.18653/v1/2025.findings-emnlp.342
Cite (ACL): Xiaochen Wang, Heming Xia, Jialin Song, Longyu Guan, Qingxiu Dong, Rui Li, Yixin Yang, Yifan Pu, Weiyao Luo, Yiru Wang, Xiangdi Meng, Wenjie Li, and Zhifang Sui. 2025. Beyond Single Frames: Can LMMs Comprehend Implicit Narratives in Comic Strip? In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 6436–6452, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Beyond Single Frames: Can LMMs Comprehend Implicit Narratives in Comic Strip? (Wang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.342.pdf
Checklist: 2025.findings-emnlp.342.checklist.pdf
