BibTeX
@inproceedings{shao-etal-2025-cognition,
    title = "Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding",
    author = "Shao, Zirui and
      Gao, Feiyu and
      Zhu, Zhaoqing and
      Luo, Chuwei and
      Xing, Hangdi and
      Yu, Zhi and
      Zheng, Qi and
      Yan, Ming and
      Bu, Jiajun",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1574/",
    pages = "30911--30932",
    ISBN = "979-8-89176-332-6",
    abstract = "Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, due to different types of annotation noise in training, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it ``sees'' and what it ``understands''. Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C{\&}P) knowledge conflicts, a form of multimodal knowledge conflicts, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 75.26{\%} C{\&}P consistency. To mitigate the C{\&}P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. Our method reduces C{\&}P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks."
}

MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="shao-etal-2025-cognition">
    <titleInfo>
      <title>Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Zirui</namePart>
      <namePart type="family">Shao</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Feiyu</namePart>
      <namePart type="family">Gao</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Zhaoqing</namePart>
      <namePart type="family">Zhu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Chuwei</namePart>
      <namePart type="family">Luo</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Hangdi</namePart>
      <namePart type="family">Xing</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Zhi</namePart>
      <namePart type="family">Yu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Qi</namePart>
      <namePart type="family">Zheng</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Ming</namePart>
      <namePart type="family">Yan</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Jiajun</namePart>
      <namePart type="family">Bu</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Christos</namePart>
        <namePart type="family">Christodoulopoulos</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Tanmoy</namePart>
        <namePart type="family">Chakraborty</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Carolyn</namePart>
        <namePart type="family">Rose</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Violet</namePart>
        <namePart type="family">Peng</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Suzhou, China</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-332-6</identifier>
    </relatedItem>
    <abstract>Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, due to different types of annotation noise in training, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it “sees” and what it “understands”. Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&amp;P) knowledge conflicts, a form of multimodal knowledge conflicts, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 75.26% C&amp;P consistency. To mitigate the C&amp;P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. Our method reduces C&amp;P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks.</abstract>
    <identifier type="citekey">shao-etal-2025-cognition</identifier>
    <location>
      <url>https://aclanthology.org/2025.emnlp-main.1574/</url>
    </location>
    <part>
      <date>2025-11</date>
      <extent unit="page">
        <start>30911</start>
        <end>30932</end>
      </extent>
    </part>
  </mods>
</modsCollection>

Endnote
%0 Conference Proceedings
%T Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding
%A Shao, Zirui
%A Gao, Feiyu
%A Zhu, Zhaoqing
%A Luo, Chuwei
%A Xing, Hangdi
%A Yu, Zhi
%A Zheng, Qi
%A Yan, Ming
%A Bu, Jiajun
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-332-6
%F shao-etal-2025-cognition
%X Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, due to different types of annotation noise in training, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it “sees” and what it “understands”. Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts, a form of multimodal knowledge conflicts, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 75.26% C&P consistency. To mitigate the C&P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. Our method reduces C&P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks.
%U https://aclanthology.org/2025.emnlp-main.1574/
%P 30911-30932

Markdown (Informal)
[Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding](https://aclanthology.org/2025.emnlp-main.1574/) (Shao et al., EMNLP 2025)

ACL
Zirui Shao, Feiyu Gao, Zhaoqing Zhu, Chuwei Luo, Hangdi Xing, Zhi Yu, Qi Zheng, Ming Yan, and Jiajun Bu. 2025. Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 30911–30932, Suzhou, China. Association for Computational Linguistics.
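
The abstract above quantifies Cognition and Perception (C&P) consistency (e.g., 75.26% for GPT-4o) but, being a citation record, does not reproduce the paper's assessment protocol. As a purely illustrative sketch, the Python below scores agreement between a model's VQA answer (cognition) and its own OCR output (perception) with a naive substring heuristic; the helper names and the heuristic itself are assumptions made for this sketch, not the authors' method.

# Illustrative only: the paper defines C&P consistency precisely; this toy
# substring check is an assumption, not the authors' protocol. It flags a
# conflict when the VQA answer (cognition) cannot be found in the text the
# same model read out via OCR (perception).

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching is not trivially brittle."""
    return " ".join(text.lower().split())

def is_cp_consistent(vqa_answer: str, ocr_text: str) -> bool:
    """Hypothetical check: does the cognitive answer appear in the perceptual output?"""
    return normalize(vqa_answer) in normalize(ocr_text)

def cp_consistency_rate(pairs: list[tuple[str, str]]) -> float:
    """Percentage of (vqa_answer, ocr_text) pairs judged consistent."""
    if not pairs:
        return 0.0
    hits = sum(is_cp_consistent(answer, ocr) for answer, ocr in pairs)
    return 100.0 * hits / len(pairs)

# Toy usage: one consistent pair, one conflicting pair -> 50.0
pairs = [
    ("$1,200", "Invoice total: $1,200 due March 2025"),  # answer visible in OCR
    ("$1,500", "Invoice total: $1,200 due March 2025"),  # answer conflicts with OCR
]
print(cp_consistency_rate(pairs))  # 50.0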