The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights

Yufang Liu; Yao Du; Tao Ji; Jianing Wang; Yang Liu; Yuanbin Wu; Aimin Zhou; Mengdi Zhang; Xunliang Cai

doi:10.18653/v1/2025.acl-long.1102

Abstract

Recent research has increasingly focused on multimodal mathematical reasoning, particularly emphasizing the creation of relevant datasets and benchmarks. Despite this, the role of visual information in reasoning has been underexplored. Our findings show that existing multimodal mathematical models minimally leverage visual information, and model performance remains largely unaffected by changes to or removal of images in the dataset. We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers. To improve evaluation methods, we introduce the HC-M3D dataset, specifically designed to require image reliance for problem-solving and to challenge models with similar, yet distinct, images that change the correct answer. When tested, leading models fail to detect these subtle visual differences, which suggests limitations in current visual perception capabilities. Additionally, we observe that the common approach of improving general VQA capabilities by combining various types of image encoders does not contribute to math reasoning performance. This finding also presents a challenge to enhancing visual reliance during math reasoning.

Anthology ID: 2025.acl-long.1102
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 22596–22611
URL: https://aclanthology.org/2025.acl-long.1102/
DOI: 10.18653/v1/2025.acl-long.1102
Cite (ACL): Yufang Liu, Yao Du, Tao Ji, Jianing Wang, Yang Liu, Yuanbin Wu, Aimin Zhou, Mengdi Zhang, and Xunliang Cai. 2025. The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22596–22611, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights (Liu et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1102.pdf
