MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification

Linzhuang Sun; Hao Liang; Jingxuan Wei; Bihui Yu; Tianpeng Li; Fan Yang; Zenan Zhou; Wentao Zhang

doi:10.18653/v1/2025.acl-long.689

Abstract

Work on test-time scaling has shown that combining external slow-thinking with a verification mechanism enhances multi-round reasoning in large language models (LLMs). In the multimodal (MM) domain, however, a strong MM verifier is still lacking. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method that combines simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (CoT) data. This data is then used to fine-tune the verification model, MM-Verifier. Second, we present a more efficient method for synthesizing MM CoT data, bridging the gap between text-based and multimodal reasoning; the synthesized data is used to fine-tune MM-Reasoner. MM-Verifier outperforms all larger models on the MathCheck, MathVista, and MathVerse benchmarks. MM-Reasoner demonstrates strong effectiveness and scalability, with performance improving as data size increases. Finally, combining MM-Reasoner and MM-Verifier with 12 rollouts reaches an accuracy of 65.3 on MathVista, surpassing GPT-4o (63.8).
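
The abstract does not spell out how the reasoner and verifier are combined at inference time. One common reading of "verification over several rollouts" is best-of-N selection: sample N reasoning chains, score each with a verifier, and keep the highest-scoring one. The sketch below illustrates that general idea only; `Rollout`, `select_best`, and `toy_verify` are hypothetical names, and the toy scoring rule stands in for a fine-tuned verifier model. It should not be taken as the paper's actual procedure.

```python
from typing import Callable, List, NamedTuple


class Rollout(NamedTuple):
    """One sampled reasoning chain (hypothetical container, not from the paper)."""
    reasoning: str  # chain-of-thought text produced by the reasoner
    answer: str     # final answer extracted from that reasoning


def select_best(rollouts: List[Rollout],
                verify: Callable[[Rollout], float]) -> Rollout:
    """Return the rollout with the highest verifier score.

    On ties, the earliest rollout wins, because max() keeps the first
    maximal element it encounters.
    """
    if not rollouts:
        raise ValueError("need at least one rollout")
    return max(rollouts, key=verify)


def toy_verify(rollout: Rollout) -> float:
    # Assumption: a stand-in scoring rule for illustration only.
    # A real verifier would be a model that judges the reasoning chain.
    return 1.0 if "check" in rollout.reasoning else 0.0


rollouts = [
    Rollout("guess from the figure", "7"),
    Rollout("compute the area, then check the units", "12"),
    Rollout("guess again", "9"),
]
best = select_best(rollouts, toy_verify)
print(best.answer)
```

In this framing, the number of rollouts (12 in the reported MathVista result) is simply the length of the `rollouts` list; a larger list gives the verifier more candidates to choose from at a higher inference cost.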

Anthology ID: 2025.acl-long.689
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 14100–14115
URL: https://aclanthology.org/2025.acl-long.689/
DOI: 10.18653/v1/2025.acl-long.689
Cite (ACL): Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Tianpeng Li, Fan Yang, Zenan Zhou, and Wentao Zhang. 2025. MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14100–14115, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification (Sun et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.689.pdf
