GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning

Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Deli Zhao, Anh Tuan Luu, Yu Rong


Abstract
Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet their impact on multimodal LLMs (MLLMs) remains limited. In vision-intensive tasks such as geometric reasoning in particular, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to a perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this bottleneck, we design the Geo-Perception Question-Answering (GeoPQA) benchmark, which targets basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain the RL reward signals available for training. To address this bottleneck, we propose a two-stage RL training framework that first enhances the visual perception of geometric structures and then fosters reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1% over the direct reasoning training approach. Our method also generalizes to other vision-intensive domains such as figure understanding, highlighting the importance of perceptual grounding for effective MLLM reasoning.
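The abstract describes the two-stage framework only at a high level. As a rough illustration of the training schedule it implies (perception-focused rewards before reasoning-focused rewards), here is a minimal Python sketch. The reward definitions, data fields, and function names (perception_reward, reasoning_reward, train_stage) are assumptions for illustration and are not taken from the paper; a real implementation would plug these rewards into an RL policy-gradient loop over an MLLM.

```python
# Minimal sketch (not the authors' code) of a two-stage RL reward schedule:
# stage 1 rewards correct perception of the geometric structure, stage 2
# rewards correct final answers. All names here are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    image_id: str
    question: str
    gold_structure: str  # e.g. "triangle ABC with a right angle at B"
    gold_answer: str     # e.g. "12"


def perception_reward(prediction: str, sample: Sample) -> float:
    """Stage 1: reward describing the geometric structure correctly."""
    return 1.0 if sample.gold_structure.lower() in prediction.lower() else 0.0


def reasoning_reward(prediction: str, sample: Sample) -> float:
    """Stage 2: reward producing the correct final answer."""
    return 1.0 if prediction.strip().endswith(sample.gold_answer) else 0.0


def train_stage(policy: Callable[[Sample], str],
                data: List[Sample],
                reward_fn: Callable[[str, Sample], float],
                steps: int) -> float:
    """Placeholder RL loop: generate, score, and track mean reward.
    A real implementation would apply a policy-gradient update per step."""
    total = 0.0
    for step in range(steps):
        sample = data[step % len(data)]
        prediction = policy(sample)
        total += reward_fn(prediction, sample)
        # policy_update(policy, prediction, reward)  # omitted in this sketch
    return total / steps


def two_stage_training(policy, data, stage1_steps=100, stage2_steps=100):
    """Run perception-focused RL first, then reasoning-focused RL."""
    r1 = train_stage(policy, data, perception_reward, stage1_steps)
    r2 = train_stage(policy, data, reasoning_reward, stage2_steps)
    return r1, r2
```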
Anthology ID:
2025.findings-emnlp.1400
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
25680–25688
URL:
https://aclanthology.org/2025.findings-emnlp.1400/
Cite (ACL):
Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Deli Zhao, Anh Tuan Luu, and Yu Rong. 2025. GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 25680–25688, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning (Chen et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.1400.pdf
Checklist:
2025.findings-emnlp.1400.checklist.pdf