Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models

Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, David Acuna


Abstract
Despite recent advances demonstrating vision-language models' (VLMs) abilities to describe complex relationships among objects in images using natural language, their capability to quantitatively reason about object sizes and distances remains underexplored. In this work, we introduce a manually annotated benchmark of 241 questions across five categories specifically designed for quantitative spatial reasoning, and systematically investigate the performance of SoTA VLMs on this task. Our analysis reveals that questions involving reasoning about distances between objects are particularly challenging for SoTA VLMs; however, some VLMs perform significantly better at this task than others, with an almost 40-point gap between the two best-performing models. We also make the surprising observation that the success rate of the top-performing VLM increases by 19 points when a reasoning path using a reference object emerges naturally in the response. Inspired by this observation, we develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues. Specifically, we demonstrate that instructing VLMs to use reference objects in their reasoning paths significantly improves their quantitative spatial reasoning performance, bypassing the need for external data, architectural modifications, or fine-tuning. Remarkably, using SpatialPrompt alone, Gemini 1.5 Pro, GPT-4V, and GPT-4o improve by 56.2, 28.5, and 6.7 points on average on Q-Spatial Bench.
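To make the idea concrete, the following is a minimal, hypothetical sketch of a SpatialPrompt-style system instruction sent through the OpenAI Python client. The prompt wording, the function name ask_quantitative_spatial, and the example image and question are illustrative assumptions for this sketch, not the exact prompt or evaluation code from the paper.

import base64
from openai import OpenAI  # official openai-python client

# Illustrative reference-object instruction (not the paper's exact wording):
# steer the VLM toward a reasoning path that uses a known-size object as a ruler.
SPATIAL_PROMPT = (
    "Answer the question about sizes or distances in the image. "
    "First, identify a reference object whose real-world size you know "
    "(e.g., a door, a car, a person). Then use that reference object as "
    "a visual ruler to estimate the requested quantity, and give your "
    "final answer with a unit."
)

def ask_quantitative_spatial(client: OpenAI, image_path: str, question: str) -> str:
    """Ask a quantitative spatial question with a reference-object instruction."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SPATIAL_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            },
        ],
    )
    return response.choices[0].message.content

# Example usage (hypothetical image and question):
# client = OpenAI()
# print(ask_quantitative_spatial(client, "kitchen.jpg",
#                                "How far is the refrigerator from the stove?"))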
Anthology ID:
2024.emnlp-main.947
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
17028–17047
URL:
https://aclanthology.org/2024.emnlp-main.947
Cite (ACL):
Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, and David Acuna. 2024. Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17028–17047, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models (Liao et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.947.pdf