Rafid Mahmood


2024

pdf bib
Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models
Yuan-Hong Liao | Rafid Mahmood | Sanja Fidler | David Acuna
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Despite recent advances demonstrating vision- language models’ (VLMs) abilities to describe complex relationships among objects in images using natural language, their capability to quantitatively reason about object sizes and distances remains underexplored. In this work, we introduce a manually annotated benchmark of 241 questions across five categories specifically designed for quantitative spatial reasoning, and systematically investigate the performance of SoTA VLMs on this task. Our analysis reveals that questions involving reasoning about distances between objects are particularly challenging for SoTA VLMs; however, some VLMs perform significantly better at this task than others, with an almost 40 points gap between the two best performing models. We also make the surprising observation that the success rate of the top-performing VLM increases by 19 points when a reasoning path using a reference object emerges naturally in the response. Inspired by this observation, we develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using references objects as visual cues. Specifically, we demonstrate that instruct- ing VLMs to use reference objects in their reasoning paths significantly improves their quantitative spatial reasoning performance, bypassing the need for external data, architectural modifications, or fine-tuning. Remarkably, by solely using SpatialPrompt, Gemini 1.5 Pro, GPT-4V, and GPT-4o improve by 56.2, 28.5, and 6.7 points on average in Q-Spatial Bench without the need for more data, model architectural modifications, or fine-tuning.