Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers

Georgios Pantazopoulos, Alessandro Suglia, Oliver Lemon, Arash Eshghi


Abstract
An effective method for combining frozen large language models (LLMs) and visual encoders involves a resampler module that creates a ‘visual prompt’, which is provided to the LLM along with the textual prompt. While this approach has enabled impressive performance across many coarse-grained tasks such as image captioning and visual question answering, more fine-grained tasks that require spatial understanding have not been thoroughly examined. In this paper, we use diagnostic classifiers to measure the extent to which the visual prompt produced by the resampler encodes spatial information. Our results show that this information is largely absent from the resampler output when the resampler is kept frozen while the classifiers are trained. However, when the resampler and classifier are trained jointly, we observe a significant performance boost. This shows that the compression achieved by resamplers can in principle encode the requisite spatial information, but that more object-aware objectives are needed at the pretraining stage to facilitate this capability.
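The contrast between probing a frozen resampler and training resampler and classifier jointly can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration and is not the paper's setup: the "resampler" is a single linear projection that compresses a two-feature input (appearance, position) into one scalar, the "diagnostic classifier" is a logistic-regression probe on that scalar, and the names `make_data`, `train_probe`, and `accuracy` are hypothetical.

```python
import math
import random


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def make_data(n, seed):
    # Toy inputs (appearance, position); the label depends only on position.
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        data.append((x, 1 if x[1] > 0 else 0))
    return data


def train_probe(data, train_resampler, epochs=100, lr=0.05):
    # Hypothetical "resampler": z = w . x, initialised to keep only appearance.
    w = [1.0, 0.0]
    # Linear probe on the compressed scalar z.
    a, b = 0.1, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = w[0] * x[0] + w[1] * x[1]
            g = sigmoid(a * z + b) - y  # gradient of cross-entropy w.r.t. the logit
            grad_a, grad_b = g * z, g
            grad_w = (g * a * x[0], g * a * x[1])
            a -= lr * grad_a
            b -= lr * grad_b
            if train_resampler:
                # Joint training: gradients also flow into the resampler.
                w[0] -= lr * grad_w[0]
                w[1] -= lr * grad_w[1]
    return w, a, b


def accuracy(params, data):
    w, a, b = params
    hits = sum(
        ((a * (w[0] * x[0] + w[1] * x[1]) + b) > 0) == (y == 1) for x, y in data
    )
    return hits / len(data)


train, test = make_data(200, seed=0), make_data(400, seed=1)
frozen_acc = accuracy(train_probe(train, train_resampler=False), test)
joint_acc = accuracy(train_probe(train, train_resampler=True), test)
print("frozen resampler probe accuracy:", frozen_acc)
print("jointly trained probe accuracy:", joint_acc)
```

In this toy setting the frozen projection discards the position feature, so the probe has nothing to recover, while joint training lets the projection learn to keep it. This only mirrors the qualitative pattern described in the abstract and says nothing about the magnitude of the effect in real resamplers.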
Anthology ID:
2024.naacl-short.45
Volume:
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
540–549
URL:
https://aclanthology.org/2024.naacl-short.45
Cite (ACL):
Georgios Pantazopoulos, Alessandro Suglia, Oliver Lemon, and Arash Eshghi. 2024. Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 540–549, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers (Pantazopoulos et al., NAACL 2024)
PDF:
https://aclanthology.org/2024.naacl-short.45.pdf