The Compositional Grounding Gap: Why Vision-Language Models Fail at Relational Reasoning and How to Fix It

Kaustubh S. Bukkapatnam

Abstract

Large vision-language models (LVLMs) achieve strong performance on many multimodal tasks, yet consistently fail at compositional relational reasoning—distinguishing "the cat on the mat" from "the mat on the cat." We provide a formal explanation for this failure. We prove that any vision-language alignment operating on pooled (order-invariant) visual features contains compositional blind spots: semantically distinct scenes that map to identical representations. We show that the number of blind spots grows factorially with scene complexity, establishing a fundamental limit on pooled-feature architectures. Motivated by this analysis, we propose REGROUND, a training-free, test-time method that re-introduces spatial structure into alignment by performing relation-guided cross-attention over spatial visual tokens, directed by a lightweight parse of the text query. Without any fine-tuning, REGROUND improves compositional accuracy by +8.6 points on Winoground, +8.4 on ARO-Relation, +6.4 on SugarCrepe, and +8.4 on VSR when applied to LLaVA-1.5, and provides consistent gains across other LVLMs. Ablation studies confirm that each component—parse guidance, token-level attention, and relation masking—contributes significantly.
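The blind-spot claim in the abstract can be illustrated with a toy example. The sketch below is not the paper's proof and not REGROUND. It assumes mean pooling as the order-invariant operation and uses made-up random vectors (`cat`, `mat`, `bg`) as stand-ins for spatial visual tokens. It shows only that two differently arranged scenes pool to the same vector, and that all n! orderings of n distinct tokens collide in this way.

```python
import itertools
import math

import numpy as np


def mean_pool(tokens: np.ndarray) -> np.ndarray:
    """Order-invariant pooling: average token features over positions."""
    return tokens.mean(axis=0)


# Hypothetical toy features, one 4-dimensional row per spatial token.
rng = np.random.default_rng(0)
cat, mat, bg = rng.normal(size=(3, 4))

# Two distinct "scenes": the same tokens at swapped spatial positions.
scene_a = np.stack([cat, mat, bg])  # e.g. cat above mat
scene_b = np.stack([mat, cat, bg])  # e.g. mat above cat

same_tokens = bool(np.array_equal(scene_a, scene_b))
same_pooled = bool(np.allclose(mean_pool(scene_a), mean_pool(scene_b)))

# Count how many orderings of the n tokens share one pooled vector.
n = scene_a.shape[0]
reference = mean_pool(scene_a)
collisions = sum(
    int(np.allclose(mean_pool(scene_a[list(perm)]), reference))
    for perm in itertools.permutations(range(n))
)

print(f"token layouts identical: {same_tokens}")
print(f"pooled vectors identical: {same_pooled}")
print(f"colliding orderings: {collisions} (n! = {math.factorial(n)})")
```

Any alignment computed only from the pooled vector cannot tell `scene_a` from `scene_b`, which is the kind of collision the abstract refers to; keeping the per-position tokens preserves the distinction.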

Anthology ID: 2026.alvr-main.28
Volume: Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Qianqi Yan, Syrielle Montariol, Yue Fan, Jing Gu, Jiayi Pan, Manling Li, Parisa Kordjamshidi, Alane Suhr, Xin Eric Wang
Venues: ALVR | WS
Publisher: Association for Computational Linguistics
Pages: 287–293
URL: https://aclanthology.org/2026.alvr-main.28/
Cite (ACL): Kaustubh S. Bukkapatnam. 2026. The Compositional Grounding Gap: Why Vision-Language Models Fail at Relational Reasoning and How to Fix It. In Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR), pages 287–293, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): The Compositional Grounding Gap: Why Vision-Language Models Fail at Relational Reasoning and How to Fix It (Bukkapatnam, ALVR 2026)
PDF: https://aclanthology.org/2026.alvr-main.28.pdf
