Measuring How (Not Just Whether) VLMs Build Common Ground

Saki Imai; Mert Inan; Anthony B. Sicilia; Malihe Alikhani

Abstract

Large vision-language models (VLMs) are increasingly credited with reasoning skills, yet current benchmarks evaluate them in single-turn or question-answering settings. Grounding, however, is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a suite of four metrics (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We apply the suite to 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, with GPT4o-mini the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.

Anthology ID: 2025.ranlp-1.53
Volume: Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Month: September
Year: 2025
Address: Varna, Bulgaria
Editors: Galia Angelova, Maria Kunilovskaya, Marie Escribe, Ruslan Mitkov
Venue: RANLP
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 441–451
URL: https://aclanthology.org/2025.ranlp-1.53/
Cite (ACL): Saki Imai, Mert Inan, Anthony B. Sicilia, and Malihe Alikhani. 2025. Measuring How (Not Just Whether) VLMs Build Common Ground. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 441–451, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): Measuring How (Not Just Whether) VLMs Build Common Ground (Imai et al., RANLP 2025)
PDF: https://aclanthology.org/2025.ranlp-1.53.pdf
