Using Game Play to Investigate Multimodal and Conversational Grounding in Large Multimodal Models

Sherzod Hakimov, Yerkezhan Abdullayeva, Kushal Koshti, Antonia Schmidt, Yan Weiser, Anne Beyer, David Schlangen


Abstract
While the situation has improved for text-only models, multimodal (text and image) models once again seem to be developing faster than the ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through goal-oriented game (self-)play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model’s capability to represent a situation from visual information and to align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of models, ensuring the continued relevance of the benchmark.
Anthology ID:
2025.coling-main.381
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
5686–5718
URL:
https://aclanthology.org/2025.coling-main.381/
Cite (ACL):
Sherzod Hakimov, Yerkezhan Abdullayeva, Kushal Koshti, Antonia Schmidt, Yan Weiser, Anne Beyer, and David Schlangen. 2025. Using Game Play to Investigate Multimodal and Conversational Grounding in Large Multimodal Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 5686–5718, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Using Game Play to Investigate Multimodal and Conversational Grounding in Large Multimodal Models (Hakimov et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.381.pdf