@inproceedings{pushpita-levy-2024-image,
title = "Image-conditioned human language comprehension and psychometric benchmarking of visual language models",
author = "Pushpita, Subha Nawer and
Levy, Roger P.",
editor = "Barak, Libby and
Alikhani, Malihe",
booktitle = "Proceedings of the 28th Conference on Computational Natural Language Learning",
month = nov,
year = "2024",
address = "Miami, FL, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.conll-1.34",
pages = "447--457",
abstract = "Large language model (LLM)s{'} next-word predictions have shown impressive performance in capturing human expectations during real-time language comprehension. This finding has enabled a line of research on psychometric benchmarking of LLMs against human language-comprehension data in order to reverse-engineer humans{'} linguistic subjective probability distributions and representations. However, to date, this work has exclusively involved unimodal (language-only) comprehension data, whereas much human language use takes place in rich multimodal contexts. Here we extend psychometric benchmarking to visual language models (VLMs). We develop a novel experimental paradigm, $\textit{Image-Conditioned Maze Reading}$, in which participants first view an image and then read a text describing an image within the Maze paradigm, yielding word-by-word reaction-time measures with high signal-to-noise ratio and good localization of expectation-driven language processing effects. We find a large facilitatory effect of correct image context on language comprehension, not only for words such as concrete nouns that are directly grounded in the image but even for ungrounded words in the image descriptions. Furthermore, we find that VLM surprisal captures most to all of this effect. We use these findings to benchmark a range of VLMs, showing that models with lower perplexity generally have better psychometric performance, but that among the best VLMs tested perplexity and psychometric performance dissociate. Overall, our work offers new possibilities for connecting psycholinguistics with multimodal LLMs for both scientific and engineering goals.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="pushpita-levy-2024-image">
  <titleInfo>
    <title>Image-conditioned human language comprehension and psychometric benchmarking of visual language models</title>
  </titleInfo>
  <name type="personal">
    <namePart type="given">Subha</namePart>
    <namePart type="given">Nawer</namePart>
    <namePart type="family">Pushpita</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Roger</namePart>
    <namePart type="given">P</namePart>
    <namePart type="family">Levy</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <originInfo>
    <dateIssued>2024-11</dateIssued>
  </originInfo>
  <typeOfResource>text</typeOfResource>
  <relatedItem type="host">
    <titleInfo>
      <title>Proceedings of the 28th Conference on Computational Natural Language Learning</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Libby</namePart>
      <namePart type="family">Barak</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Malihe</namePart>
      <namePart type="family">Alikhani</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <originInfo>
      <publisher>Association for Computational Linguistics</publisher>
      <place>
        <placeTerm type="text">Miami, FL, USA</placeTerm>
      </place>
    </originInfo>
    <genre authority="marcgt">conference publication</genre>
  </relatedItem>
  <abstract>The next-word predictions of large language models (LLMs) have shown impressive performance in capturing human expectations during real-time language comprehension. This finding has enabled a line of research on psychometric benchmarking of LLMs against human language-comprehension data in order to reverse-engineer humans’ linguistic subjective probability distributions and representations. However, to date, this work has exclusively involved unimodal (language-only) comprehension data, whereas much human language use takes place in rich multimodal contexts. Here we extend psychometric benchmarking to visual language models (VLMs). We develop a novel experimental paradigm, Image-Conditioned Maze Reading, in which participants first view an image and then read a text describing an image within the Maze paradigm, yielding word-by-word reaction-time measures with high signal-to-noise ratio and good localization of expectation-driven language processing effects. We find a large facilitatory effect of correct image context on language comprehension, not only for words such as concrete nouns that are directly grounded in the image but even for ungrounded words in the image descriptions. Furthermore, we find that VLM surprisal captures most to all of this effect. We use these findings to benchmark a range of VLMs, showing that models with lower perplexity generally have better psychometric performance, but that among the best VLMs tested perplexity and psychometric performance dissociate. Overall, our work offers new possibilities for connecting psycholinguistics with multimodal LLMs for both scientific and engineering goals.</abstract>
  <identifier type="citekey">pushpita-levy-2024-image</identifier>
  <location>
    <url>https://aclanthology.org/2024.conll-1.34</url>
  </location>
  <part>
    <date>2024-11</date>
    <extent unit="page">
      <start>447</start>
      <end>457</end>
    </extent>
  </part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Image-conditioned human language comprehension and psychometric benchmarking of visual language models
%A Pushpita, Subha Nawer
%A Levy, Roger P.
%Y Barak, Libby
%Y Alikhani, Malihe
%S Proceedings of the 28th Conference on Computational Natural Language Learning
%D 2024
%8 November
%I Association for Computational Linguistics
%C Miami, FL, USA
%F pushpita-levy-2024-image
%X The next-word predictions of large language models (LLMs) have shown impressive performance in capturing human expectations during real-time language comprehension. This finding has enabled a line of research on psychometric benchmarking of LLMs against human language-comprehension data in order to reverse-engineer humans’ linguistic subjective probability distributions and representations. However, to date, this work has exclusively involved unimodal (language-only) comprehension data, whereas much human language use takes place in rich multimodal contexts. Here we extend psychometric benchmarking to visual language models (VLMs). We develop a novel experimental paradigm, Image-Conditioned Maze Reading, in which participants first view an image and then read a text describing an image within the Maze paradigm, yielding word-by-word reaction-time measures with high signal-to-noise ratio and good localization of expectation-driven language processing effects. We find a large facilitatory effect of correct image context on language comprehension, not only for words such as concrete nouns that are directly grounded in the image but even for ungrounded words in the image descriptions. Furthermore, we find that VLM surprisal captures most to all of this effect. We use these findings to benchmark a range of VLMs, showing that models with lower perplexity generally have better psychometric performance, but that among the best VLMs tested perplexity and psychometric performance dissociate. Overall, our work offers new possibilities for connecting psycholinguistics with multimodal LLMs for both scientific and engineering goals.
%U https://aclanthology.org/2024.conll-1.34
%P 447-457
Markdown (Informal)
[Image-conditioned human language comprehension and psychometric benchmarking of visual language models](https://aclanthology.org/2024.conll-1.34) (Pushpita & Levy, CoNLL 2024)
ACL
Subha Nawer Pushpita and Roger P. Levy. 2024. Image-conditioned human language comprehension and psychometric benchmarking of visual language models. In Proceedings of the 28th Conference on Computational Natural Language Learning, pages 447–457, Miami, FL, USA. Association for Computational Linguistics.