A Multimodal Large Language Model “Foresees” Objects Based on Verb Information but Not Gender

Shuqi Wang, Xufeng Duan, Zhenguang Cai


Abstract
This study employs the visual world eye-tracking paradigm (VWP), a classic psycholinguistic paradigm, to explore the predictive capabilities of LLaVA, a multimodal large language model (MLLM), and to compare them with human anticipatory gaze behavior. Specifically, we examine the attention weight distributions of LLaVA when it is presented with visual displays and English sentences containing verb and gender cues. Our findings reveal that LLaVA, like humans, can predictively attend to objects relevant to a verb, but fails to demonstrate gender-based anticipatory attention. Layer-wise analysis indicates that the model's middle layers are more strongly related to predictive attention than its early or late layers. This study pioneers the use of psycholinguistic paradigms to compare the multimodal predictive attention of humans and MLLMs, revealing both similarities and differences between them.
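To illustrate the kind of measurement the abstract describes, the sketch below shows one way to read out per-layer attention from a LLaVA checkpoint with Hugging Face transformers and summarize how much a text token attends to image-patch tokens. It is not the authors' released code or exact procedure: the checkpoint ID, the image file, the prompt, and the use of the final prompt token as a stand-in for the critical verb region are illustrative assumptions.

```python
# Minimal sketch (illustrative only, not the paper's pipeline): per-layer attention
# from a LLaVA checkpoint to the image-patch tokens, via Hugging Face transformers.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"          # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="eager",               # eager attention so weights can be returned
)
model.eval()

image = Image.open("display.png")              # hypothetical visual-world display
prompt = "USER: <image>\nThe boy will eat the ASSISTANT:"  # example stimulus sentence

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple over decoder layers; each tensor has shape
# (batch, num_heads, seq_len, seq_len). In recent transformers versions the processor
# expands the <image> placeholder into one token per image patch, so the patch
# positions can be located directly in input_ids.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
image_positions = (inputs["input_ids"][0] == image_token_id).nonzero(as_tuple=True)[0]

# Attention from the last prompt token (a stand-in for the critical verb region)
# to all image-patch tokens, averaged over heads, reported layer by layer.
for layer_idx, attn in enumerate(outputs.attentions):
    to_image = attn[0, :, -1, image_positions].mean().item()
    print(f"layer {layer_idx:02d}: mean attention to image tokens = {to_image:.4f}")
```

In practice, one would locate the token positions of the verb (or gender cue) and of each depicted object's patch region rather than using the last token and all image patches; the loop above only shows where the layer-wise attention tensors come from.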
Anthology ID:
2024.conll-1.32
Volume:
Proceedings of the 28th Conference on Computational Natural Language Learning
Month:
November
Year:
2024
Address:
Miami, FL, USA
Editors:
Libby Barak, Malihe Alikhani
Venue:
CoNLL
Publisher:
Association for Computational Linguistics
Pages:
435–441
URL:
https://aclanthology.org/2024.conll-1.32
Cite (ACL):
Shuqi Wang, Xufeng Duan, and Zhenguang Cai. 2024. A Multimodal Large Language Model “Foresees” Objects Based on Verb Information but Not Gender. In Proceedings of the 28th Conference on Computational Natural Language Learning, pages 435–441, Miami, FL, USA. Association for Computational Linguistics.
Cite (Informal):
A Multimodal Large Language Model “Foresees” Objects Based on Verb Information but Not Gender (Wang et al., CoNLL 2024)
PDF:
https://aclanthology.org/2024.conll-1.32.pdf