Are Language-and-Vision Transformers Sensitive to Discourse? A Case Study of ViLBERT

Ekaterina Voloshina, Nikolai Ilinykh, Simon Dobnik


Abstract
Language-and-vision models have shown good performance on tasks such as image-caption matching and caption generation. However, it remains challenging for such models to generate pragmatically correct captions, i.e., captions that adequately reflect what is happening in one image or across several images. Evaluating this behaviour is crucial for understanding the reasons behind it. Here we explore to what extent contextual language-and-vision models are sensitive to discourse, both textual and visual. In particular, we employ a multi-modal transformer (ViLBERT) and test whether it can match descriptions and images, differentiating them from distractors of varying degrees of similarity that are sampled from different visual and textual contexts. We place our evaluation in a multi-sentence and multi-image setup, where images and sentences are expected to form a single narrative structure. We show that the model can distinguish different situations, but it is not sensitive to differences within one narrative structure. We also show that performance depends on the task itself, for example, on which modality remains unchanged in non-matching pairs and on how similar the non-matching pairs are to the original pairs.
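The matching task the abstract describes amounts to scoring image-text alignment and checking whether the true caption outscores distractors. Below is a minimal sketch of such an evaluation. Since ViLBERT has no standard Hugging Face interface, the sketch substitutes ViLT (a comparable image-text matching transformer) as a stand-in; the image path and candidate captions are hypothetical placeholders, not the paper's actual data.

```python
# A hedged sketch of image-caption matching evaluation, assuming a ViLT
# stand-in for ViLBERT and placeholder image/captions (not the paper's data).
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForImageAndTextRetrieval

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-coco")
model = ViltForImageAndTextRetrieval.from_pretrained(
    "dandelin/vilt-b32-finetuned-coco"
)
model.eval()

image = Image.open("example_image.jpg")  # hypothetical image file
captions = [
    "a man feeding pigeons in the park",   # assumed true caption
    "a man feeding pigeons near a fountain",  # near distractor (same narrative)
    "a woman riding a bicycle downtown",   # far distractor (different context)
]

# Score each caption against the image; the model's logit is an
# image-text matching score, higher meaning better alignment.
scores = {}
with torch.no_grad():
    for caption in captions:
        encoding = processor(image, caption, return_tensors="pt")
        outputs = model(**encoding)
        scores[caption] = outputs.logits[0, 0].item()

# The model "passes" this item if the true caption outscores all distractors.
best = max(scores, key=scores.get)
print(f"Top-scored caption: {best}")
```

The paper's finding corresponds to the near-distractor case above: distractors drawn from a different situation are reliably outscored, while distractors from within the same narrative structure are not.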
Anthology ID:
2023.mmnlg-1.4
Volume:
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)
Month:
September
Year:
2023
Address:
Prague, Czech Republic
Editors:
Albert Gatt, Claire Gardent, Liam Cripwell, Anya Belz, Claudia Borg, Aykut Erdem, Erkut Erdem
Venues:
MMNLG | WS
Publisher:
Association for Computational Linguistics
Pages:
28–38
URL:
https://aclanthology.org/2023.mmnlg-1.4
Cite (ACL):
Ekaterina Voloshina, Nikolai Ilinykh, and Simon Dobnik. 2023. Are Language-and-Vision Transformers Sensitive to Discourse? A Case Study of ViLBERT. In Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023), pages 28–38, Prague, Czech Republic. Association for Computational Linguistics.
Cite (Informal):
Are Language-and-Vision Transformers Sensitive to Discourse? A Case Study of ViLBERT (Voloshina et al., MMNLG-WS 2023)
PDF:
https://aclanthology.org/2023.mmnlg-1.4.pdf