How Vision Affects Language: Comparing Masked Self-Attention in Uni-Modal and Multi-Modal Transformer

Nikolai Ilinykh, Simon Dobnik


Abstract
The problem of interpreting the knowledge learned by multi-head self-attention in transformers has been one of the central questions in NLP. However, most work has focused on models trained for uni-modal tasks, e.g. machine translation. In this paper, we examine masked self-attention in a multi-modal transformer trained for the task of image captioning. In particular, we test whether the multi-modality of the task objective affects the learned attention patterns. Our visualisations of masked self-attention demonstrate that (i) it can learn general linguistic knowledge of the textual input, and (ii) its attention patterns incorporate artefacts from the visual modality even though it never accesses it directly. We compare our transformer’s attention patterns with masked attention in distilgpt-2, tested on uni-modal text generation of image captions. Based on the maps of extracted attention weights, we argue that masked self-attention in the image captioning transformer appears to be enhanced with semantic knowledge from images, exemplifying joint language-and-vision information in its attention patterns.
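As a rough illustration of the kind of attention-weight extraction the abstract describes for the uni-modal baseline, the sketch below pulls per-layer masked self-attention maps from distilgpt-2 using the Hugging Face transformers library. This is a minimal, assumed reconstruction of the general procedure (the example caption and variable names are hypothetical), not the authors' actual analysis pipeline.

```python
# Minimal sketch: extract masked self-attention weights from distilgpt-2
# for a single caption, ready for plotting as heat maps.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

caption = "a man riding a horse on the beach"  # hypothetical example caption
inputs = tokenizer(caption, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len); the upper triangle is zeroed out
# by the causal mask, since the model is auto-regressive.
for layer_idx, layer_attn in enumerate(outputs.attentions):
    print(f"layer {layer_idx}: {tuple(layer_attn.shape)}")

# e.g. the weights of head 0 in the last layer, as a (seq_len, seq_len) map
last_layer_head0 = outputs.attentions[-1][0, 0]
```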
Anthology ID:
2021.mmsr-1.5
Volume:
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)
Month:
June
Year:
2021
Address:
Groningen, Netherlands (Online)
Editors:
Lucia Donatelli, Nikhil Krishnaswamy, Kenneth Lai, James Pustejovsky
Venue:
MMSR
SIG:
SIGSEM
Publisher:
Association for Computational Linguistics
Pages:
45–55
URL:
https://aclanthology.org/2021.mmsr-1.5
Cite (ACL):
Nikolai Ilinykh and Simon Dobnik. 2021. How Vision Affects Language: Comparing Masked Self-Attention in Uni-Modal and Multi-Modal Transformer. In Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR), pages 45–55, Groningen, Netherlands (Online). Association for Computational Linguistics.
Cite (Informal):
How Vision Affects Language: Comparing Masked Self-Attention in Uni-Modal and Multi-Modal Transformer (Ilinykh & Dobnik, MMSR 2021)
PDF:
https://aclanthology.org/2021.mmsr-1.5.pdf