What Does BERT with Vision Look At?

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang


Abstract
Pre-trained visually grounded language models such as ViLBERT, LXMERT, and UNITER have achieved significant performance improvements on vision-and-language tasks, but what they learn during pre-training remains unclear. In this work, we demonstrate that certain attention heads of a visually grounded language model actively ground elements of language to image regions. Specifically, some heads can map entities to image regions, performing the task known as entity grounding. Some heads can even detect the syntactic relations between non-entity words and image regions, tracking, for example, associations between verbs and the regions corresponding to their arguments. We denote this ability as syntactic grounding. We verify grounding both quantitatively and qualitatively, using Flickr30K Entities as a testbed.
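The abstract notes that grounding is verified quantitatively on Flickr30K Entities. As an illustration only, and not the authors' released evaluation code, the following minimal sketch shows one common way attention-based entity grounding can be scored: for each entity token, take the image region a given head attends to most strongly and count a hit when that region overlaps a gold box. The function names, the synthetic inputs, and the IoU >= 0.5 criterion are assumptions made for this sketch.

import numpy as np

def iou(box_a, box_b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def grounding_accuracy(attn, region_boxes, entity_token_ids, gold_boxes, thresh=0.5):
    # attn:             [num_text_tokens, num_regions] attention weights of one head
    #                   (text tokens attending to image-region tokens).
    # region_boxes:     [num_regions, 4] proposal boxes fed to the model.
    # entity_token_ids: indices of text tokens heading an entity mention.
    # gold_boxes:       gold boxes, parallel to entity_token_ids.
    # Returns the fraction of entities whose most-attended region matches a gold box.
    hits = 0
    for tok, gold in zip(entity_token_ids, gold_boxes):
        best_region = int(np.argmax(attn[tok]))  # region with maximum attention
        if iou(region_boxes[best_region], gold) >= thresh:
            hits += 1
    return hits / max(len(entity_token_ids), 1)

# Tiny synthetic example: 3 text tokens, 2 candidate regions (illustrative values only).
attn = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.5, 0.5]])
region_boxes = np.array([[0, 0, 10, 10], [20, 20, 40, 40]], dtype=float)
entity_token_ids = [0, 1]
gold_boxes = [(1, 1, 9, 9), (0, 0, 5, 5)]  # second entity is mis-grounded
print(grounding_accuracy(attn, region_boxes, entity_token_ids, gold_boxes))  # 0.5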
Anthology ID: 2020.acl-main.469
Volume: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2020
Address: Online
Editors: Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 5265–5275
URL: https://aclanthology.org/2020.acl-main.469
DOI: 10.18653/v1/2020.acl-main.469
Cite (ACL): Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2020. What Does BERT with Vision Look At?. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5265–5275, Online. Association for Computational Linguistics.
Cite (Informal): What Does BERT with Vision Look At? (Li et al., ACL 2020)
PDF: https://aclanthology.org/2020.acl-main.469.pdf
Video: http://slideslive.com/38928841