Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models

Taichi Iki; Akiko Aizawa

doi:10.18653/v1/2021.emnlp-main.167

Abstract

A method for creating a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension aims to make a V&L model inherit the capability of natural language understanding (NLU) from the original language model. To see how well this is achieved, we propose to evaluate V&L models using an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream models, trained with the same pre-training. Dual-stream models, with their higher modality independence achieved by approximately doubling the number of parameters, are expected to preserve the NLU capability better. Our main finding is that the dual-stream scores are not much different than the single-stream scores, contrary to expectation. Further analysis shows that pre-training causes the performance drop in NLU tasks with few exceptions. These results suggest that adopting a single-stream structure and devising the pre-training could be an effective method for improving the maintenance of language knowledge in V&L extensions.

Anthology ID: 2021.emnlp-main.167
Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2021
Address: Online and Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 2189–2196
URL: https://aclanthology.org/2021.emnlp-main.167/
DOI: 10.18653/v1/2021.emnlp-main.167
Cite (ACL): Taichi Iki and Akiko Aizawa. 2021. Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2189–2196, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models (Iki & Aizawa, EMNLP 2021)
PDF: https://aclanthology.org/2021.emnlp-main.167.pdf
Video: https://aclanthology.org/2021.emnlp-main.167.mp4
