Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models

Taichi Iki, Akiko Aizawa


Abstract
A common way to create a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension aims to have the V&L model inherit the natural language understanding (NLU) capability of the original language model. To see how well this is achieved, we propose evaluating V&L models with an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream architectures, trained with the same pre-training procedure. Dual-stream models, whose roughly doubled parameter count gives them greater modality independence, are expected to preserve NLU capability better. Our main finding is that, contrary to this expectation, the dual-stream scores do not differ much from the single-stream scores. Further analysis shows that V&L pre-training causes the drop in NLU performance, with few exceptions. These results suggest that adopting a single-stream structure and carefully designing the pre-training could be an effective way to preserve language knowledge in V&L extensions.
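The evaluation setting described above (fine-tuning a pre-trained text encoder on a GLUE task) can be sketched briefly. The snippet below is not the authors' pipeline, which is available in the repository listed under Code; it is a minimal sketch using the Hugging Face Transformers and Datasets libraries, with bert-base-uncased as a stand-in for a V&L model's language stream and SST-2 as the GLUE task. All model and hyperparameter choices here are illustrative assumptions.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Stand-in text encoder; the paper instead evaluates the language parts of
# extended V&L models (see the checkpoints referenced in alab-nii/eval_vl_glue).
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# SST-2 is one of the GLUE tasks listed under "Data" below.
sst2 = load_dataset("glue", "sst2")
encoded = sst2.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sst2_out", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,  # default collator then pads each batch dynamically
)
trainer.train()
print(trainer.evaluate())  # reports eval_loss; add compute_metrics for accuracy

The same recipe would be repeated per GLUE task and per model variant to compare single-stream and dual-stream extensions against the original language model.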
Anthology ID:
2021.emnlp-main.167
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2189–2196
URL:
https://aclanthology.org/2021.emnlp-main.167
DOI:
10.18653/v1/2021.emnlp-main.167
Cite (ACL):
Taichi Iki and Akiko Aizawa. 2021. Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2189–2196, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models (Iki & Aizawa, EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.167.pdf
Video:
https://aclanthology.org/2021.emnlp-main.167.mp4
Code:
alab-nii/eval_vl_glue
Data:
CoLA, GLUE, MRPC, QNLI, SST, SST-2