Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models

A method for creating a vision-and-language (V&L) model is to extend a language model through structural modifications and V&L pre-training. Such an extension aims to make the V&L model inherit the capability of natural language understanding (NLU) from the original language model. To see how well this is achieved, we propose to evaluate V&L models on an NLU benchmark (GLUE). We compare five V&L models, including single-stream and dual-stream models, trained with the same pre-training. Dual-stream models, with their higher modality independence achieved by approximately doubling the number of parameters, are expected to preserve the NLU capability better. Our main finding is that, contrary to expectation, the dual-stream scores are not much different from the single-stream scores. Further analysis shows that, with few exceptions, pre-training causes the performance drop on the NLU tasks. These results suggest that adopting a single-stream structure and devising the pre-training could be an effective method for maintaining language knowledge in V&L extensions.


Introduction
Pre-trained vision-and-language (V&L) models improve performance on tasks that require an understanding of V&L grounding, including visual question answering (Antol et al., 2015), referring expression comprehension (Kazemzadeh et al., 2014), and image-text matching (ITM) (Suhr et al., 2019). Recent V&L tasks, such as multimodal reading comprehension (Kembhavi et al., 2017; Yagcioglu et al., 2018; Hannan et al., 2020; Tanaka et al., 2021) and dialogue (Ilinykh et al., 2019; Haber et al., 2019; Udagawa and Aizawa, 2019), require deeper NLU as well as grounding. Extending pre-trained language models (LMs) is an option for those tasks, as it allows V&L models to inherit language knowledge from their source LMs. A typical extension consists of visual pre-training and a structural choice such as the stream type: single-stream models insert vision tokens into the input sequence of the LM, while dual-stream models use a separate sequence for early visual encoding.
One of the remaining challenges is to understand how such extensions affect the pre-trained language knowledge. For example, Lu et al. (2019) proposed a dual-stream model where part of the goal was to protect the learned LM. The authors focused on evaluation with V&L tasks and did not evaluate their models on language-only tasks.  evaluated the extent of language knowledge loss in single/dual-stream models against the source LM using language-only tasks. However, the difference between the single-stream and dual-stream models was unclear because the pre-training also differed across their models.
In this paper, we investigate the effect of visual extensions in V&L models on language-only tasks 1 . Bugliarello et al. (2020) proposed a framework to unify transformer-based V&L models and compared some single/dual-stream models in the same setup. Based on their work, our study shows how these structural differences affect the performance of NLU using the GLUE (Wang et al., 2019) tasks.
In our experiments, fine-tuning of pre-trained V&L models shows that both single- and dual-stream models perform worse than the source LM and that the single-stream models perform slightly better than the dual-stream models. Further, we fine-tune models created by the structural modifications alone, without pre-training. We observe that the single/dual modification alone has little effect on the GLUE scores, indicating that the performance degradation is primarily caused by pre-training. We also examine how the V&L models changed from the source LM by analyzing the changes in the model parameters and the problem sets that each model can solve. Our results suggest that it would be more effective to adopt a single-stream structure and devise pre-training strategies for maintaining language knowledge.

Controlled V&L Models
In this section, we describe the pre-trained V&L models used in our experiments. Bugliarello et al. (2020) proposed a framework for V&L models that treats a sequence of sentence tokens as language information and a sequence of recognized object regions as visual information. In their framework, they reproduced five existing models, VisualBERT (Li et al., 2019), Uniter , VL-BERT (Su et al., 2020), ViLBERT (Lu et al., 2019), and LXMERT (Tan and Bansal, 2019), and made controlled versions of them by modifying some parts for a fairer and easier comparison. We use these controlled versions.

Structural Modification
We describe streams and embeddings, which are the basic factors of the model structures. We summarize the model structures in the controlled setup used in this experiment in Table 1.
Streams. V&L models can be divided into two categories based on how the vision and language sequences are encoded. Single-stream models (VisualBERT, Uniter, and VL-BERT) jointly process the vision and language sequences in a single encoder. Dual-stream models (ViLBERT and LXMERT) encode the two sequences separately before encoding them jointly. ViLBERT is an early example of the dual-stream models and was proposed mainly to account for the different abstraction levels of vision and language and to protect learned language models. In the controlled setup of Bugliarello et al. (2020), the stream type of every model is identical to the original.
Embeddings. The major difference in embeddings is the use of a global visual feature. The original VisualBERT, Uniter, and LXMERT do not use a global visual feature. ViLBERT has a token that represents the global visual feature at the beginning of the vision sequence. VL-BERT inserts the global visual feature at the end of the vision sequence and also adds it to each token embedding in the language sequence. Object location is also expressed differently. The original VL-BERT and LXMERT use four attributes (left, top, right, bottom). In addition to these four, the original ViLBERT uses area, and the original Uniter uses width, height, and area. VisualBERT does not use location information 2 .
The controlled setup is based on the structure of ViLBERT. For the global visual feature, the setup inserts the average of the vision tokens at the head of the vision sequence for all models. In addition to inserting the global visual feature, the controlled VL-BERT adds it to the respective tokens in the language sequence. For location, VisualBERT's setup, which does not use location information, remains the same, while the other models use the five attributes (left, top, right, bottom, area), normalized by image width or height. Another point is the token type for the vision tokens. In the controlled setup, no token type is added for ViLBERT and LXMERT because they have separate streams. Of the single-stream models, VisualBERT and Uniter use BERT's token type IDs to specify vision tokens, while VL-BERT adds a new embedding to represent them.
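Two details of the controlled setup above lend themselves to a short illustration: the global visual feature is the average of the region features and is prepended to the vision sequence, and each region's five location attributes are normalized by the image size. The sketch below is our own (not the released code), and the function names are ours.

```python
# Illustrative sketch (not the authors' implementation) of the controlled
# setup's global visual feature and normalized location attributes.

def add_global_feature(vision_tokens):
    """Prepend the mean of the region features as a global visual token."""
    dim = len(vision_tokens[0])
    global_tok = [sum(t[i] for t in vision_tokens) / len(vision_tokens)
                  for i in range(dim)]
    return [global_tok] + vision_tokens

def location_features(box, img_w, img_h):
    """(left, top, right, bottom) -> five attributes normalized by image size."""
    left, top, right, bottom = box
    area = (right - left) * (bottom - top) / (img_w * img_h)
    return [left / img_w, top / img_h, right / img_w, bottom / img_h, area]
```

For example, a region covering the left half of the lower half of an image yields normalized coordinates in [0, 1] and a relative area of 0.25.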

Pre-training
We summarize the pre-training used in the controlled setup to train the five model structures described above. Note that we omit the details of the pre-training used in each original paper.
The five models were pre-trained on Google's Conceptual Captions corpus (Sharma et al., 2018), which was collected from Web images and their alt-text HTML attributes. The corpus was filtered before training, resulting in approximately 2.7M image-text pairs. Three tasks, masked language modeling (MLM), masked object classification (MOC), and ITM, were constructed from the image-text pairs in the corpus. Given an image-text pair, the model predicts masked language tokens for MLM, the object class of masked vision tokens for MOC, and whether the pair is correct for ITM.
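To make the three objectives concrete, the sketch below shows how training signals can be derived from an image-text pair. This is our own hedged illustration: the 15% masking rate follows BERT's convention, the function names are ours, and MOC is analogous to MLM with vision tokens masked instead.

```python
import random

# Hedged sketch of deriving MLM and ITM training examples from an
# image-text pair; not the paper's actual data pipeline.

def make_mlm_example(tokens, mask_prob=0.15, rng=random.Random(0)):
    """Mask a subset of language tokens; the model predicts the originals."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)       # supervise only masked positions
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

def make_itm_example(image_id, caption, corpus, rng=random.Random(0)):
    """With probability 0.5, swap in a caption from another image; the
    model predicts whether the pair is aligned (label 1) or not (0)."""
    if rng.random() < 0.5:
        return image_id, caption, 1                      # matched pair
    other = rng.choice([c for i, c in corpus if i != image_id])
    return image_id, other, 0                            # mismatched pair
```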
The weights of the five models were initialized with the pre-trained weights of BERT BASE where the corresponding weights exist in BERT BASE ; otherwise (e.g., the weights of the vision encoder in the dual-stream models), they were initialized randomly.
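This initialization rule can be sketched as a simple name-matching copy. The sketch below is our illustration, with hypothetical parameter names; the actual released code maps weights differently in detail.

```python
import random

# Minimal sketch (our illustration) of the initialization rule: inherit a
# weight from BERT-base when a parameter with the same name exists there,
# otherwise initialize it randomly (e.g. a dual-stream vision encoder).

def init_vl_model(vl_param_names, bert_weights, rng=random.Random(0)):
    weights = {}
    for name in vl_param_names:
        if name in bert_weights:
            weights[name] = bert_weights[name]      # inherited from BERT
        else:
            weights[name] = rng.gauss(0.0, 0.02)    # freshly initialized
    return weights
```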

Datasets
The GLUE benchmark (Wang et al., 2019) is a collection of diverse tasks for studying NLU systems: CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE (Giampiccolo et al., 2007; Bentivogli et al., 2009), and WNLI (Levesque et al., 2012). STS-B is a single-valued regression task, and the others are classification tasks. We train the controlled pre-trained models on the training sets and evaluate them on the development sets. Figure 1 (left) shows the number of training sentences in the task corpora and their word overlap with the corpus used in the V&L pre-training.
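The word-overlap measure plotted in Figure 1 can be sketched as a vocabulary intersection; the exact definition used in the paper (tokenization, normalization) may differ, so this is our own assumption-laden sketch.

```python
# Sketch (ours, not the paper's script) of word overlap between a GLUE
# task corpus and the Conceptual Captions pre-training text: the fraction
# of the task's vocabulary that also appears in the pre-training corpus.

def word_overlap(task_sentences, pretrain_sentences):
    task_vocab = {w for s in task_sentences for w in s.lower().split()}
    pre_vocab = {w for s in pretrain_sentences for w in s.lower().split()}
    return len(task_vocab & pre_vocab) / len(task_vocab)
```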

Implementation Details
We fine-tuned the pre-trained models published by Bugliarello et al. (2020). We used black images for tasks where no image is provided, as a simple way to preserve the input format used in pre-training. However, there are many possible methods for complementing the image input; a method as simple as the present one could instead use other images, a noise input, or learnable parameters. Examining the impact of image-input completion methods remains future work.
Head for classification. We adopted the method used by Bugliarello et al. (2020) for V&L tasks. We used a learnable linear layer to calculate the likelihood of the classes, such as entailment/neutral/contradiction. We input the element-wise product of two vectors made from the model's output sequence into the linear layer. For those two vectors, we pooled the portions of the output sequence that correspond to the vision input and to the language input, respectively, by taking the first token of each portion. This corresponds to taking the outputs of the [CLS] token (in the language sequence) and the global visual token.
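The head described above can be sketched as follows. This is our minimal re-statement, not the released implementation; the toy linear layer and function names are ours.

```python
# Hedged sketch of the classification head: pool the [CLS] output of the
# language portion and the global-token output of the vision portion,
# take their element-wise product, and score each class with a learnable
# linear layer.

def classify(lang_outputs, vision_outputs, weight, bias):
    """lang_outputs / vision_outputs: lists of hidden vectors.
    weight: num_classes x hidden rows, bias: num_classes entries."""
    cls_vec = lang_outputs[0]        # [CLS] token of the language portion
    img_vec = vision_outputs[0]      # global visual token
    fused = [c * v for c, v in zip(cls_vec, img_vec)]   # element-wise product
    return [sum(w_i * f for w_i, f in zip(row, fused)) + b
            for row, b in zip(weight, bias)]            # class logits
```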
Hyperparameters for fine-tuning. We used a batch size of 64 and Adam for optimization. The learning rate was initialized at 2e-5 and decayed linearly. We trained for five epochs, evaluating the loss on the development sets at the end of each epoch, and adopted the checkpoint with the lowest loss.

Overall Result

Table 2 shows the results on the GLUE benchmark. In this experiment, we fine-tuned the five V&L models and their source language model, BERT BASE . We also cite the BiLSTM baseline from the GLUE paper; the other GLUE values are our results. We fine-tuned the pre-trained models on each task three times with different random seeds; we show the standard deviation in parentheses for the average and in Appendix B for each task. In the last column, we also show the scores on V&L tasks, calculated by averaging the results in Bugliarello et al. (2020); the details are described in Section 4.3.

The GLUE averages of the five V&L models decrease compared to BERT BASE . We can see a trend where the single-stream models perform slightly better than the dual-stream models. Note that this trend is consistent with the results of  for linguistic probing of the original Uniter and LXMERT. Although the difference is small, this suggests that the single-stream models maintain more of BERT BASE 's knowledge.
Performance of each task. The V&L models perform worse than the BiLSTM baseline on some tasks, including MRPC, RTE, and WNLI. Figure 1 (right) shows the correlation between the GLUE score and the word overlap between the pre-training corpus and each GLUE task corpus. We can see a positive correlation between these two variables. Although we cannot draw a firm conclusion because word overlap also correlates with the amount of training data, word overlap could have a large impact on task performance.

Amount of Change in Parameters
We expected a model's inference to be closer to BERT's if its parameters are closer to BERT's. Therefore, we calculated the cosine similarity of the corresponding parameters between the pre-trained models and BERT to indicate the degree to which the parameters had changed: we flattened the parameters and calculated their similarity as vectors. Table 3 shows the averaged cosine similarity. We can see that the parameters of the single-stream and dual-stream models changed to the same extent. This suggests that separating streams alone may not be sufficient for knowledge maintenance.

Table 4: Analysis of which models were successful in answering the classification tasks. STS-B was excluded because it is a regression task. We defined success on a problem as answering correctly in at least two out of three runs.

We first calculated the tables of successful models for each GLUE task and V&L model and then averaged the tables over the tasks. For all five models, there are approximately 5% of problems that only they can solve. This category shows the positive impact of V&L pre-training on NLU. Problems that both models can solve tended to be more common for the single-stream models, which supports the finding that these models retain more language knowledge. The difference in the corpora used for the last pre-training stage of BERT (mainly English Wikipedia) and of the V&L models (images' alt-texts) might affect the complexity of the sentences in the problem sets solved only by BERT and only by the V&L models. We therefore analyzed the distributions of some metrics (sentence length, readability) but found no significant difference between the two sets for each model. We show the distributions in Appendix C.
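The two analyses in this section can be sketched compactly. This is our re-implementation of the stated definitions, not the released analysis code.

```python
import math

# Sketch of the two analyses: (1) flatten parameters and compare them to
# BERT's by cosine similarity; (2) count a problem as "solved" when the
# model answers it correctly in at least two of three fine-tuning runs.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def solved(run_outcomes):
    """run_outcomes: booleans for the three runs with different seeds."""
    return sum(run_outcomes) >= 2
```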

Language and V&L Tasks
The last column of Table 2 shows the V&L scores of the V&L models. We calculated these scores by averaging the results on the five V&L tasks reported in Bugliarello et al. (2020). Their tasks cover four groups widely used to test V&L models: VQA, image-text retrieval, referring expressions, and multimodal verification. Comparing the V&L and GLUE scores, we can see that no model is best in both respects at the same time. There is room for improvement in the V&L extension.

Structural Modification or Pre-training: Which Has the Greater Impact?
To further analyze the impact of structural modification, we fine-tuned models with only the structural modifications (Mod. only). Table 5 compares the GLUE scores of the Mod-only models and the full models (Mod+V&L-PT). Except for VL CTRL , the Mod-only models achieve scores comparable to BERT BASE , while the GLUE score decreases for the Mod+V&L-PT models. The fact that the structural modification alone preserves the GLUE scores in most cases suggests that the main factor in the drop on the GLUE tasks is V&L pre-training. This observation emphasizes the impact of pre-training on maintaining language knowledge. Note that a possible reason for the exception of VL CTRL is that the global visual feature added to the language embeddings may break the language knowledge.

Discussion and Conclusion
The number of works on V&L models that focus on both V&L tasks and language-only tasks has increased (Ororbia et al., 2019; Lin et al., 2021; Hu and Singh, 2021). Ororbia et al. (2019) proposed a V&L neural architecture and trained it for language modeling in a visual context. They demonstrated that their architecture outperforms its language-only equivalent in perplexity and argued that language is inseparable from its physical context. Although it is not clear whether methods that improve the perplexity of language modeling also help maintain the performance of downstream tasks, the strategy of improving models with reference to human cognition is an important direction. More recently,  achieved better performance on language-only tasks than their base model by pre-training on three types of corpora (text, image, and image-text pairs) at the same time. Lin et al. (2021) reported that adding separate extractors for vision and language on top of a single-stream encoder can help maintain language knowledge.

In this paper, we fine-tuned V&L models extended from a language model (LM) on an NLU benchmark to compare their NLU performance. We used five V&L models, including single-stream and dual-stream models, pre-trained in the same setup. The benchmark scores of those models decreased compared with their source LM. We also found that the single-stream models tended to retain slightly more language knowledge than the dual-stream models, and that the main cause of the drop on the NLU tasks is pre-training. Our observations suggest that adopting a single stream and devising pre-training strategies could be effective, at least for preserving language knowledge.

Table 6: Training dataset statistics. CC: the Conceptual Captions dataset (Sharma et al., 2018). CAP: image captioning, P/S: paraphrase/similarity task, SS: single-sentence task.
Table 7: Standard deviations of our results on the GLUE tasks' development sets (Table 2). SDs are shown in parentheses below each value. We ran three experiments for each task.

C Additional Data for Analysis
We show the distributions of sentence length and readability mentioned in Section 4.2 in Figure 2 and Figure 3, respectively.