One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks

Current multimodal models, aimed at solving Vision and Language (V+L) tasks, predominantly repurpose Vision Encoders (VE) as feature extractors. While many VEs, of different architectures and trained on different data and objectives, are publicly available, they are not designed for the downstream V+L tasks. Nonetheless, most current work assumes that a single pre-trained VE can serve as a general-purpose encoder. In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary, i.e. whether providing the model with features from multiple VEs can improve the performance on a target task, and how these features are combined. We exhaustively experiment with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns. Our analyses suggest that diverse VEs complement each other, resulting in improved downstream V+L task performance, and that the improvements are not due to simple ensemble effects (i.e. the performance does not always improve when increasing the number of encoders). We demonstrate that future VEs, which are not repurposed but explicitly designed for V+L tasks, have the potential to improve performance on the target V+L tasks.


Introduction
The dominant strategy for solving Vision+Language (V+L) tasks involves using Transformer models (Vaswani et al., 2017) that jointly attend over the representations of the respective modalities (Lu et al., 2019; Su et al., 2020; Li et al., 2020b; Huang et al., 2020, inter alia). While representation learning of the text modality is comparatively straightforward using token embeddings (although still far from solved, especially in multilingual settings (Rust et al., 2021; Clark et al., 2022; Xue et al., 2022)), image representations are more difficult to learn. Here, the common method is to use pre-trained Vision Encoders (VE). Given an image, the VE's output features are passed as inputs, together with the text embeddings, into a Transformer model. The attention mechanism then learns a cross-modal representation space over the text and image features in order to solve the target V+L task.
Consequently, the success of a multimodal model builds heavily on the features extracted from a VE and is therefore highly dependent on the VE's architecture, training objectives (e.g. image classification, image encoding, object detection, etc.), and pre-training data. This dependency is most pronounced for multimodal models that utilize VEs as static feature extractors (i.e. the weights of the VE are frozen), but it also persists for models that are trained end-to-end, as the biases introduced by the architecture, objectives, and data of the VE remain.
Many computer vision models can be repurposed as VEs for V+L tasks. A number of prior works have focused on identifying the individual VEs that perform best on downstream tasks (Jiang et al., 2020; Shen et al., 2022; Eichenberg et al., 2021; Zhang et al., 2021). A common premise is that a single pre-trained VE performs best for a target task or can even serve as a general-purpose encoder for a wide range of V+L tasks. However, given that each VE is repurposed from its vision application, and not directly designed for the downstream V+L task, we hypothesize that this premise is not necessarily true. Since all VEs differ in architecture, objectives, and pre-training data, the extracted features of multiple different VEs potentially encode complementary information, which can be combined to help improve the performance on the target task.
In this work, we aim to comprehensively analyze multi-VE models and test whether combining VEs is beneficial over a single-VE setup. With this analysis, we aim to provide more insights into whether or not a VE design targeted explicitly at V+L tasks has the potential to outperform repurposed VEs. We focus on three popular classes of VEs in our experiments: 1) object detection models providing a feature representation of salient image parts containing objects (Region) (Ren et al., 2015; Anderson et al., 2018), 2) CNN models computing a feature map of the image as grid features (Grid), and 3) Vision Transformers (ViT) (Dosovitskiy et al., 2021) computing contextualized patch features of the image (Patch). As the impact of different VEs can vary with the downstream domain and task type, we probe all combinations of the three VEs on six different V+L tasks, covering retrieval, question answering, and reasoning.
To further understand the multi-VE model's behavior, we analyze 1) the attention patterns across modalities and encodings, and 2) the dependency on specific VEs when performing VE-dropout during training and inference. We consistently observe improved downstream task performance of the multi-VE over the single-VE setup. Interestingly, however, there is no single combination of VEs that outperforms all other settings. This suggests that the distinct information encoded in the respective VEs is important for different tasks. This is further highlighted when we analyze the model's attention patterns across the different VEs. These results suggest that the model composes the representations of multiple VEs, where a dominant VE is enriched by the complementary VEs.
Overall, our results and analysis suggest that VEs trained with different objectives, architectures, and data can have a high impact on the multimodal model's V+L task performance. Selecting and repurposing off-the-shelf VEs is non-trivial and will likely yield sub-par performance compared to a task-specific composition of multiple VEs (i.e. we cannot rely on simple ensemble effects to improve performance), emphasizing the necessity to design VEs explicitly for V+L tasks in the future.

Related Work
Multimodal Transformer Architectures. Multimodal Transformer architectures can be divided into single-stream and dual-stream models (Bugliarello et al., 2021). The single-stream Transformer takes the concatenated visual and text tokens as input and processes them in a modality-agnostic manner, i.e. the self-attention jointly attends over the tokens of both modalities. Dual-stream models use separate Transformers for each modality that are connected through a co-attention mechanism (Tan and Bansal, 2019; Lu et al., 2019), concatenated in a single-stream model on top (Singh et al., 2022; Kamath et al., 2021), or the image model output is used asymmetrically for cross-attention in the text model.
The Faster R-CNN (Ren et al., 2015) object detector has been the dominant choice for multimodal models as a Region VE, where most methods use it as a static feature extractor (Tan and Bansal, 2019; Lu et al., 2019; Su et al., 2020; Li et al., 2020b; Zhang et al., 2021; Cho et al., 2021), with the notable exception of Su et al. (2020), who backpropagate through the Faster R-CNN model. Less popular VEs are Grid (Huang et al., 2020; Kamath et al., 2021; Yan et al., 2021a; Shen et al., 2022; Eichenberg et al., 2021) and Patch (Wang et al., 2022; Eichenberg et al., 2021). In contrast to Region VEs, Grid and Patch VEs are commonly fine-tuned on the target V+L task, with the notable exception of Yan et al. (2021a). Following Bugliarello et al. (2021) and Hendricks et al. (2021), we focus on single-stream models as they have been shown to perform on par with dual-stream models while being easier to extend to multi-VE setups.
Comparing and Combining VEs. Recently, several works have compared different VEs for V+L tasks. Jiang et al. (2020) compare Region and Grid for visual QA tasks, showing that training data, objectives, and other factors all affect the downstream task performance. Shen et al. (2022) and Eichenberg et al. (2021) compare different pre-trained Grid and Patch VEs building on CLIP (Radford et al., 2021). Zhang et al. (2021) compare different design choices for Region VEs with Grid VEs trained on the same data. Closest to our work is the work by Yan et al. (2021b). While they also experiment with combining representations of Grid, Patch, and Region VEs, they only focus on the Visual Question Answering (VQA; Goyal et al., 2017) dataset and only use the combination of all three VEs. Our work provides a more extensive evaluation of different multi-VE setups while experimenting with six diverse tasks. We confirm their reported improvements but show that different combinations work best for each individual task.
Analysis of Multimodal Transformers. Our analysis methods draw inspiration from recent works that probe and analyze pre-trained multimodal Transformers for a better understanding of their different components (Bugliarello et al., 2021; Li et al., 2020a; Frank et al., 2021; Hendricks et al., 2021). Prior work proposes a range of different probing tasks to understand the inner workings of multimodal models. Li et al. (2020a) analyze how accurately the attention heads of pre-trained models can perform visual grounding. Frank et al. (2021) mask parts of the text and image input and measure how the prediction performance changes for the respective other modality to test how symmetric the learned cross-modal connection is. Bugliarello et al. (2021) and Hendricks et al. (2021) evaluate and disentangle which components of multimodal pre-training proposed in different works are important for their success. While previous work has only focused on models with a Region VE, we also experiment with Grid and Patch VEs.
In summary, our work is the first in-depth study of multimodal Transformers that use multiple VEs.

Multimodal Multi-VE Transformers
Recently, the strategy for solving V+L tasks has been dominated by Transformer-based architectures, where different cross-modal attention mechanisms have been proposed to learn multimodal representations. In this work, we follow Bugliarello et al. (2021) and focus on the single-stream architecture, which shares the attention components across all modalities, i.e. the concatenated visual and text tokens are processed in a modality-agnostic manner. This architecture achieves state-of-the-art results (Bugliarello et al., 2021) while being easily extendable to multiple VEs by concatenating all vision tokens. We illustrate our architecture in Figure 1.

Multimodal input representations. The raw data for a V+L task consists of either discrete tokens/characters (text modality) or high-resolution pixel values (image modality). To extract dense representations of the respective modalities, we follow the standard pre-processing strategies: The text modality is tokenized using word-piece tokenization (Devlin et al., 2019), and the tokens are mapped to their corresponding dense embedding representations. At the input to the first Transformer layer, positional embeddings are added to the respective token embeddings. For the vision modality, pre-trained VEs are utilized which encode the raw pixel values of the respective image into dense high-dimensional feature vectors. These VEs can either encode designated sections (e.g. Region) or an entire image (e.g. Grid and Patch). The extracted feature vectors are then passed through a multi-layer perceptron (MLP), and subsequently into the Transformer. This procedure can be repeated for any number of VEs of interest. In other words, the image features (from multiple VEs) and the text embeddings are concatenated and jointly passed through a shared Transformer model which learns to attend over the multimodal representations.

V+L task training. We place a classification head on the output of the [CLS] token (following Devlin et al. (2019)) and fine-tune the model with cross-entropy loss on the training data of the target task.
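To make the description above concrete, the following PyTorch sketch shows one way such a single-stream multi-VE model could be assembled. The class name, the use of nn.TransformerEncoder, the default VE feature dimensions, and the omission of positional and segment embeddings are all simplifying assumptions, not the exact implementation used in this work.

```python
# Minimal single-stream multi-VE sketch (simplified; dimensions are illustrative).
import torch
import torch.nn as nn


class MultiVETransformer(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, ve_dims=(2054, 2560, 768),
                 num_layers=12, num_heads=12, num_classes=2):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, hidden)
        # One designated, randomly initialized 2-layer MLP per vision encoder.
        self.ve_proj = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, hidden))
            for d in ve_dims
        )
        layer = nn.TransformerEncoderLayer(hidden, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.cls_head = nn.Linear(hidden, num_classes)

    def forward(self, token_ids, ve_features):
        # token_ids: (B, T) word-piece ids with [CLS] at position 0.
        # ve_features: list of (B, N_i, d_i) outputs from the frozen VEs.
        text = self.text_emb(token_ids)
        visual = [proj(f) for proj, f in zip(self.ve_proj, ve_features)]
        x = torch.cat([text] + visual, dim=1)   # one joint multimodal sequence
        x = self.encoder(x)                     # shared self-attention over all tokens
        return self.cls_head(x[:, 0])           # classify from the [CLS] position
```

Training then reduces to a standard cross-entropy loss between the [CLS] logits and the task labels, with the VE backbones that produce ve_features kept frozen.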

Experiments
We evaluate the impact of three different VEs on six downstream V+L tasks to assess the complementarity of different image representations. We experiment with all possible combinations of the three VEs (i.e. single-VE, 2-VE, and 3-VE setups). To fairly compare the information stored in the respective VEs, we only fine-tune the multimodal models on the target V+L task in order to circumvent potentially beneficial information leaking into the multimodal model from auxiliary tasks. We therefore initialize all models with BERT weights (Devlin et al., 2019) (base-size). We note, however, that gains can be achieved when pre-training the multimodal model on auxiliary data prior to fine-tuning on the target V+L task (Tan and Bansal, 2019; Lu et al., 2019, inter alia).

Vision Encoders
We follow the standard approach and repurpose three pre-trained vision models as VEs. In a best-effort attempt at a fair setup, we use the current best publicly available models of similar sizes. Each VE has a designated, randomly initialized 2-layer perceptron (MLP) that maps its representations to the input of the Transformer and is trained on the target V+L task along with the multimodal Transformer weights. We keep the VE weights frozen during training. For a full summary of the VEs, including pre-training data, the number of extracted tokens, and their dimensions, see Table 1.
Region VE. We utilize Faster R-CNN (Ren et al., 2015), an object detection model that outputs a list of bounding boxes and feature vectors for Regions of Interest, salient parts of the image that likely contain an object. Here we select the pre-trained VinVL object detector (Zhang et al., 2021) (not to be confused with the multimodal Transformer in the same work), which outperforms previous object detectors on V+L tasks. We follow Li et al. (2020b) and Zhang et al. (2021) and concatenate each extracted feature vector with the corresponding normalized box coordinates and width/height. We extract the top-36 regions from the VinVL object detector.

Grid VE. Grid VEs linearize the grid feature map of a CNN (before the final pooling or classification layers) into a list of visual tokens. Each visual token corresponds to a specific part of the image, with image features on different scales (through different pooling operations and convolution sizes throughout the CNN) encoded in it. We use adaptive max pooling (https://pytorch.org/docs/1.11/generated/torch.nn.AdaptiveMaxPool2d) on the feature map to reduce the number of tokens to 36 per image. We use the CLIP CNN (RN50x4) (Radford et al., 2021) as initialization, given its recent success on V+L tasks (Shen et al., 2022; Eichenberg et al., 2021; Alayrac et al., 2022).

Patch VE. Patch VEs use the contextualized output representations of a Vision Transformer (ViT) (Dosovitskiy et al., 2021) as visual tokens. The ViT splits an image into uniform patches, which are used as input tokens. Different from a CNN, the ViT tokens are fixed in size throughout the model, but they have a global receptive field through the ViT's attention mechanism. We exclude the ViT's special classification token from the Transformer input. We also utilize the CLIP-based ViT model (ViT-B/32) (Radford et al., 2021) for our Patch VE. We extract all 49 tokens for the CLIP ViT due to their smaller feature dimension size.
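As a rough illustration of how the three VE classes yield visual tokens, consider the sketch below. The helper functions and tensor shapes are assumptions for illustration; the frozen backbones that would produce features, boxes, feature_map, and vit_outputs are not shown.

```python
# Illustrative token extraction for the three VE classes (shapes are assumptions).
import torch
import torch.nn.functional as F


def region_tokens(features, boxes, image_wh):
    # features: (36, d) RoI features; boxes: (36, 4) pixel coords (x1, y1, x2, y2).
    w, h = image_wh
    norm = boxes / torch.tensor([w, h, w, h], dtype=boxes.dtype)
    wh = norm[:, 2:] - norm[:, :2]                  # normalized width / height
    return torch.cat([features, norm, wh], dim=-1)  # (36, d + 6)


def grid_tokens(feature_map, out_hw=(6, 6)):
    # feature_map: (C, H, W) CNN feature map before pooling / classification layers.
    pooled = F.adaptive_max_pool2d(feature_map, out_hw)  # (C, 6, 6) -> 36 locations
    return pooled.flatten(1).transpose(0, 1)             # (36, C) visual tokens


def patch_tokens(vit_outputs):
    # vit_outputs: (1 + 49, d) ViT output sequence; drop the special [CLS] token.
    return vit_outputs[1:]
```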

Tasks
We experiment with a set of six V+L tasks: image-text retrieval (Flickr30k (Young et al., 2014) and MSCOCO (Lin et al., 2014)), visual question answering (GQA (Hudson and Manning, 2019) and VQA2.0 (Goyal et al., 2017)), visual entailment (SNLI-VE (Xie et al., 2019)), and meme classification (Hateful Memes (Kiela et al., 2020)). For all experiments, we report the mean and standard deviation over three random seeds and present training details and hyperparameters in Appendix A. We train all models on a single Nvidia V100 GPU. Training one model for all tasks and three seeds takes approximately 10 GPU days in total.

Results & Discussion
We report the results on the six tasks with all possible combinations of VEs in Table 2.

Table 2: Mean and standard deviation over three runs on each task. We report AUROC for Hateful Memes, accuracy for all other non-retrieval tasks, and for the retrieval tasks the mean recall@1 over image-to-text and text-to-image retrieval. We bold the best single- and multi-VE setups and underline the overall best score. We also report the results for VE-Dropout training (see §5.5).
One possible explanation is that the object-centric regions are useful for QA tasks, which focus on specific elements, while the uniform grid encoding might be useful for retrieval and other tasks that look at the entire image. We leave the investigation of why certain VEs are useful for specific tasks, and of the role of training objectives, data, and architecture, to future work. Interestingly, the Patch VE never achieves the best performance, which aligns with previous findings by Shen et al. (2022) and Eichenberg et al. (2021). These single-VE results demonstrate that each VE encodes different types of information, which has an impact on the downstream V+L task performance.

VEs can complement each other. When combining the representations from different VEs, we witness improvements across all V+L tasks. Interestingly, MSCOCO benefits greatly from combining the two weakest VEs (i.e. Region and Patch), surpassing their corresponding single-VE results by 7.94 and 12.43 points respectively and achieving the best performance on this task. Although the Patch VE never achieves the best performance in single-VE setups, it provides complementary information in combination with the best performing VE, achieving the best overall performance for many tasks. However, simply using more encoders does not guarantee improvements, as is evident from the 3-encoder model: while it consistently achieves near-best results, it is rarely the best model (only on 1 out of 6 tasks), i.e. performance does not improve monotonically with more VEs. Hence, it is unlikely that the improvements are due to an ensemble effect.
One does not fit all. In summary, we see that neither one VE alone nor a fixed combination of VEs gives the best results for the entire breadth of V+L tasks. This shows the current limitations of repurposed vision encoders and highlights the need for encoders designed specifically for V+L tasks.

Analysis
In order to better understand how the representations are combined in different multi-VE setups, we analyze the flow of attention, phrase-to-image grounding, and the robustness to dropping VEs at test time. For simplicity, we overload 'cross-modality' to include both VE-text and VE-VE interactions. In what follows, we present the analysis for the best performing model combinations and provide the full list of results in Appendix B.

CLS Attention Flow
The CLS token can be seen as the fused representation of the modalities that is used for the final classification. We can thus estimate which VEs are important for classification by considering which VEs the CLS token attends to. Following prior work, we compute the sum of attention from the CLS token to each modality and then average those scores over all heads. We present the CLS attention for the best multi-VE setups in Figure 2. We see that for most tasks, the VE which performed best in the single-VE setup receives the majority of the VE-attention. This suggests that one VE dominates in multi-VE setups while the others are complementary.

Figure 2: CLS attention (in %) to each modality/VE, averaged over all heads. We add an outline to the VE with the best single-VE results. Numbers do not add up to 100% because of CLS self-attention. We present the best multi-VE results here and all other results in Figure 8 in the Appendix.
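A small sketch of how this CLS-attention statistic could be computed from a layer's attention weights. The tensor layout and the assumption of contiguous per-modality index ranges are ours, not taken from the paper's code.

```python
# Sum of CLS attention per modality, averaged over heads (illustrative sketch).
import torch


def cls_attention_per_modality(attn, spans):
    # attn: (num_heads, seq_len, seq_len) attention weights of one layer;
    #       row 0 holds the queries of the [CLS] token.
    # spans: dict mapping modality name -> (start, end) token index range.
    cls_row = attn[:, 0, :]                                    # (num_heads, seq_len)
    return {name: cls_row[:, s:e].sum(dim=-1).mean().item()   # sum tokens, avg heads
            for name, (s, e) in spans.items()}
```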

Cross-Modal Attention Flow
The attention flow between the different modalities can indicate which VEs are used by the model to reason over the input. We assume that more attention to a modality suggests that it contains useful information for others. We compute the average attention flow between two modalities $M$ and $N$ for an attention head as $\frac{1}{|M|}\sum_{m \in M, n \in N} a_{m \to n}$, with $a_{m \to n}$ being the attention weight from token $m$ to token $n$ (we exclude the CLS token for this metric). We average over all heads to aggregate.
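Under the assumption that each modality occupies a contiguous index range in the joint sequence, this metric could be computed as follows; the function and argument names are illustrative.

```python
# Average attention flow from modality M to modality N, averaged over heads.
import torch


def attention_flow(attn, src_span, tgt_span):
    # attn: (num_heads, seq_len, seq_len); attn[h, m, n] is the attention from
    # token m to token n in head h (CLS excluded from the spans).
    s0, s1 = src_span
    t0, t1 = tgt_span
    # (1/|M|) * sum_{m in M, n in N} a_{m->n}, computed per head, then head-averaged.
    per_head = attn[:, s0:s1, t0:t1].sum(dim=-1).mean(dim=-1)  # (num_heads,)
    return per_head.mean().item()
```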
We present the attention flow for the best multi-VE setups for each task in Figure 3. Similar to the CLS attention, the majority of attention is paid to the VE that achieved the better results in the single-VE experiment for that task.

Overlapping Token Surplus Attention
The attention flow between different VEs' visual tokens that overlap, i.e. encode the same part of the image, can tell us if the model combines the VEs to complement the image representation. For each attention head, we therefore compute the average per-token attention from a token $t$ of one VE to the overlapping tokens $I_{|t}$ of another VE and compare it, i.e. compute the surplus, to the non-overlapping tokens $I_{\setminus t}$ of that VE as $\left(\frac{1}{|I_{|t}|}\sum_{i \in I_{|t}} a_{t \to i}\right) - \left(\frac{1}{|I_{\setminus t}|}\sum_{i \in I_{\setminus t}} a_{t \to i}\right)$, with $a_{t \to i}$ being the attention weight between the tokens. We average over all tokens to obtain the surplus attention of an attention head for a VE pair.
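A possible implementation of this surplus-attention statistic for a single head; how token overlap is determined (e.g. from box and grid coordinates) is left out, and all names are assumptions.

```python
# Surplus attention to overlapping vs. non-overlapping tokens of another VE.
import torch


def surplus_attention(attn, src_idx, tgt_idx, overlap):
    # attn: (seq_len, seq_len) attention weights of one head.
    # src_idx: token indices of the source VE; tgt_idx: 1-D LongTensor of the
    # target VE's token indices; overlap[k]: boolean mask over tgt_idx marking
    # which target tokens spatially overlap the k-th source token.
    surpluses = []
    for k, t in enumerate(src_idx):
        ov = overlap[k]
        if ov.any() and (~ov).any():   # need both overlapping and non-overlapping tokens
            a = attn[t, tgt_idx]
            surpluses.append(a[ov].mean() - a[~ov].mean())
    return torch.stack(surpluses).mean().item() if surpluses else 0.0
```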
We present the results for the best performing setups in Figure 4 (the remaining results can be found in Figure 11 in the Appendix). For most VEs we can identify heads, indicated by larger dot sizes, which attend particularly to another VE's tokens. For most settings, these heads are also the ones which attend to the overlapping tokens of the respective other VE. While we witness a large surplus in attention for overlapping tokens between Region and Grid/Patch, this is not the case for Grid-Patch. This indicates that the complementarity of the Region features with the respective other VEs is higher, which provides more evidence that training the VEs on different data and objectives is important.

Visual Entity Grounding
Visual grounding is the task of matching text phrases to their corresponding parts in the image.
To analyze whether or not there are dominant VEs which learn to ground, we follow Li et al. (2020a) and count how often the highest attention weight from a text phrase is assigned to the corresponding visual token. We use the gold phrase-to-bounding-box annotations available for Flickr30k (Plummer et al., 2015) and GQA. Formally, a head correctly grounds a phrase to the gold box $g$ if the maximum attention from the last phrase token $t$ to any of a VE's tokens $I$ goes to a token in $I_{|g}$, the set of tokens overlapping with the gold box, i.e. if $\arg\max_{i \in I} a_{t \to i} \in I_{|g}$. We calculate the accuracy by counting the number of correct groundings over the phrases for which $I_{|g}$ is not empty.
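A sketch of this grounding-accuracy computation for one attention head; the data structures (index tensors, overlap sets) are illustrative assumptions.

```python
# Grounding accuracy of one head for one VE (illustrative sketch).
import torch


def grounding_accuracy(attn, phrase_last_tokens, ve_idx, gold_overlap):
    # attn: (seq_len, seq_len) attention weights of one head.
    # phrase_last_tokens: index of the last token of each annotated phrase.
    # ve_idx: 1-D LongTensor with the VE's visual-token indices.
    # gold_overlap[p]: set of VE token indices overlapping phrase p's gold box.
    correct, total = 0, 0
    for p, t in enumerate(phrase_last_tokens):
        if not gold_overlap[p]:
            continue                                   # skip phrases without overlap
        total += 1
        best = ve_idx[int(attn[t, ve_idx].argmax())]   # argmax_i a_{t -> i}
        correct += int(int(best) in gold_overlap[p])
    return correct / max(total, 1)
```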
We report the results for all heads of the best GQA and Flickr30k models in Figure 5. We can see that the accuracy of the dominant VE (Grid and Region, respectively) is generally higher than for Patch. While there is a clear pattern of the dominant VE achieving significantly higher accuracy, the complementary VE also achieves accuracies beyond 10%, indicating that the model learns to reason over, and utilize, both VEs' representations.

VE-Dropout
The preceding analyses suggest that one dominant VE has the largest impact on the performance on the target task. To further evaluate the importance of the respective VEs, we experiment with dropping all VE-specific features at test time. The results in Figure 6 show that dropping the dominant encoder results in a catastrophic performance decrease, especially for the retrieval tasks: while QA and reasoning tasks lose 20% to 40% in performance, R@1 on the retrieval tasks decreases by almost 100%.

However, the detrimental effect of dropping the VE features at test time might be a result of the multi-VE models never being trained for this setting. Consequently, we train the 2-encoder models with VE-wise dropout per batch, with a uniform distribution over dropping the first, second, or no VE. We hypothesize that this forces the model to take the complementary VE into account more while being more robust during inference. As reported in Figure 7, the robustness to dropping VEs improves; however, we see a slight drop in the final task performance, as reported in Table 2. We notice no significant changes in the attention patterns after VE-Dropout training (see Appendix C).

Figure 7: Relative performance decrease of 2-encoder models, as in Figure 6, after VE-Dropout training. Other tasks are shown in Figure 12 in the Appendix.

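The VE-wise dropout described above could be implemented along the following lines; whether a dropped VE's tokens are omitted or zeroed is not specified in the text, so omitting them here is an assumption.

```python
# Per-batch VE-wise dropout for the 2-encoder setup (illustrative sketch).
import random


def ve_dropout_batch(ve_features):
    # ve_features: list with the two VEs' token tensors for one batch.
    # Uniformly choose to drop the first VE, the second VE, or neither,
    # and simply omit the dropped VE's tokens from the joint input sequence.
    choice = random.choice([0, 1, None])
    return [f for i, f in enumerate(ve_features) if i != choice]
```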

Discussion and Future Directions
Our analyses demonstrate that, while combining multiple VEs consistently outperforms single-VE setups, there is no single VE or fixed strategy for combining VEs that works best for all tasks.
In particular, simply ensembling all VEs is rarely the optimal choice; consequently, the best-performing combination of VEs needs to be identified for each individual task (§4.3). By further analyzing the attention patterns, we find a clearly dominating VE (§5.5, Figures 6 & 7) that both the [CLS] token (§5.1, Figure 2) and the multimodal tokens (§5.2, Figure 3) predominantly attend to, whereas the secondary VEs provide complementary information, supporting the model's overall performance. The complementarity of VEs is highlighted by analyzing the cross-modal attention patterns for overlapping parts of the image (§5.3, Figure 4). VEs trained on different data and objectives (e.g. Grid and Region) cross-attend to the tokens of the respective other VE that encode the same parts of the image, aggregating their information. Further, the model learns to visually ground the text representations to all VEs, as demonstrated by their attention patterns (§5.4, Figure 5), and the text modality aggregates information from the different VEs. In summary, our results indicate that VEs trained on different data and objectives encode complementary information, resulting in improvements over approaches which only utilize a single VE. This indicates that VEs explicitly designed for V+L tasks, e.g. by incorporating more diverse training data and objectives during pre-training, have the potential to significantly impact the performance on the target V+L tasks.

Conclusion
In this work, we investigated whether different VEs, based on repurposed pre-trained vision models, encode complementary information that improves the performance on downstream V+L tasks. We experimented with three popular VE classes, Region, Grid, and Patch, and trained models with all possible combinations on six different V+L tasks. By combining VEs, we are able to consistently improve over single-VE setups. When further analyzing the attention patterns, we found that diverse VEs encode complementary information, which motivates future work on designing VEs explicitly for V+L tasks, e.g. by incorporating more diverse datasets and training objectives.

A Training and Hyperparameters
We report the hyperparameters along with the task-specific training details.

A.1 Hyperparameters
We report our hyperparameters in Tables 3 and 4. For each task, we select the learning rate from {2e-5, 3e-5, 5e-5} based on the best validation performance of the model trained with all three VEs. We train all VE combinations for one task with the same hyperparameters. We use the training checkpoint with the best validation performance (computed each epoch) for testing.

A.2 Task Details
We describe the training details for each task.
Flickr30k, MSCOCO: We use a cross-encoder following Li et al. (2020b) and Zhang et al. (2021). The model receives the caption paired either with its corresponding image or with a random image (each with 50% chance). The task is to predict whether the caption and image match. We use cross-entropy loss as the training objective. During evaluation, we compute the logits for all possible image-caption pairs and use these scores to rank the candidates and compute recall@k. We evaluate MSCOCO on the 1k-image test set.
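A hedged sketch of this matching objective; the batch-sampling helper and variable names are assumptions, not the exact data pipeline used here.

```python
# 50/50 positive/negative sampling for the image-text matching objective.
import random
import torch


def sample_retrieval_batch(captions, images):
    # captions, images: aligned lists; label 1 = matching pair, 0 = mismatched.
    pairs, labels = [], []
    for i, cap in enumerate(captions):
        if random.random() < 0.5:
            pairs.append((cap, images[i]))
            labels.append(1)
        else:
            j = random.choice([k for k in range(len(images)) if k != i])
            pairs.append((cap, images[j]))
            labels.append(0)
    return pairs, torch.tensor(labels)

# Training: logits = model(caption, image) with two classes; loss = cross-entropy.
# At evaluation time, the matching logit ranks all image-caption pairs for recall@k.
```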
GQA, VQA, SNLI-VE, Hateful Memes: We train all four tasks as standard classification tasks with cross-entropy loss. For GQA, each class corresponds to a label appearing in the train, test, or validation set. We also use the balanced training data for GQA, as it produces similar results to the much larger unbalanced training set at a fraction of the training time. For VQA, we follow Li et al. (2020b) and use the top-3000 labels for classification, and we train the model with a multi-label objective using the relevance scores as soft probabilities. For testing, we use the maximum logit as the single predicted class.
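One common way to realize such a multi-label objective with soft relevance scores is a per-answer binary cross-entropy, sketched below; this is an assumption about the loss, not a statement of the exact implementation.

```python
# Multi-label VQA objective with soft relevance scores (illustrative sketch).
import torch
import torch.nn.functional as F


def vqa_soft_label_loss(logits, soft_targets):
    # logits: (B, 3000) scores over the top-3000 answer labels.
    # soft_targets: (B, 3000) VQA relevance scores in [0, 1] used as soft labels.
    return F.binary_cross_entropy_with_logits(logits, soft_targets)


# At test time, the single predicted class is the maximum logit:
# prediction = logits.argmax(dim=-1)
```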

B Full Analysis Results
We present the full results for all VE combinations from the analysis of §5. Figure 8 shows the CLS attention, Figure 9 the attention flow, Figure 10 the surplus attention for overlapping tokens, and Figure 11 the visual grounding.

C Analysis Results after VE-Dropout Training
We present the full results for all VE combinations from the analysis of §5 after VE-Dropout training. Figure 12 shows the results for VE-Dropout at test time, Figure 14 the CLS attention, Figure 15 the attention flow, Figure 16 the surplus attention for overlapping tokens, and Figure 13 the visual grounding.