Distilled Dual-Encoder Model for Vision-Language Understanding

On vision-language understanding (VLU) tasks, fusion-encoder vision-language models achieve superior results but sacrifice efficiency because they encode images and text simultaneously. In contrast, the dual-encoder model that encodes images and text separately has an advantage in efficiency, but falls short on VLU tasks due to the lack of deep cross-modal interactions. To get the best of both worlds, we propose DiDE, a framework that distills the knowledge of a fusion-encoder teacher model into a dual-encoder student model. Since cross-modal interaction is the key to the teacher's superior performance but is absent in the student, we encourage the student not only to mimic the teacher's predictions, but also to compute cross-modal attention distributions and align them with the teacher's. Experimental results demonstrate that DiDE is competitive with the fusion-encoder teacher model in performance (only a 1% drop) while enjoying 4 times faster inference. Further analyses reveal that the proposed cross-modal attention distillation is crucial to the success of our framework.


Introduction
Vision-Language (VL) pre-trained models (Lu et al., 2019; Tan and Bansal, 2019; Su et al., 2020; Zhang et al., 2021; Radford et al., 2021) learn cross-modal representations on large-scale image-text pairs and can be directly fine-tuned to adapt to various downstream VL tasks, such as vision-language understanding/classification (visual reasoning (Suhr et al., 2019), visual question answering (Goyal et al., 2017), etc.) and image-text retrieval (Young et al., 2014). Based on how they model cross-modal interactions, these models can be divided into two categories.
The first category is fusion-encoder models (Lu et al., 2019), which employ an effective but less efficient Transformer (Vaswani et al., 2017) encoder to capture image and text interactions with cross-modal attention. Most models of this category (Lu et al., 2019; Zhang et al., 2021) rely on an off-the-shelf object detector to extract image region features, which further impedes their efficiency. Recently, ViLT (Kim et al., 2021) drops the detector and directly encodes image patches like text tokens with a Vision Transformer (Dosovitskiy et al., 2021). It achieves competitive performance on VL understanding and retrieval tasks while improving efficiency. However, the Transformer-based cross-modal interaction remains an efficiency bottleneck due to the need to encode images and text simultaneously, limiting its application in tasks with massive image or text candidates.
The second category of works, including CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), employs a dual-encoder architecture to encode images and text separately. The cross-modal interactions are modeled via a shallow fusion module, usually a multi-layer perceptron (MLP) network or a dot product, which is extremely light compared to the Transformer encoder in fusion-encoder models. In addition, the disentangled encoding enables offline computing and caching of image and text candidates, which scales well to massive candidate sets. These properties lead to much faster inference in both understanding and retrieval tasks, making such models practical in real-life scenarios. Dual-encoder models achieve promising performance on image-text retrieval tasks. However, they fall far behind fusion-encoder models on vision-language understanding tasks, e.g., NLVR2, that require complex cross-modal reasoning.
In this work, we propose a cross-modal attention distillation framework to train a dual-encoder vision-language model. The distilled dual-encoder model achieves competitive performance on vision-language understanding tasks with a much faster inference speed than fusion-encoder models. In addition to the soft label distillation (Hinton et al., 2015), we introduce the cross-modal attention distillation as fine-grained supervision for the dual-encoder model (student) to better learn cross-modal reasoning. Specifically, we employ both image-to-text and text-to-image attention distributions from the fusion-encoder model (teacher) for distillation. Our distillation framework can be applied to both pre-training and fine-tuning stages. During pre-training, we apply distillation objectives in image-text contrastive learning and image-text matching tasks. In the fine-tuning stage, the task-specific knowledge of the fine-tuned teacher model is transferred to the student model.
We evaluate our models on vision-language understanding tasks and image-text retrieval tasks. Experimental results demonstrate that our distilled dual-encoder models perform competitively, retaining 99.9% of the teacher's performance on visual entailment, 97.8% on visual reasoning, and 95.5% on visual question answering, while achieving more than 3× faster inference than the fusion-encoder teacher model. Moreover, our proposed cross-modal attention distillation also improves the performance on retrieval tasks and even outperforms the teacher model on image retrieval. Compared to other latent features, cross-modal attention helps the dual-encoder model learn better cross-modal reasoning ability, achieving significant gains across VL understanding tasks. Besides, the model obtained from two-stage distillation performs better than the one distilled in a single stage.
Related Work

The first line of works (Lu et al., 2019; Tan and Bansal, 2019; Zhang et al., 2021) uses a fusion encoder to learn cross-modal interactions. These models first encode image-text pairs into vectors and then use a multi-layer Transformer (Vaswani et al., 2017) network to fuse the visual and textual representations. Most previous models extract visual features via an object detector (e.g., Faster R-CNN (Ren et al., 2015)), which needs to be pre-trained on expensive annotated datasets with a fixed set of object classes, such as Visual Genome (Krishna et al., 2017). In addition, the object detector requires high-resolution input images and brings more computation cost. Recently, some works (e.g., Xu et al., 2021) directly take image pixels as inputs and feed them into convolutional neural networks to obtain visual grid features instead of region features. Others employ a Vision Transformer (Dosovitskiy et al., 2021) to extract image features for the multimodal fusion encoder. ViLT directly encodes image patches with a simple embedding layer; a multimodal Transformer then jointly encodes the visual and textual embeddings. It achieves competitive performance on VL tasks with less overhead. Fusion-encoder models exhibit strong cross-modal modeling ability and achieve superior results on VL understanding tasks that require complex cross-modal reasoning, such as NLVR2 (Suhr et al., 2019). However, fusion-encoder models still rely on a cross-modal Transformer to encode and fuse visual and textual representations simultaneously across layers, demanding a heavy computation budget and leading to low inference speed.
Another line of work (Radford et al., 2021; Jia et al., 2021; Sun et al., 2021) employs a dual-encoder architecture to encode images and text separately, and uses a dot product or an MLP network to model the interactions between images and text. Dual-encoder models have the advantage of computational efficiency compared to fusion-encoder models. The multi-head attention mechanism is only applied within tokens of the same modality, which reduces the complexity from $O((N+M)^2)$ for fusion-encoder models to $O(N^2 + M^2)$, where $N$ and $M$ are the lengths of the visual and textual features, respectively. Moreover, the visual and textual representations can be pre-computed and cached in real-life applications thanks to the independent encoders. Dual-encoder models achieve promising performance on image-text retrieval. However, the shallow interaction module is insufficient for complex VL understanding tasks that require deeper cross-modal interactions, causing significant performance degradation. To improve dual-encoder models on such tasks, we introduce a cross-modal attention distillation framework that helps the model learn deeper interactions.

Knowledge Distillation
Knowledge distillation (KD) aims to transfer the knowledge learned by a strong teacher model to a student model, making the student perform competitively. Hinton et al. (2015) employ the soft label distributions from the teacher model to train the student model. Recently, student models have been further improved by mimicking the intermediate representations of the teacher, such as hidden states (Romero et al., 2015) and attention distributions (Zagoruyko and Komodakis, 2017). Knowledge distillation has also been widely used to compress and improve Transformer-based models across various tasks (Sun et al., 2019; Jiao et al., 2020; Wang et al., 2020a,b, 2021b; Touvron et al., 2021). In this work, we employ the cross-modal attention knowledge of a fusion-encoder teacher model to guide the training of the dual-encoder model. Our distillation framework improves the dual-encoder model on complex VL understanding tasks.

Method
In this section, we describe our cross-modal attention distillation framework for training the dual-encoder model. Figure 1 gives an overview of our method. We adopt a fusion-encoder model as the teacher and use its cross-modal attention knowledge and soft labels to train the dual-encoder student model. The distillation objectives are applied in both the pre-training and fine-tuning stages and help the dual-encoder model learn interactions between modalities.

Model Overview
Our distillation framework can employ different fusion-encoder models as the teacher. In this work, we adopt ViLT as the teacher model to conduct experiments, given its simplicity and efficiency.
Input Representations Given an image-text pair $(v, t)$ as the input, we slice the image $v \in \mathbb{R}^{H \times W \times C}$ into patches $v_p \in \mathbb{R}^{N \times (P^2 C)}$, where $N = HW/P^2$ is the number of patches, $(H, W)$ is the input image resolution, $(P, P)$ is the resolution of each patch, and $C$ is the number of channels. The input text $t$ is tokenized into a sequence of $M$ subword tokens by WordPiece (Wu et al., 2016) as in BERT (Devlin et al., 2019). We then prepend the special tokens [I_CLS] and [T_CLS] to the sequence of image patches and text subword tokens, respectively.
We linearly project the image patches $v_p$ to obtain patch embeddings, and the final visual input embeddings $H_0^v \in \mathbb{R}^{(N+1) \times D}$ are obtained by summing the patch embeddings, visual position embeddings, and visual type embeddings. Similarly, the textual input embeddings $H_0^t \in \mathbb{R}^{(M+1) \times D}$ are obtained by summing the word embeddings, textual position embeddings, and textual type embeddings. We take $H_0^v$ and $H_0^t$ as the visual and textual inputs for the teacher and student models.
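The patch-slicing step above can be made concrete with a short sketch. The code below is illustrative only (the function name and the explicit reshape are assumptions; an actual implementation may use a convolutional patch projection instead): it cuts an (H, W, C) image into $N = HW/P^2$ flattened patches of length $P^2 C$.

```python
import torch

def slice_into_patches(image: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Slice an (H, W, C) image into N = HW / P^2 flattened patches (sketch only)."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image size must be divisible by patch size"
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C)
    patches = image.reshape(H // P, P, W // P, P, C).permute(0, 2, 1, 3, 4)
    # Flatten each patch to a vector of length P*P*C, giving shape (N, P^2 * C).
    return patches.reshape(-1, P * P * C)

# Example: a 384x640 RGB image with 32x32 patches yields N = 240 patches.
img = torch.rand(384, 640, 3)
v_p = slice_into_patches(img, 32)
print(v_p.shape)  # torch.Size([240, 3072])
```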
Teacher: Fusion-Encoder Model

The teacher model concatenates the visual and textual input embeddings into $H_0^{vt} = [H_0^v; H_0^t]$, and the vectors are then fed into an $L$-layer cross-modal Transformer encoder to obtain contextual representations:
$$H_l^{vt} = \mathrm{Transformer}_l(H_{l-1}^{vt}), \quad l \in [1, L].$$
The cross-modal Transformer encoder fuses representations of different modalities via the multi-head attention mechanism. Specifically, for each head $a \in [1, A_h]$ of layer $l$, the attention distribution $A_{l,a}^{vt}$ is computed via:
$$A_{l,a}^{vt} = \mathrm{softmax}\!\left(\frac{Q_{l,a}^{vt}\, {K_{l,a}^{vt}}^{\top}}{\sqrt{d_k}}\right),$$
where the queries $Q_{l,a}^{vt}$ and keys $K_{l,a}^{vt}$ are obtained by linearly projecting the last layer's hidden states using parameters $W_{l,a}^{Q}, W_{l,a}^{K} \in \mathbb{R}^{D \times d_k}$, respectively, and $d_k$ is the attention head size. The output vectors of the [I_CLS] and [T_CLS] tokens of the last layer are fed into task-specific layers to obtain predictions.
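For reference, a minimal sketch of how one attention head produces a distribution over the concatenated image and text tokens; tensor names, shapes, and the single-head simplification are illustrative rather than taken from the ViLT implementation.

```python
import torch
import torch.nn.functional as F

def attention_distribution(h_prev, w_q, w_k):
    """Attention distribution of one head over concatenated image+text tokens.

    h_prev: (N + M, D) hidden states of the previous layer (illustrative).
    w_q, w_k: (D, d_k) projection parameters of this head.
    Returns an (N + M, N + M) row-stochastic matrix.
    """
    q = h_prev @ w_q                                  # (N + M, d_k)
    k = h_prev @ w_k                                  # (N + M, d_k)
    d_k = q.shape[-1]
    scores = q @ k.transpose(-1, -2) / d_k ** 0.5     # scaled dot-product scores
    return F.softmax(scores, dim=-1)                  # each query attends over all tokens

# Toy example with N = 4 image tokens, M = 3 text tokens, D = 8, d_k = 2.
h = torch.rand(7, 8)
A = attention_distribution(h, torch.rand(8, 2), torch.rand(8, 2))
print(A.shape, A.sum(dim=-1))  # torch.Size([7, 7]), each row sums to 1
```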

Student: Dual-Encoder Model
The dual-encoder model encodes the visual embeddings $H_0^v$ and textual embeddings $H_0^t$ separately via visual and textual Transformer-based encoders. The output vectors of the [I_CLS] and [T_CLS] tokens of the last layer are used as the final representations of images and text. We adopt a shallow module $f$ to fuse the two representations. For vision-language understanding tasks such as VQA, the module $f$ is an MLP network. For image-text retrieval, we use the dot product to obtain similarity scores of image-text pairs.
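The separate encoding and shallow fusion can be sketched as below, under the assumptions that each encoder returns token-wise hidden states with the [CLS] vector at index 0 and that the understanding-task head is a two-layer MLP; all module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class DualEncoderSketch(nn.Module):
    """Minimal sketch of separate encoding plus a shallow fusion module f."""

    def __init__(self, visual_encoder, textual_encoder, dim=768, num_classes=2):
        super().__init__()
        self.visual_encoder = visual_encoder    # stands in for the visual Transformer
        self.textual_encoder = textual_encoder  # stands in for the textual Transformer
        # Shallow fusion head f for understanding tasks: MLP over both [CLS] vectors.
        self.fusion_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, num_classes)
        )

    def forward(self, visual_embeds, textual_embeds, task="understanding"):
        img_cls = self.visual_encoder(visual_embeds)[:, 0]    # [I_CLS] representation
        txt_cls = self.textual_encoder(textual_embeds)[:, 0]  # [T_CLS] representation
        if task == "understanding":
            return self.fusion_mlp(torch.cat([img_cls, txt_cls], dim=-1))
        # Retrieval: dot-product similarity between image and text representations.
        return (img_cls * txt_cls).sum(dim=-1)
```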

Distillation Objectives
Cross-Modal Attention Distillation In order to improve dual-encoder models for capturing deeper interactions of images and text, we employ the cross-modal attention knowledge of fusion-encoder models to guide the training of dual-encoder models. Specifically, we use image-to-text and text-to-image attention distributions to train the dual-encoder model.
The fusion-encoder teacher model captures cross-modal interactions via the multi-head attention mechanism described above. The whole attention distribution of the teacher, $A_T^{vt} \in \mathbb{R}^{(N+M) \times (N+M)}$, can be split into two parts, where $N$ and $M$ denote the lengths of the image and text inputs. The first part is uni-modal attention ($A_T^{v2v} \in \mathbb{R}^{N \times N}$, $A_T^{t2t} \in \mathbb{R}^{M \times M}$), which models interactions within tokens of the same modality. The second part is cross-modal attention, including the image-to-text attention distribution ($A_T^{v2t} \in \mathbb{R}^{N \times M}$) and the text-to-image attention distribution ($A_T^{t2v} \in \mathbb{R}^{M \times N}$). Cross-modal attention distributions capture interactions of visual and textual feature vectors. Since the separate encoding of dual encoders only models interactions of tokens of the same modality, we introduce cross-modal attention distillation to encourage the dual-encoder model to mimic the image and text alignments of the fusion-encoder model. The cross-modal (image-to-text and text-to-image) attention distributions of the dual-encoder model, $A_S^{v2t}$ and $A_S^{t2v}$, are computed as follows:
$$A_S^{v2t} = \mathrm{softmax}\!\left(\frac{Q_S^{v}\, {K_S^{t}}^{\top}}{\sqrt{d_k}}\right), \quad A_S^{t2v} = \mathrm{softmax}\!\left(\frac{Q_S^{t}\, {K_S^{v}}^{\top}}{\sqrt{d_k}}\right),$$
where $Q_S^{v}$, $K_S^{v}$ are the visual queries and keys of the self-attention module, and $Q_S^{t}$, $K_S^{t}$ are the queries and keys of the textual input. We recompute the cross-modal attention distributions of the teacher, $A_T^{v2t}$ and $A_T^{t2v}$, in the same way, instead of directly splitting the original attention distributions $A_T^{vt}$. The cross-modal attention distillation loss is computed via:
$$\mathcal{L}_{attn} = D_{\mathrm{KL}}\!\left(A_T^{v2t} \,\|\, A_S^{v2t}\right) + D_{\mathrm{KL}}\!\left(A_T^{t2v} \,\|\, A_S^{t2v}\right),$$
where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. Inspired by Wang et al. (2020b), we only transfer the cross-modal attention knowledge of the last layer of the teacher model.
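As a sketch, the image-to-text half of this loss could be computed as below, assuming single-head query/key matrices and a KL term averaged over the image-token rows; the text-to-image half is symmetric. Function names and the exact reduction are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def cross_modal_attn_kl(q_teacher_v, k_teacher_t, q_student_v, k_student_t, d_k):
    """KL term for the image-to-text direction of cross-modal attention distillation.

    Inputs are (num_tokens, d_k) query/key matrices of a single head. Teacher
    attention is recomputed from its queries and keys, as described in the text.
    """
    a_teacher = F.softmax(q_teacher_v @ k_teacher_t.T / d_k ** 0.5, dim=-1)       # (N, M)
    log_a_student = F.log_softmax(q_student_v @ k_student_t.T / d_k ** 0.5, dim=-1)
    # KL(teacher || student), averaged over the N image-token rows.
    return F.kl_div(log_a_student, a_teacher, reduction="batchmean")

# Toy shapes: N = 4 image tokens, M = 3 text tokens, head size d_k = 2.
loss_v2t = cross_modal_attn_kl(
    torch.rand(4, 2), torch.rand(3, 2), torch.rand(4, 2), torch.rand(3, 2), d_k=2
)
```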

Soft Label Distillation
In addition to mimicking the cross-modal attention distributions, we also use predictions from the teacher model as soft labels to improve the student. The soft label loss is computed as:
$$\mathcal{L}_{soft} = D_{\mathrm{KL}}\big(\mathrm{softmax}(z_T) \,\|\, \mathrm{softmax}(z_S)\big),$$
where $z_S$ and $z_T$ are the predicted logits of the student and teacher, respectively.
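A temperature-free sketch of this objective follows; whether a distillation temperature or a different divergence is used is not specified here, so this is an illustrative form only.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits):
    """Soft label distillation: match the student's predicted distribution
    to the teacher's (plain, temperature-free sketch)."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Toy example with a batch of 2 and 3 answer classes.
loss = soft_label_loss(torch.randn(2, 3), torch.randn(2, 3))
```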

Two-Stage Distillation Framework
We use the proposed knowledge distillation objectives to train the dual-encoder student model under a two-stage framework, including pre-training distillation and fine-tuning distillation. In both stages, the fusion-encoder model helps the dual-encoder model to learn cross-modality interactions. As shown in Table 1, we train the model with different objectives according to the characteristics of the tasks.

Pre-Training Distillation
During pre-training, the dual-encoder student model is trained on large-scale image-text pairs to learn generic cross-modal representations with image-text matching, image-text contrastive learning, and masked language modeling tasks. The pre-trained fusion-encoder model ViLT is used as the teacher model.

Image-Text Matching (ITM)
The goal of image-text matching is to predict whether the input image and text are matched. Following ViLT, we replace the matched image with a different image with probability 0.5 to construct negative pairs. We employ the cross-modal attention distillation loss over the ITM input pairs and the soft label loss to train the dual-encoder model.
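A toy sketch of the negative-pair construction described above; the sampling pool, labels, and function name are illustrative assumptions rather than details from the paper.

```python
import random

def make_itm_pair(image, text, image_pool, neg_prob=0.5):
    """Build one ITM example: with probability 0.5 the matched image is swapped
    for a different image from the pool, yielding a negative pair (sketch only)."""
    if random.random() < neg_prob:
        negative = random.choice([img for img in image_pool if img is not image])
        return negative, text, 0  # label 0: not matched
    return image, text, 1         # label 1: matched
```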

Image-Text Contrastive Learning (ITC)
We introduce a contrastive loss with in-batch negative sampling to optimize the shared space of visual and textual representations. Given a batch of $N$ image-text pairs, we can obtain $N$ matched pairs and $N^2 - N$ negative pairs. Image-text contrastive learning aims to predict matched pairs from all possible pairs. The fusion-encoder model requires joint encoding of each pair to obtain soft labels, which results in quadratic time complexity. Therefore, we apply cross-modal attention distillation with ground truth labels for this task for training efficiency. Specifically, we only consider the cross-modal attention distributions computed on the $N$ matched pairs.
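A common way to realize such an in-batch contrastive objective is sketched below; the symmetric cross-entropy form, feature normalization, and temperature value are assumptions of this sketch rather than details reported in the paper.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Image-text contrastive loss with in-batch negatives: the N matched pairs
    are positives, the other N^2 - N combinations serve as negatives."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.T / temperature   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0))               # matched pair is the diagonal
    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```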

Masked Language Modeling (MLM)
The goal of masked language modeling is to recover the masked tokens from all the other unmasked tokens.
We use a 15% masking probability as in BERT (Devlin et al., 2019). To improve training speed, we use ground truth labels to train the model for the MLM task.
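For concreteness, a BERT-style masking sketch follows; the 80/10/10 replacement split follows BERT and is assumed here rather than stated in the text.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """BERT-style masking sketch: each token is selected with 15% probability;
    selected tokens are replaced by [MASK] 80% of the time, a random token 10%,
    and kept unchanged 10% (the 80/10/10 split is assumed)."""
    labels = [-100] * len(token_ids)   # -100 marks positions ignored by the MLM loss
    masked = list(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok
            r = random.random()
            if r < 0.8:
                masked[i] = mask_id
            elif r < 0.9:
                masked[i] = random.randrange(vocab_size)
    return masked, labels
```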

Fine-Tuning Distillation
During fine-tuning, we use the fine-tuned ViLT as the teacher model, and perform cross-modal attention distillation on downstream task data.
Vision-Language Understanding For vision-language understanding tasks such as visual reasoning and VQA, we fine-tune the student model with cross-modal attention distillation and soft label losses.
Image-Text Retrieval For retrieval tasks, we train the student with supervision from the teacher's cross-modal attention distributions and ground truth labels for training efficiency.

Datasets
Following previous work, we use four datasets during pre-training: COCO (Lin et al., 2014), Conceptual Captions (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011) and Visual Genome (Krishna et al., 2017). We evaluate our dual-encoder model on three vision-language understanding/classification datasets and one image-text retrieval dataset. Table 2 shows the statistics of the four datasets.
Visual Reasoning The NLVR2 (Suhr et al., 2019) dataset is a visual reasoning task, which aims to determine whether a textual statement describes a pair of images. Following previous work, we construct two image-text pairs as inputs, each consisting of one image and the textual statement. The final representations of the two pairs are fed into the classifier layer to obtain the prediction.
Visual Entailment The SNLI-VE (Xie et al., 2019) dataset aims to predict the relationship between an image and a text description. We treat SNLI-VE as a three-way classification task as in previous work.

Visual Question Answering
The task requires the model to answer a question based on an image. We evaluate on the widely used VQAv2 (Goyal et al., 2017) dataset. Following Anderson et al. (2018), we formulate the problem as a classification task with 3,129 answer candidates.
Image-Text Retrieval The task consists of two sub-tasks: image retrieval and text retrieval. We evaluate on the Flickr30K (Plummer et al., 2015) dataset and follow the split as in Karpathy and Fei-Fei (2015).

Implementation Details
The Transformer architecture of our dual-encoder model is the same as ViLT. The visual and textual Transformers both consist of 12 layers with a hidden size of 768 and 12 attention heads. The intermediate size of the feed-forward networks is 3072. Following ViLT, images are resized to 384 × 640 resolution and the patch size is 32 × 32. The maximum length of the text sequence is set to 40.
For pre-training, we train the model for 200K steps with a batch size of 1024. We use the pre-trained weights of ViLT to initialize the visual and textual encoders of the dual-encoder model. During fine-tuning, we train the model for 10 epochs with a batch size of 256 for VQA and SNLI-VE. For NLVR2, we train the model for 20 epochs with a batch size of 128. For Flickr30K, the model is trained for 20 epochs with a batch size of 512. We apply RandAugment (Cubuk et al., 2020) without color inversion and cutout. In both stages, we use Adam (Kingma and Ba, 2015) with $\beta_1 = 0.9$, $\beta_2 = 0.999$ for optimization. The learning rate is set to 1e-4, with a warmup ratio of 0.1 and linear decay. The weight decay is set to 0.01.
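The optimizer and schedule can be sketched as follows; AdamW and the LambdaLR formulation are implementation choices of this sketch, while the learning rate, betas, warmup ratio, and weight decay follow the values reported above.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_schedule(model, total_steps, lr=1e-4,
                                 warmup_ratio=0.1, weight_decay=0.01):
    """Adam-style optimizer with betas (0.9, 0.999), linear warmup for the first
    10% of steps, then linear decay to zero (sketch of the stated hyper-parameters)."""
    optimizer = AdamW(model.parameters(), lr=lr, betas=(0.9, 0.999),
                      weight_decay=weight_decay)
    warmup_steps = int(total_steps * warmup_ratio)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, lr_lambda)
```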

Vision-Language Understanding Results
We evaluate models on vision-language understanding tasks, including NLVR2, SNLI-VE and VQA. Table 3 presents the fine-tuning results on the three tasks. Compared with previous dual-encoder models like CLIP (Radford et al., 2021), our model achieves much better performance across the three vision-language understanding tasks, improving the averaged score from 57.83 to 73.85. Moreover, our dual-encoder model also achieves competitive performance compared with fusion-encoder models. The model retains 99.9% accuracy on SNLI-VE, 97.8% accuracy on NLVR2 and 95.5% performance on VQA while being more than 3 times faster than the teacher model (ViLT). Our model even outperforms PixelBERT-R50 on the NLVR2 task. The dual-encoder architecture requires less computation than fusion-encoder models and achieves faster inference. In addition, separate encoding enables pre-computation and caching of image or text representations, which is more efficient when handling large numbers of images and texts. Table 3 also shows the ablation results of our method. Performing distillation in the pre-training and fine-tuning stages both contributes positively to our dual-encoder model. Using cross-modal attention distillation during fine-tuning brings significant improvements over directly fine-tuning the dual-encoder model initialized from ViLT. Introducing pre-training distillation further improves the model.

Table 3: Results on vision-language understanding tasks. "Std" denotes training with original ground truth labels. "KD" denotes models trained using our distillation objectives. We report accuracy for the NLVR2 development and public test (test-P) sets and the SNLI-VE validation and test splits. We report vqa-score on the VQA test-dev split. † indicates our fine-tuned results of CLIP (Radford et al., 2021). The results are averaged over 3 runs for each task. The inference speed is evaluated on the NLVR2 dataset. We evaluate our model and ViLT on a single P100 GPU with the same hyper-parameters. The inference speedups of other models are taken from prior work.

Image-Text Retrieval Results
In addition to vision-language understanding tasks, we also evaluate our method on the image-text retrieval task. Our dual-encoder student model is trained with cross-modal attention distillation and contrastive losses.

Table 6: Effects of using different distilled knowledge. "Attn" is short for attention distributions. "Whole Attn" is the combination of "Uni-modal Attn" and "Cross-modal Attn". The results are an average of 3 runs for each task.

Discussion
Effects of Different Distilled Knowledge We investigate the effects of different knowledge used in distillation. We conduct experiments on vision-language understanding tasks with different distillation losses during fine-tuning. The dual-encoder student models are directly initialized from ViLT. Table 6 shows the results across tasks. First, we find that using soft label distillation achieves better performance than ground truth labels. However, the model trained with soft labels still gives a relatively low accuracy on the NLVR2 task. We further incorporate intermediate representations of the fusion-encoder model to improve the performance of dual-encoder models, comparing hidden states and different attention distributions. Using attention distributions brings larger improvements than hidden states across the three tasks. We then explore which part of the attention distributions is more critical: cross-modal attention or uni-modal attention. As shown in Table 6, mimicking the teacher's cross-modal attention distributions yields larger improvements than the uni-modal part, which validates that cross-modal interactions are more crucial for vision-language understanding tasks. We also find that using only cross-modal attention distributions performs better than using the whole attention distributions (cross-modal + uni-modal).
Effects of Different Layer Mapping Strategies for Distillation Inspired by Wang et al. (2020b), we perform the proposed knowledge distillation on the last layer of the teacher and student. To validate the effectiveness of distilling only the last layer, we compare it with a layer-wise strategy. The results are shown in Table 7. The last-layer distillation strategy obtains better performance on the NLVR2 and SNLI-VE tasks. In addition, using only the attention knowledge of the last layer requires less computation. Thus, distilling only the last layer is a more practical way to perform our cross-modal attention distillation.

Conclusion
In this work, we introduce a cross-modal attention distillation framework to improve the performance of dual-encoder models on vision-language understanding tasks. We employ the cross-modal attention knowledge of a fusion-encoder model, including image-to-text and text-to-image attention distributions, to guide the training of the dual-encoder model. Experimental results show that the distilled dual-encoder model achieves competitive performance on NLVR2, SNLI-VE and VQA, while having a much faster inference speed than fusion-encoder models.