Inflate and Shrink: Enriching and Reducing Interactions for Fast Text-Image Retrieval

By exploiting cross-modal attention, cross-BERT methods have achieved state-of-the-art accuracy in cross-modal retrieval. Nevertheless, the heavy text-image interactions in the cross-BERT model are prohibitively slow for large-scale retrieval. Late-interaction methods trade off retrieval accuracy against efficiency by exploiting cross-modal interaction only in the late stage, attaining a satisfactory retrieval speed. In this work, we propose an inflating and shrinking approach to further boost the efficiency and accuracy of late-interaction methods. The inflating operation plugs several codes into the input of the encoder to exploit the text-image interactions more thoroughly for higher retrieval accuracy. The shrinking operation then gradually reduces the text-image interactions through knowledge distilling for higher efficiency. Through an inflating operation followed by a shrinking operation, both the efficiency and the accuracy of a late-interaction model are boosted. Systematic experiments on public benchmarks demonstrate the effectiveness of our inflating and shrinking approach.


Introduction
Efficiency and accuracy are two key factors of a retrieval system. In many cases, designing a retrieval system is a matter of striving to balance efficiency and accuracy. Embedding-based methods (Ordonez et al., 2011; Gong et al., 2014; Faghri et al., 2018) are early works tackling cross-modal retrieval. They encode each image or text into a global embedding; the text-image similarity is then measured by the distance between the embeddings in the learned feature space. Since there are no interactions between text and image, embedding-based methods only need O(N + M) computational complexity to encode N images and M texts. This linear computational complexity makes embedding-based methods scalable to large-scale cross-modal retrieval, and they have hence been widely deployed in real-world cross-modal retrieval tasks.
Recently, inspired by the great success of the Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2019) in natural language processing, some methods (Li et al., 2020a; Li et al., 2020b; Fei et al., 2021) investigate the cross-BERT model to exploit cross-modal attention and devise several pre-training tasks. Benefiting from cross-modal attention and pre-training, they achieve significantly higher retrieval accuracy than their embedding-based counterparts. Nevertheless, the heavy text-image interaction from utilizing cross-modal attention leads to O(NM) computational complexity in encoding when calculating the similarities between N images and M texts. This quadratic computational complexity makes cross-BERT methods unsuitable for large-scale cross-modal retrieval applications.
Several methods attempt to gain satisfactory efficiency while maintaining high accuracy by trading off efficiency and accuracy. These methods can be coarsely grouped into two categories: two-stage methods (Sun et al., 2021; Geigle et al., 2021; Miech et al., 2021) and late-interaction methods (Lee et al., 2018; Khattab and Zaharia, 2020). The two-stage methods apply a retrieve-and-rerank strategy. Given a query, in the first stage, they conduct a coarse-level retrieval through an embedding-based method to obtain an initial top-t list of potentially relevant items. Then, in the second stage, the items in the top-t list are re-ranked through a powerful cross-BERT model. Since the heavy interactions are only deployed in the second stage, and t is smaller than the total number of items in the corpus, efficiency is boosted. Nevertheless, to obtain satisfactory retrieval accuracy, t must be large enough. In consequence, two-stage methods cannot achieve efficiency as high as embedding-based methods.
In parallel, the late-interaction methods improve the retrieval accuracy of the embedding-based methods through lightweight text-image interactions. To be specific, SCAN (Lee et al., 2018) and VisualSparta only conduct word-region interactions in the late stage, after the word/region features have already been extracted by the text encoder and the image encoder. Therefore, the text-image interactions in late-interaction methods are cheap. Empirically, due to only using lightweight interactions, late-interaction methods normally attain faster inference but lower accuracy than their two-stage counterparts (Sun et al., 2021; Geigle et al., 2021; Miech et al., 2021). In fact, late-interaction methods can be deployed in the first stage of a two-stage method, as an alternative to the embedding-based method, to further improve the performance of two-stage methods.

Figure 1: Text-to-Image Recall@1 on the MSCOCO1K benchmark versus retrieval complexity. Our models are not pre-trained on other multi-modal datasets.
In this work, we propose an inflating and shrinking approach to enhance the effectiveness and efficiency of existing late-interaction methods. We observe that the representing capability of late-interaction methods is limited by the text length and the region count. For instance, given a sentence of n words and an image with m regions, SCAN and VisualSparta only have access to n×m region-word interactions between the n words and m regions. To exploit region-word interactions more thoroughly, we propose an inflating operation: we plug k additional codes into the input of the text/image encoder besides the word/region features. This generates m + k image vectors in the output of the image encoder and n + k text vectors in that of the text encoder, so the number of region-word interactions increases from nm to (n + k)(m + k). Nevertheless, incorporating additional codes inevitably brings more computational cost and makes retrieval slower. To boost efficiency, we propose a shrinking operation based on distilling to reduce the interactions.
Through inflating followed by shrinking, we obtain two models: the base model and the fast model. Our base model obtains considerably higher retrieval accuracy than VisualSparta. Meanwhile, our fast model achieves retrieval accuracy comparable to VisualSparta but takes much less latency. We visualize the efficiency and accuracy comparisons with VisualSparta and cross-BERT models, including Unicoder-VL (Li et al., 2020a) and 12-in-1 (Lu et al., 2020), in Figure 1. Systematic experiments on public benchmarks, including MSCOCO1K and Flickr30K, demonstrate the effectiveness of the proposed inflating and shrinking approach.

Related Works
Embedding-based methods. Early embedding-based methods (Ordonez et al., 2011; Gong et al., 2014) depend on Canonical Correlation Analysis (CCA) (Hardoon et al., 2004) to project texts and images into a joint feature space. With the progress of deep learning, the architecture of mainstream embedding-based methods has evolved into a dual-encoder structure (Klein et al., 2015; Wang et al., 2016; Faghri et al., 2018; Dong et al., 2019; Wang et al., 2019a) consisting of an image encoder and a text encoder. In the retrieval phase, the text-image similarity is determined by the distance between the image embedding and the text embedding, generated by the two encoders separately.
Attention-based methods. By paying attention to key visual cues in the image and key words in the text, attention-based methods (Huang et al., 2017; Lee et al., 2018; Wei et al., 2020; Yu et al., 2021b) achieve considerably better performance than embedding-based methods in cross-modal retrieval. sm-LSTM (Huang et al., 2017) takes a multi-modal context-modulated attention scheme to selectively attend to a pair of instances of image and sentence. SCAN (Lee et al., 2018) computes the similarities between regions and words, and only counts the region-word pairs of high relevance. VSRN adopts graph convolution to attend to the region features based on the textual context. MMCA (Wei et al., 2020) incorporates self-attention, originally used in the Transformer, to enhance the region and word features. Drawing inspiration from the success achieved by BERT through pre-training, several cross-BERT methods (Li et al., 2020a; Li et al., 2020b; Fei et al., 2021) have been proposed for cross-modal retrieval. They stack a number of Transformer blocks and devise several pre-training tasks for facilitating multi-modal understanding, such as masked language modeling, masked region modeling, and text-image matching. After being pre-trained on large-scale datasets, they have achieved state-of-the-art performance in cross-modal retrieval. Nevertheless, cross-BERT methods need quadratic computational complexity; they are slow and not scalable.
Trade-off methods. To alleviate the heavy computational burden while maintaining high retrieval accuracy, several trade-off methods have been proposed. They can be coarsely grouped into two-stage methods and late-interaction methods. Two-stage methods (Geigle et al., 2021; Sun et al., 2021; Miech et al., 2021) take a retrieve-and-rerank strategy. In the first stage, they adopt an embedding-based method to conduct a coarse-level retrieval and obtain a top-t list. After that, in the second stage, they re-rank the top-t list using a cross-BERT method. Since the number of items for re-ranking, t, is smaller than the total number of reference items, two-stage methods achieve higher efficiency than cross-BERT methods. In parallel, the late-interaction trade-off methods (Khattab and Zaharia, 2020) apply a lightweight interaction in the late stage, after feature encoding, and also achieve higher efficiency than cross-BERT methods with their heavy interaction in encoding.

Background
Embedding-based methods (Wang et al., 2016; Faghri et al., 2018) and cross-BERT methods (Li et al., 2020a; Li et al., 2020b) are two mainstream approaches for measuring the similarity between an image I and a text T. Embedding-based methods encode each image as well as each text into a global embedding, and the text-image similarity is determined by the similarity between these global embeddings. Consequently, given N images and M texts, embedding-based methods only take O(N + M) complexity in encoding.

In contrast, cross-BERT methods take each text-image pair as input. Making use of the self-attention operation, cross-BERT methods achieve significantly higher retrieval accuracy than embedding-based methods, but self-attention brings significantly more computational cost: given N images and M texts, cross-BERT methods take O(NM) complexity to encode all NM text-image pairs. Thus, cross-BERT methods are prohibitively slow in large-scale retrieval. To trade off efficiency and accuracy, researchers proposed late-interaction methods (Khattab and Zaharia, 2020). Similar to embedding-based methods, they only need O(N + M) complexity in encoding, but they exploit lightweight attention in the scoring phase based on the extracted local embeddings. Taking advantage of attention, they obtain higher accuracy than embedding-based methods, and they remain efficient since the attention is lightweight. Below we introduce embedding-based methods and late-interaction methods in detail.

Embedding-based methods
To bridge the domain gap between texts and images, embedding-based methods map texts and images into the same feature space. They normally adopt a dual-encoder structure (Geigle et al., 2021; Sun et al., 2021) to encode texts and images separately.
Image encoder. Given an image I, following previous works (Lee et al., 2018), each image is represented by a set of m image region features R = {r_1, r_2, ..., r_m}, extracted by a Faster R-CNN (Ren et al., 2015) object detector pre-trained on the Visual Genome dataset (Krishna et al., 2017). The image region features R are the input of a Transformer encoder, and the attended region features R̄ are its output:

$$\bar{R} = [\bar{r}_1, \bar{r}_2, \cdots, \bar{r}_m] = \mathrm{Transformer}(R). \quad (1)$$

We term r̄_i (i ∈ [1, m]) an image fragment.
Text encoder. Following BERT (Devlin et al., 2019), each text T is converted into n words, which are further embedded into n word embeddings W = {w_1, ..., w_n}. The attended word embeddings W̄ are the output of the Transformer encoder:

$$\bar{W} = [\bar{w}_1, \cdots, \bar{w}_n] = \mathrm{Transformer}(W). \quad (2)$$

Scoring. To measure the text-image distance based on the attended word and region feature sets W̄ and R̄, common practice is to take the first token, i.e., [CLS], to summarize W̄/R̄ into a global embedding. The text-image similarity is then determined by the cosine similarity between the two embeddings:

$$s(\bar{r}_1, \bar{w}_1) = \cos(\bar{r}_1, \bar{w}_1). \quad (3)$$

Since the matching is conducted at the global embedding level, we term it global-level matching.

Triplet loss. Given a mini-batch of text-image pairs {(T_i, I_i)}_{i=1}^B, the text T_i is only relevant to the image I_i and is irrelevant to the other images in the batch, I_j (j ≠ i). The triplet loss aims to make the similarity between the positive text-image pair (T_i, I_i) larger than that between a negative pair (T_i, I_j), j ≠ i, by a margin m:

$$\mathcal{L}_{\mathrm{triplet}} = \sum_{j \neq i} \big[\, m - s(T_i, I_i) + s(T_i, I_j) \,\big]_+, \quad (4)$$

where s(T_i, I_j) is the similarity between T_i and I_j computed as in Eq. (3), m is the pre-defined margin, and [x]_+ = max(x, 0) is a clip function. Some approaches (Faghri et al., 2018; Lee et al., 2018; Geigle et al., 2021) conduct hard negative mining to enhance the effectiveness of the triplet loss.
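To make global-level matching and the training objective concrete, below is a minimal PyTorch-style sketch of Eq. (3) and Eq. (4); the paper's implementation uses PaddlePaddle, so all names here are illustrative, and the hard-negative variant follows the VSE++-style mining mentioned above (Faghri et al., 2018).

```python
import torch
import torch.nn.functional as F

def global_similarity(w_cls, r_cls):
    """Eq. (3): cosine similarity between [CLS] text/image embeddings.

    w_cls, r_cls: (B, d) batches of global embeddings.
    Returns the (B, B) matrix of all pairwise text-image similarities.
    """
    return F.normalize(w_cls, dim=-1) @ F.normalize(r_cls, dim=-1).t()

def triplet_loss(sim, margin=0.2, hard_negative=True):
    """Eq. (4): hinge-based triplet loss over in-batch negatives.

    sim: (B, B) similarity matrix; sim[i, i] is the positive pair (T_i, I_i).
    """
    pos = sim.diag().unsqueeze(1)             # (B, 1) positive scores
    cost = (margin - pos + sim).clamp(min=0)  # [m - s(T_i, I_i) + s(T_i, I_j)]_+
    cost.fill_diagonal_(0)                    # exclude the positive pair itself
    if hard_negative:                         # keep only the hardest in-batch negative
        return cost.max(dim=1).values.mean()
    return cost.sum(dim=1).mean()
```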

Late-interaction methods
Recent works (Lee et al., 2018; Khattab and Zaharia, 2020; Humeau et al., 2020) boost the accuracy of embedding-based methods while maintaining high efficiency through text-image interaction in the late stage. They utilize the same dual-encoder structure as embedding-based methods for encoding, but apply attention in the scoring phase. Given a text with fragments W̄ from the output of the encoder and an image with fragments R̄, the text-image similarity is calculated by interactions between these two bags of fragments, denoted as s(W̄, R̄). Specifically, VisualSparta implements s(W̄, R̄) by

$$s(\bar{W}, \bar{R}) = \sum_{i=1}^{n} \max_{j \in [1, m]} \cos(\bar{w}_i, \bar{r}_j), \quad (5)$$

where cos(u, v) measures the cosine similarity between the vectors u and v. As illustrated in Eq. (5), every text fragment interacts with the m image fragments through a maximum cosine similarity, and the scores from the n text fragments are summed up to generate the text-image similarity.
To calculate the similarity between a text-image pair, it takes m × n fragment interactions in total.
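As a sketch, the fragment-level score of Eq. (5) can be written as follows, continuing the PyTorch-style notation above (names are illustrative):

```python
import torch.nn.functional as F

def late_interaction_score(W_bar, R_bar):
    """Eq. (5): s(W, R) = sum_i max_j cos(w_i, r_j).

    W_bar: (n, d) attended text fragments; R_bar: (m, d) attended image fragments.
    """
    sim = F.normalize(W_bar, dim=-1) @ F.normalize(R_bar, dim=-1).t()  # (n, m): n*m interactions
    return sim.max(dim=1).values.sum()  # max over image fragments, summed over text fragments
```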
Retrieval latency. The retrieval latency consists of the encoding latency for extracting the text/image fragments W̄/R̄ and the scoring latency for computing s(W̄, R̄) in Eq. (5). In practice, in the text-to-image retrieval application, the image fragments R̄ have been extracted in the offline phase, before the query comes. Given a query, the encoder only needs to extract the query's fragments, taking constant computational complexity, O(1), with respect to the corpus size. In contrast, the scoring is conducted between the query's fragments and the fragments of all N images in the corpus, taking linear computational complexity, O(N). Thus, in the large-scale retrieval scenario, the inference speed is mainly determined by the scoring latency.
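This offline/online split can be sketched as follows; image_encoder, text_encoder, and corpus_images are hypothetical stand-ins, and a production system would use an optimized index rather than a Python list:

```python
import torch

# Offline: encode all N corpus images once and cache their fragments.
image_index = [image_encoder(img) for img in corpus_images]  # N tensors of shape (m, d)

def retrieve(query_text, top_k=10):
    W_bar = text_encoder(query_text)  # O(1) w.r.t. corpus size: only the query is encoded
    scores = torch.stack([late_interaction_score(W_bar, R_bar)  # O(N): score against all
                          for R_bar in image_index])            # cached image fragments
    return scores.topk(top_k).indices  # indices of the best-matching images
```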

Method
Section 4.1 introduces our inflating operation, which exploits the fragment interactions more thoroughly for higher retrieval accuracy. Section 4.2 presents the proposed shrinking operation, which reduces the fragment interactions for higher efficiency.

Inflating
Benefiting from fragment interactions, late-interaction methods achieve higher retrieval accuracy than embedding-based methods. Nevertheless, as shown in Eq. (5), the scale of interactions in late-interaction methods, nm, is limited by the number of word features of the sentence (n) and the number of region features of the image (m).
To exploit more informative interactions, we devise a set of synthetic tokens C^T = {c^T_1, ..., c^T_k} as additional input for the text encoder and a set of synthetic tokens C^I = {c^I_1, ..., c^I_k} as additional input for the image encoder. The synthetic tokens are similar to the [CLS] token used in BERT; however, a single [CLS] token has limited representation capability, whereas the devised synthetic tokens C^I and C^T are far more expressive by using multiple codes. In the implementation, C^T and C^I are parameters of the model, randomly initialized and updated by back-propagating the gradients in the training phase. The input of the image encoder is then the concatenation of the image region features and the inflating codes, [R; C^I]. In the same way, the input of the text encoder is [W; C^T]. They are encoded in the same manner as Eq. (1) and Eq. (2), respectively:

$$\bar{R}_{\mathrm{inflate}} = \mathrm{Transformer}([R; C^I]) = [\bar{r}_1, \cdots, \bar{r}_m, \bar{c}^I_1, \cdots, \bar{c}^I_k] \in \mathbb{R}^{(m+k) \times d},$$
$$\bar{W}_{\mathrm{inflate}} = \mathrm{Transformer}([W; C^T]) = [\bar{w}_1, \cdots, \bar{w}_n, \bar{c}^T_1, \cdots, \bar{c}^T_k] \in \mathbb{R}^{(n+k) \times d}.$$

Then the text-image similarity is determined by fragment-level matching between W̄_inflate and R̄_inflate in the same manner as Eq. (5).

Figure 2: The proposed inflating operation plugs several codes into the input of the image/text encoders to thoroughly exploit fragment-level interactions. The shrinking operation then distills the knowledge of the inflated model into the shrunken base model, which is further shrunken to a fast model.

Through plugging additional codes into the text encoder and the image encoder, our inflated model exploits the fragment interactions more thoroughly. More precisely, when computing s(W̄, R̄) in Eq. (5), only nm word-region interactions are available. In contrast, when computing s(W̄_inflate, R̄_inflate), (n + k)(m + k) interactions are conducted. Our experiments show that the richer fragment interactions brought by inflating boost retrieval accuracy. Nevertheless, the additional interactions inevitably increase the computational cost and make retrieval slower.
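A sketch of the inflating operation under the same assumptions: the codes are ordinary learnable parameters concatenated to the fragment sequence, so inside the Transformer they attend to, and are attended by, the real word/region fragments. The wrapped transformer is assumed to map (batch, seq, d) to (batch, seq, d), e.g. an nn.TransformerEncoder with batch_first=True.

```python
import torch
import torch.nn as nn

class InflatedEncoder(nn.Module):
    """Adds k learnable inflating codes to a fragment encoder (illustrative)."""

    def __init__(self, transformer, k=32, d=768):
        super().__init__()
        self.transformer = transformer
        # C^I or C^T: randomly initialized, updated by back-propagation.
        self.codes = nn.Parameter(torch.randn(k, d) * 0.02)

    def forward(self, fragments):
        # fragments: (batch, m, d) region features or (batch, n, d) word embeddings.
        codes = self.codes.unsqueeze(0).expand(fragments.size(0), -1, -1)
        x = torch.cat([fragments, codes], dim=1)  # [R; C^I] or [W; C^T]
        return self.transformer(x)                # (batch, m+k, d) inflated fragments
```

With k codes on both sides, the same score of Eq. (5) then runs over (n + k)(m + k) fragment pairs.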

Shrinking
Different from inflating, which expands the scale of interactions to enhance effectiveness, shrinking aims to reduce the interactions to boost efficiency. The idea behind shrinking is knowledge distillation, originally developed for the classification task (Hinton et al., 2015). By exploiting contrastive learning (Gutmann and Hyvärinen, 2010), knowledge distillation can be naturally extended to the retrieval task. Assume that, in a batch of text-image pairs {(T_i, I_i)}_{i=1}^B, T_i is only relevant to I_i and irrelevant to the rest. Let s^t_{ij} denote the similarity between T_i and I_j from the teacher model and s^s_{ij} denote that from the student model. The distillation loss is devised as

$$\mathcal{L}_{\mathrm{distill}} = -\sum_{i=1}^{B} \sum_{j=1}^{B} \frac{\exp(s^t_{ij}/\tau_t)}{\sum_{j'=1}^{B} \exp(s^t_{ij'}/\tau_t)} \log \frac{\exp(s^s_{ij}/\tau_s)}{\sum_{j'=1}^{B} \exp(s^s_{ij'}/\tau_s)}, \quad (6)$$

where τ_t and τ_s are pre-defined temperature factors controlling the softness.
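A sketch of this loss in the same PyTorch-style notation, assuming Eq. (6) as reconstructed above: the teacher's in-batch similarity rows, softened by τ_t, serve as target distributions for the student's rows softened by τ_s.

```python
import torch.nn.functional as F

def shrink_distill_loss(sim_teacher, sim_student, tau_t=12.0, tau_s=12.0):
    """Eq. (6): contrastive knowledge distillation over in-batch similarities.

    sim_teacher, sim_student: (B, B) text-image similarity matrices s^t and s^s.
    """
    target = F.softmax(sim_teacher.detach() / tau_t, dim=1)  # softened teacher targets
    log_pred = F.log_softmax(sim_student / tau_s, dim=1)     # softened student log-probs
    return -(target * log_pred).sum(dim=1).mean()            # cross-entropy per row, averaged
```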
Our shrinking distills the knowledge from the inflated model with intensive interactions into a student model with fewer interactions to boost efficiency. To be exact, we conduct the shrinking in two steps. As shown in Figure 2, in the first step, we distill the text-image similarity s(W̄_inflate, R̄_inflate) of the teacher model into the text-image similarity s(W̄, R̄) of the first student model. We term the student model from this first-step shrinking the shrunken base model. In the second step, we distill s(W̄, R̄) from our shrunken base model into s(w̄_1, r̄_1), computed in the manner of Eq. (3), of the second student model. The second student model has degenerated to the embedding-based method; it thus only needs a single global interaction and is extremely fast. We term the second student model the shrunken fast model.
Relation with existing distilling methods. Existing knowledge distilling methods (Jiao et al., 2020) normally distill the knowledge from a large-scale teacher model to a small-scale student model for faster inference. Nevertheless, the architecture gap between the student model and the teacher model inevitably leads to considerable losses. In contrast, in the proposed shrinking operation, the encoder in the teacher model adopts the same architecture as the encoder in the student model; the only difference between the teacher and the student lies in the scale of interactions in the scoring phase. Since there is no architecture gap between the two encoders, the teacher can effectively transfer its knowledge to its student in our shrinking operation.

Table 1: Results under different settings (columns: Settings, Codes, Frag. Inter., and T-to-I/I-to-T Recall on Flickr30K and MSCOCO1K).

Experiments
Datasets. MSCOCO consists of 123,287 images, each paired with 5 ground-truth captions. We adopt the Karpathy split (Karpathy and Li, 2015) with 113,287 images for training and 1,000 images for testing. Flickr30K contains 31,783 images, each with 5 annotated textual descriptions. Following Karpathy and Li (2015), we use 1,000 images for testing.

Settings.
We conduct experiments on an NVIDIA V100 GPU with float16 operations. The input sequence length is set to 32 for the text and image encoders, and the weights of the two encoders are shared. For each image, we detect 100 bounding boxes using a Faster R-CNN pre-trained on Visual Genome (Krishna et al., 2017) by Anderson et al. (2018). We cluster the 100 bounding boxes into 32 clusters, and the 32 cluster centers are the input of the image encoder. We train all models using the same batch size as in the baseline experiments. To save memory and computation cost, we apply a tiny BERT model with only 3 layers of Transformer blocks, since our experiments show that the 3-layer model achieves performance comparable to the 12-layer model. We train the inflated model using the triplet loss in Eq. (4).

We set the margin m to 0.2 for the fast model and 0.6 for the base and inflated models. In Eq. (6), the temperatures τ_t and τ_s are both set to 12 when shrinking to a base model, and set to 2 and 12, respectively, when shrinking to a fast model. The implementation is based on the PaddlePaddle deep learning framework.

Inflating
In the upper part of Table 1, we show the performance of the base model and the fast model without inflating or shrinking. The fast model is the embedding-based method relying on global-level matching to compute the text-image similarity as in Eq. (3). The base model exploits the fragment-level interactions and obtains the similarity based on Eq. (5). It is straightforward to observe from Table 1 that the base model, exploiting fragment-level interactions, attains higher retrieval accuracy than the fast model with only global-level matching.

In the lower part of Table 1, we show the performance improvement from inflating the base model with various numbers of codes. The codes we plug into the input of the encoders enrich the fragment interactions in the scoring phase, which is beneficial to retrieval. We vary the number of codes, k, from 16 to 96, leading to 48² to 128² interactions. As shown in Table 1, the inflated model with more fragment interactions outperforms the base model. Generally, more codes plugged into the input of the encoder tend to yield larger improvements.

Shrinking
We first evaluate straightforward shrinking without inflating. The base model (B) without inflating is the teacher for distilling, and the fast model (F) is the student. The results are shown in the first part of Table 2. As shown in the table, after shrinking (B → F), the fast model obtains significantly higher retrieval accuracy than the fast model without shrinking. Meanwhile, after shrinking, the student model achieves accuracy comparable to the teacher model (B).
Then we evaluate the proposed method, inflating followed by shrinking, under two settings: (i) inflating the base model and then shrinking it directly to a fast model (I → F); and (ii) inflating the base model and then shrinking it in two steps, first to a base model and then to a fast model (I → B → F).

Table 2: Results of shrinking under different settings (columns: Settings, Codes, Frag. Inter., and T-to-I/I-to-T Recall on Flickr30K and MSCOCO1K).

Table 2 presents the results of these two settings. First, the fast model from an inflated teacher (I → F) gains better performance than that from a base-model teacher (B → F). For instance, the Recall@1 of text-to-image retrieval improves from 58.2 to 60.9 on Flickr30K. Second, the multi-step shrinking (I → B → F) further boosts the fast model to a higher Recall@1, 62.5. And the intermediate base model (I → B) also benefits from the inflated model, gaining a 64.5 Recall@1.

Efficiency
We evaluate the text-to-image retrieval latency. For embedding-based methods and late-interaction methods, the image features for retrieval have been encoded before the text query comes. Thereafter, in the retrieval phase, the whole latency consists of the encoding latency for the query text only and the scoring latency for computing the similarity between the query and all images in the corpus.

Table 3 compares the encoding latency with the cross-BERT method Unicoder-VL (Li et al., 2020a). Since our fast, base, and inflated models only need to encode the query text, their encoding time is invariant to the number of candidate items. In contrast, the cross-BERT method takes text-image pairs as input and needs to encode all images (candidates) in the retrieval phase. Accordingly, as shown in Table 3, the encoding latency of our fast, base, and inflated models is much less than that of the cross-BERT method, Unicoder-VL.

Table 4 shows the scoring latency. In theory, the scoring latency is linear in the number of fragment interactions. As shown in the table, our fast model takes only 0.4 ms latency in the scoring, which is much less than that of our base model, inflated model, and VisualSparta. It demonstrates the significant efficiency boost brought by shrinking. Later, we will show that our fast model attains cross-modal retrieval accuracy comparable to VisualSparta.

Comparisons with existing methods
We compare with existing methods in Table 5, which are grouped into three categories. The first category, cross-BERT methods, achieves high retrieval accuracy through pre-training, but these methods are prohibitively slow due to their quadratic complexity. Compared with them, our shrunken base model can surpass some of them, such as ViLT (Kim et al., 2021), ViLBERT (Lu et al., 2019), and 12-in-1 (Lu et al., 2020). Note that ViLT, ViLBERT, and 12-in-1 are pre-trained on large-scale multi-modal datasets, whereas ours is not pre-trained on these datasets. Pre-training might further improve our model, but our limited computing resources cannot afford the huge computational cost of pre-training on huge-scale datasets.

Table 5: Comparison with existing methods (columns: Method, Pre., and T-to-I/I-to-T Recall@1/5/10 on MSCOCO1K and Flickr30K).

We further compare with the second category of methods, two-stage methods using an embedding-based method in the first stage and a cross-BERT method in the second stage. As mentioned, they are more efficient than cross-BERT methods. But to maintain high accuracy, they have to re-rank a number of candidates and thus are still slow.
At last, we compare with the third category of methods, including embedding-based methods and late-interaction methods. Due to a lack of text-image interactions, embedding-based methods cannot achieve competitive accuracy. In contrast, late-interaction methods such as SCAN (Lee et al., 2018) and VisualSparta achieve a good trade-off between accuracy and efficiency. Our base and fast models also fall into this category. Compared with the existing late-interaction methods, both our base model and our fast model from inflating and shrinking achieve a considerably better trade-off between efficiency and accuracy. To be specific, our base model obtains higher accuracy than VisualSparta with a comparable scale of computation cost. Meanwhile, our fast model obtains retrieval accuracy comparable to VisualSparta but takes much less latency.

Figure 3 visualizes the comparisons among models without inflating and shrinking, models with only inflating, models with only shrinking, and models with both inflating and shrinking. The x-axis represents the retrieval complexity, determined by the number of fragments in the scoring phase. The curve with gray circles in Figure 3 denotes the performance of baseline models with various numbers of interactions. In the implementation, we take the first l fragments from the image encoder and the first l fragments from the text encoder into consideration when computing the text-image score. We vary l among {1, 2, 4, 8, 16, 32}, which takes {1², 2², 4², 8², 16², 32²} text-image interactions. When l = 1, it is equivalent to our fast model. When l = 32, it is equivalent to our base model. In parallel, the curve with green squares in Figure 3 shows the performance of models shrunken from the base model. The curve with blue triangles in Figure 3 demonstrates the performance of models obtained by inflating the base model with {16, 32, 64, 96} codes. Moreover, the curve with red stars shows that of our base and fast models from inflating and shrinking. Comparing the blue curve with the gray curve, it is straightforward to infer that inflating effectively boosts performance by enriching the interactions. Besides, comparing the green curve with the gray curve, we observe that a shrunken model achieves considerably higher accuracy than its counterpart with the same complexity. At last, as shown in Figure 3, the fast and base models from inflating and shrinking achieve the best trade-off between efficiency and accuracy.

Visualization of codes in inflating
In Figure 4, we use BertViz (Vig, 2019) to visualize the attention weights of a Transformer layer from an inflated model. Each connection represents the relevance between the two tokens at its ends, and the brightness of the connection represents the attention strength. For the inflated model, the first 32 tokens are text words and the other 32 tokens are the codes plugged in when inflating. The figure shows that the codes pay nontrivial attention to the text words.
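For reference, a BertViz invocation of this kind looks roughly as follows; this is a sketch assuming a HuggingFace-style encoder that returns attention weights, whereas the paper visualizes its own inflated encoder, not bert-base-uncased:

```python
from bertviz import head_view
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("a man rides a horse on the beach", return_tensors="pt")
attention = model(**inputs).attentions  # one (1, heads, seq, seq) tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(attention, tokens)            # renders the interactive attention view
```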

Conclusion
In this paper, we propose an inflating and shrinking approach to boost the accuracy and efficiency of cross-modal retrieval. The inflating operation plugs multiple codes into the input of the image encoder and the text encoder; it enriches the text-image interactions and improves the retrieval accuracy. The shrinking operation gradually reduces the text-image interactions through knowledge distilling to improve the retrieval speed. Systematic experiments on two widely used public benchmarks demonstrate the effectiveness and efficiency of the proposed inflating and shrinking approach.

A Appendix
The influence of fragment interactions. To obtain the text-image similarity, our base model exploits 32² fragment interactions, and the fast model only uses 1 fragment interaction. We evaluate the performance of the intermediate models with {1², 2², 4², 8², 16², 32²} fragment interactions. In the implementation, we use the first l fragments from the output of the encoder when computing the text-image score and vary l among {1, 2, 4, 8, 16, 32}. When l = 1, it is equivalent to our fast model. When l = 32, it is equivalent to our base model. The results of the models with different l are presented in Table 6. Note that we do not use inflating and shrinking in these models.
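In the notation of the earlier sketches, this truncation simply slices both fragment sets before scoring (illustrative, reusing late_interaction_score from above):

```python
def truncated_score(W_bar, R_bar, l):
    # Keep only the first l text and image fragments: l * l interactions.
    return late_interaction_score(W_bar[:l], R_bar[:l])
```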
Table 6: Performance of models with different numbers of fragment interactions (T-to-I and I-to-T Recall on MSCOCO1K).

We can observe from Table 6 that more fragment interactions yield better performance. In Table 1 of the main manuscript, the inflating operation plugs 32 codes into the input of the encoder, which further increases the number of fragment interactions to 128 × 128. It yields a higher retrieval accuracy.
Alternative settings for the student in shrinking. We have discussed the default setting for the shrinking operation, "B → F", in Table 2 of the main manuscript. Additionally, we conduct experiments on shrinking a base model to several intermediate models. As shown in Table 7, a shrunken model with more interactions achieves better performance after knowledge distilling. Compared with the models in Table 6 without inflating and shrinking, the shrunken models are considerably improved.
Alternative settings for the teacher in shrinking. Note that, in our shrinking operation, both the teacher and the student adopt the dual-encoder structure for encoding. An alternative choice is using a cross-BERT model as the teacher. We compare with this alternative in Table 8.

Table 8: Results of distilling from different teachers (T-to-I and I-to-T Recall on MSCOCO1K).

As shown in the table, distilling from a cross-BERT teacher to a fast model does not achieve competitive retrieval accuracy. This is due to the large gap between the architecture of the teacher and that of the student.
Alternative pipelines for inflating and shrinking. Since we can construct intermediate models with different numbers of fragments and codes, there are many choices for inflating and shrinking. We have investigated the effectiveness of the "I → B" and "I → B → F" pipelines in Table 2 of the main manuscript. In this section, we explore several other options for multi-step inflating and shrinking. We mainly vary three key factors: the number of additional codes, the number of fragments, and the number of shrinking steps. We conduct experiments following three main strategies: strategy I shrinks interactions and then removes codes; strategy II removes codes and then shrinks interactions; and strategy III removes codes and shrinks interactions at the same time. Table 9 shows the results. As shown in the table, strategy II performs marginally better than the others. Finally, pipeline 4 is selected as our final strategy, which obtains excellent performance in a simple manner.