Hashing based Efficient Inference for Image-Text Matching

Image-text matching has been a popular research topic that bridges vision and language through semantic understanding. Recent works mainly focus on exploring the interactions between images and sentences to improve performance without considering inference efficiency. Specifically, for large scale databases, it is unacceptable to perform such time-consuming mechanisms between a query (text/image) and each candidate datapoint (image/text) in the whole retrieval set during inference. To tackle this problem, we propose a novel hashing based efficient inference module called HEI, which can be plugged into existing frameworks to speed up the inference step without reducing retrieval performance. In detail, HEI learns to map the original datapoints into short binary hash codes that coarsely preserve the heterologous matching relationship. Thus, in the inference phase, the proposed HEI module uses the hash codes to quickly select a few candidate datapoints from the retrieval set for a given query. Then, the image-text matching model fine-ranks this small candidate set to find the matching datapoint. Extensive experiments on two widely used benchmarks, MS-COCO and Flickr30k, with four baseline methods demonstrate the efficiency and effectiveness of the proposed HEI module.


Introduction
Language and vision understanding plays a fundamental role in how humans understand the real world, and a large number of works have been proposed to bridge these two modalities. Image-text matching is one of the fundamental topics in this field, which benefits a series of downstream applications such as visual captioning (Wang et al., 2018) and visual grounding (Chen et al., 2018; Plummer et al., 2017). Specifically, given an image (text), the target is to match the closest textual description (image) for that image (text).
Early works (Karpathy and Fei-Fei, 2015; Wang et al., 2016; Niu et al., 2017; Faghri et al., 2017) achieve this goal by learning two modality-specific deep neural networks that directly map all datapoints from the two modalities into a common joint space without using attention mechanisms, and then measure their similarities by feature distances in the joint space. Compared with these methods, recent works (Lee et al., 2018; Chen et al., 2020) mainly focus on incorporating various attention mechanisms into image-text matching models to explore the fine-grained interactions between vision and language. With attention mechanisms, image-text matching models are able to filter out irrelevant information and find fine-grained cues, achieving strong matching performance. For example, CAMP takes comprehensive and fine-grained cross-modal interactions into account, and also properly handles negative pairs and irrelevant information with an adaptive gating scheme to improve the matching performance.
Although existing attention mechanism based methods achieve great performance, they do not take inference efficiency into account. Specifically, since attention mechanisms are time-consuming, for large scale databases it is unacceptable to perform such complex attention computations between the query (text/image) and each candidate datapoint (image/text) in the whole retrieval set during inference. Thus, it is critical to improve the inference speed of these methods.
Intuitively, if a small candidate set containing the positive datapoints can be quickly selected, the image-text matching models can be greatly sped up by fine-ranking only this small candidate set instead of the whole retrieval set. The key challenge is then how to quickly select such a small candidate set. Hashing is widely used in data search owing to its fast retrieval speed. Although hashing can hardly perform accurate matching on its own, it is capable of quickly selecting a high quality candidate set that contains the positive datapoints.
Hence, in this paper, we propose a novel hashing based efficient inference module, called HEI, which can be plugged into existing attention mechanism based image-text matching frameworks to speed up the inference step without reducing the retrieval performance. Specifically, a matching score based hashing loss is proposed, which consists of two items: one makes the Hamming similarity between the hash codes of a matching datapair as large as possible; the other makes the Hamming similarity between the hash codes of a mismatching datapair no larger than their corresponding matching score. By minimizing the proposed hashing loss, the HEI module is optimized to map the original datapoints into short binary hash codes that coarsely preserve the heterologous matching relationship between datapoints. Thus, the trained HEI module can be used to speed up the inference step without reducing the retrieval performance. Extensive experiments on two widely used benchmarks, MS-COCO and Flickr30k, with four baselines demonstrate the effectiveness of the proposed HEI module.

Text-image Matching
Recently, many image-text matching methods have been proposed, which can be roughly grouped into one-to-one matching methods, which learn the correspondence between the whole image and the whole text, and many-to-many matching methods, which learn the correspondence between image regions and text words.
The one-to-one matching methods (Wang et al., 2016; Kiros et al., 2014; Zhang and Lu, 2018; Zheng et al., 2020) mainly aim to explore the heterologous relationship globally by mapping whole images and full texts into a common feature space. However, since these methods do not explore the correspondence between image regions and text words, they may learn sub-optimal features, which damages text-image matching performance. By utilizing various cross-modal attention mechanisms, many-to-many matching methods can explore the correspondence between image regions and text words; thus, these attention mechanism based methods achieve state-of-the-art performance. For instance, BFAN eliminates partially irrelevant words and regions from the shared semantics of image-text pairs and achieves state-of-the-art performance on several benchmark datasets. IMRAM (Chen et al., 2020) proposes a recurrent attention memory which incorporates a cross-modal attention unit and a memory distillation unit to refine the correspondence between image regions and text words. However, the attention mechanisms used by many-to-many matching methods are usually complicated, with high computational complexity. Hence, it is unacceptable to perform such time-consuming attention mechanisms between the query (text/image) and each candidate datapoint (image/text) in the whole retrieval set during inference, especially for large scale databases.
Different from previous methods, our model explores hashing technology to improve the inference speed of existing many-to-many text-image matching methods without reducing their performance.

Cross-Modal Hashing
The core of cross-modal hashing is to project datapoints of different modalities into compact binary hash codes while preserving the semantic similarity of the original datapoints. Then, in the cross-modal retrieval phase, the datapoints of the retrieval set can be sorted by the Hamming distance between their binary hash codes, computed with the 'XOR' operation, which is extremely fast. Due to this advantage, a number of cross-modal hashing methods have been proposed (Su et al., 2019; Lin et al., 2020; Tu et al., 2020; Shi et al., 2019). For example, SDCH (Lin et al., 2020) utilizes a semantic label branch to preserve the semantic information of the learned features by integrating an inter-modal pairwise loss, a cross-entropy loss, and a quantization loss.
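The XOR-based distance computation mentioned above can be sketched as follows. This is a minimal illustration (not tied to any particular method) that packs {-1, +1} codes into bytes and counts differing bits:

```python
import numpy as np

def pack_codes(b):
    """Pack a {-1, +1} hash code matrix into uint8 bytes for compact storage."""
    bits = (b > 0).astype(np.uint8)                # map {-1, +1} -> {0, 1}
    return np.packbits(bits, axis=-1)

def hamming_distance(packed_q, packed_db):
    """Hamming distances between one packed query code and a packed database."""
    xor = np.bitwise_xor(packed_q[None, :], packed_db)   # bits that differ
    # popcount each byte by unpacking, then sum over the code dimension
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

rng = np.random.default_rng(0)
db = rng.choice([-1, 1], size=(1000, 64))          # 1000 database codes, 64 bits
q = db[42].copy()                                  # query identical to item 42
dists = hamming_distance(pack_codes(q), pack_codes(db))
print(dists[42])                                   # -> 0
```

Because the XOR and popcount work on whole bytes at once, scanning even a large code database is far cheaper than running a cross-attention model per pair.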
However, since these hashing methods belong to approximate nearest neighbour (ANN) search technology, they can hardly find the exact matching datapoint for a query. Hence, few works explore hashing technology for text-image matching. To the best of our knowledge, this is the first work to explore the use of hashing to improve the inference speed of existing attention mechanism based image-text matching methods.

Methodology
As shown in Figure 1, different from the architecture of existing matching models, our framework introduces an extra hashing based efficient inference module, called HEI, which consists of an image modal hashing layer and a text modal hashing layer. Each hashing layer is a fully-connected layer with k units, where k is the hash code length.

Problem formulation and notations
Without loss of generality, suppose there are an image x_i with its region-level visual features denoted as V_i = [v_1^i, ..., v_m^i], and a text y_j with its word-level textual features denoted as T_j = [t_1^j, ..., t_n^j]. The goal of image-text matching is to calculate a matching score s_ij for the image x_i and the text y_j based on their features V_i and T_j. Moreover, if the image x_i and the text y_j match, the matching score s_ij should be large; otherwise s_ij should be small.
Furthermore, the goal of the hashing based efficient inference module is to learn two modality-specific hashing layers that map datapoints of the corresponding modality into binary hash codes with the heterologous matching relationship preserved.

Image region-level visual features

To obtain the region-level visual features of an image x_i, we adopt the detector of (Anderson et al., 2018), pre-trained on the Visual Genome dataset (Krishna et al., 2017), as the backbone to extract the top m region proposals of the image. Then, by average-pooling the spatial feature map, a feature vector f_j^i ∈ R^2048 is obtained for the j-th region proposal. Finally, we obtain the d-dimensional region features with a linear projection layer:

v_j^i = W_v f_j^i + b_v,

where W_v and b_v are to-be-learned parameters, and v_j^i is the visual feature for the j-th region of image x_i.
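The projection above can be sketched in PyTorch. This is an illustrative fragment only; the class and variable names are our own, and the pooled 2048-dimensional region features are assumed to be given:

```python
import torch
import torch.nn as nn

class RegionProjection(nn.Module):
    """Linear projection of pooled region features into the d-dimensional joint space."""
    def __init__(self, in_dim=2048, d=1024):
        super().__init__()
        # W_v and b_v from the equation above live inside this linear layer
        self.fc = nn.Linear(in_dim, d)

    def forward(self, pooled):        # pooled: (batch, m, 2048)
        return self.fc(pooled)        # -> (batch, m, d)

regions = torch.randn(2, 36, 2048)    # 2 images, m = 36 regions each
proj = RegionProjection()
V = proj(regions)
print(V.shape)                        # torch.Size([2, 36, 1024])
```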

Text word-level textual features
To obtain the textual features of an input text y_j with n words, we first embed each word w_i^j of the input text into a 300-dimensional vector e_i^j. Then, to enhance the word-level features with sufficient context information, we use a single-layer bi-directional GRU (Cho et al., 2014) with d-dimensional hidden states to summarize information from both the forward and backward directions of the input text y_j:

h_i^j,→ = GRU→(e_i^j, h_{i−1}^j,→),  h_i^j,← = GRU←(e_i^j, h_{i+1}^j,←),

where h_i^j,→ and h_i^j,← denote hidden states from the forward GRU and the backward GRU, respectively. Then, the textual feature of the word w_i^j in the text y_j is defined as the average of the two hidden states:

t_i^j = (h_i^j,→ + h_i^j,←) / 2.
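A minimal sketch of this word-level encoder in PyTorch follows, assuming the forward and backward hidden states of each word are averaged; the vocabulary size and class name are illustrative:

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Embed words and contextualize them with a single-layer bi-directional GRU."""
    def __init__(self, vocab_size=10000, embed_dim=300, d=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True yields forward and backward hidden states per word
        self.gru = nn.GRU(embed_dim, d, num_layers=1,
                          batch_first=True, bidirectional=True)
        self.d = d

    def forward(self, word_ids):          # word_ids: (batch, n)
        x = self.embed(word_ids)          # (batch, n, 300)
        h, _ = self.gru(x)                # (batch, n, 2 * d)
        # average the forward and backward hidden states of each word
        return (h[..., :self.d] + h[..., self.d:]) / 2    # (batch, n, d)

enc = WordEncoder()
T = enc(torch.randint(0, 10000, (2, 12)))   # 2 texts, 12 words each
print(T.shape)                               # torch.Size([2, 12, 1024])
```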

General Attention Framework
Existing attention mechanism based image-text matching methods mainly learn to associate shared semantics between the region-level features V_i of image x_i and the word-level features T_j of text y_j through various cross-attention mechanisms to calculate the matching score s_ij, which can be formulated as follows:

s_ij = CAM(V_i, T_j; W),

where CAM(·; W) denotes the cross-modal attention mechanism and W is the set of learnable parameters. For example, in BFAN, CAM(·; W) denotes the focal attention mechanism proposed in the original paper. Then, to maximize the matching scores of matching image-text pairs and minimize those of mismatching datapairs, a hinge-based triplet ranking loss with emphasis on the hard negatives is used as the loss function. Specifically, given a matching image-text pair x_i and y_j, we denote their matching score as s_ij; j̄ = argmax_{t≠j} s_it denotes the hard negative when using the image to match texts, and ī = argmax_{t≠i} s_tj denotes the hard negative when using the text to match images. The ranking loss is then formulated as:

L_rank = [α − s_ij + s_ij̄]_+ + [α − s_ij + s_īj]_+,

where α is the margin for the ranking loss, and [x]_+ = max(x, 0). Finally, after the matching model is optimized, a given query datapoint is used to calculate the matching score with each datapoint in the retrieval set through the cross-attention mechanism to find the best match. However, the cross-attention mechanism is time-consuming, which means it is unacceptable to calculate a matching score between the query and every point in the retrieval set with the attention mechanism during inference. Thus, we propose a hashing based efficient inference module to improve the inference speed.
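The ranking loss with hard negatives can be computed over a batch score matrix as sketched below. This is a standard VSE++-style implementation shown only as an illustration; the function name and margin value are our own:

```python
import torch

def hard_negative_ranking_loss(scores, alpha=0.2):
    """scores[i, j] = matching score between image i and text j; diagonal pairs match."""
    n = scores.size(0)
    pos = scores.diag()                                   # s_ii for each matching pair
    mask = torch.eye(n, dtype=torch.bool)
    # hardest negative text for each image (row max, excluding the positive)
    neg_t = scores.masked_fill(mask, float('-inf')).max(dim=1).values
    # hardest negative image for each text (column max, excluding the positive)
    neg_i = scores.masked_fill(mask, float('-inf')).max(dim=0).values
    loss = (alpha - pos + neg_t).clamp(min=0) \
         + (alpha - pos + neg_i).clamp(min=0)             # hinge on both directions
    return loss.mean()

scores = torch.tensor([[0.9, 0.1], [0.2, 0.8]])
print(hard_negative_ranking_loss(scores))                 # 0: margins already satisfied
```

When the positive scores exceed all negatives by at least the margin α, the loss is zero; otherwise the hardest violating pair in each direction contributes to the gradient.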

Hashing based Efficient Inference module
Specifically, the input of the HEI module is the fragment-level feature of a datapoint, i.e., the region-level feature V_i for an image modal input x_i or the word-level feature T_j for a text modal input y_j. We first aggregate the fragment-level features into an instance-level feature v̄_i (t̄_j) by a weighted pooling, where w_v and w_t denote the learnable pooling parameters. Then, by forwarding the instance-level feature v̄_i (t̄_j) into the image (text) modal hashing layer, the hash codes are obtained:

b_i^v = sgn(H_x(v̄_i; Θ_v)),  b_j^t = sgn(H_y(t̄_j; Θ_t)),

where H_x(·; Θ_v) denotes the image modal hashing layer and Θ_v denotes the set of parameters in the image hashing layer; k is the length of the hash codes; H_y(·; Θ_t) represents the text modal hashing layer and Θ_t represents the set of parameters in the text hashing layer; sgn(·) is an element-wise sign function, which returns 1 if the element is positive and −1 otherwise. Furthermore, the core of the hashing based efficient inference module is to learn the two modality-specific hashing layers to map datapoints into binary hash codes, which are used to select a few candidate datapoints from the retrieval set for a query. To achieve this goal, the learned hash codes should coarsely preserve the heterologous matching relationship between datapoints, i.e., if two datapoints match, the Hamming distance between their binary hash codes should be small; otherwise it should be large.
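A minimal sketch of one modality-specific hashing layer follows. The attention-style weighted pooling shown here is an assumption (the paper only states that w_v and w_t are learnable pooling parameters), and the hard sign is used only at inference while tanh is used during training:

```python
import torch
import torch.nn as nn

class HashingLayer(nn.Module):
    """One modality-specific hashing layer: pool fragments, then map to k bits."""
    def __init__(self, d=1024, k=64):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d))   # pooling weight (w_v or w_t); assumed form
        self.fc = nn.Linear(d, k)               # the fully-connected hashing layer

    def forward(self, frags, binary=False):     # frags: (batch, m, d) fragment features
        attn = torch.softmax(frags @ self.w, dim=1)          # (batch, m) pooling weights
        pooled = (attn.unsqueeze(-1) * frags).sum(dim=1)     # instance-level feature
        out = self.fc(pooled)
        # tanh relaxation during training; hard sign (+1/-1 codes) at inference
        return torch.sign(out) if binary else torch.tanh(out)

hash_img = HashingLayer()
codes = hash_img(torch.randn(4, 36, 1024), binary=True)
print(codes.shape)                              # torch.Size([4, 64])
```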
The Hamming distance between b_i^v and b_j^t can be defined as:

dist_H(b_i^v, b_j^t) = (k − b_i^{vT} b_j^t) / 2,

where k denotes the code length. This means that when (1/k) b_i^{vT} b_j^t is close to 1, the Hamming distance is close to 0, and when it is close to −1, the Hamming distance is close to k. Thus, (1/k) b_i^{vT} b_j^t can be used as the Hamming similarity between b_i^v and b_j^t, measuring the heterologous matching relationship preserved by b_i^v and b_j^t. Furthermore, as the matching score s_ij ∈ [0, 1] of image x_i and text y_j computed by the cross-attention mechanism preserves the heterologous matching relationship to a certain extent, we can use it as a soft target to supervise the learning of the similarity between the hash codes of mismatching data pairs. Based on these observations, we propose a matching score based hashing loss:

L_1 = Σ_i [ Σ_{j ∈ N_i^+} (1 − (1/k) b_i^{vT} b_j^t) + Σ_{j ∈ N_i^−} max(0, (1/k) b_i^{vT} b_j^t − s_ij) ],

where N_i^+ denotes the set of texts matching the image x_i, N_i^− denotes the set of texts not matching the image x_i, and s_ij denotes the matching score between the image x_i and the text y_j. The first item of L_1 makes (1/k) b_i^{vT} b_j^t close to 1, i.e., it makes the Hamming distance between the hash codes of a matching datapair small. The second item of L_1 penalizes a mismatching datapair whose Hamming similarity is larger than its matching score s_ij, i.e., it makes the Hamming distance between their hash codes large. Thus, by minimizing the hashing loss L_1, the learned binary hash codes can coarsely preserve the heterologous matching relationship.
Furthermore, as the sgn(·) function is non-differentiable at zero and its derivative is zero for any non-zero input, the parameters of the hashing model cannot be updated by back-propagation when minimizing the hashing loss L_1. Thus, we discard the sgn(·) function so that the parameters of our hashing model can be updated, and use tanh(·) to approximate sgn(·) so that each element of the hashing layer output is close to "+1" or "-1". The final hashing loss function can then be formulated as follows:

L = Σ_i [ Σ_{j ∈ N_i^+} (1 − (1/k) h_i^{vT} h_j^t) + Σ_{j ∈ N_i^−} max(0, (1/k) h_i^{vT} h_j^t − s_ij) ],

where h_i^v = tanh(H_x(v̄_i; Θ_v)) and h_j^t = tanh(H_y(t̄_j; Θ_t)) are the relaxed hash codes.
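The relaxed loss can be sketched as below; the batch layout (a boolean match matrix plus a score matrix) is our own illustrative convention:

```python
import torch

def hashing_loss(h_v, h_t, match, s):
    """h_v: (n, k) relaxed image codes; h_t: (n, k) relaxed text codes;
    match: (n, n) bool, True for matching pairs; s: (n, n) matching scores in [0, 1]."""
    k = h_v.size(1)
    sim = (h_v @ h_t.t()) / k                    # (1/k) h_v^T h_t, in [-1, 1]
    pos = (1 - sim)[match].sum()                 # pull matching pairs toward sim = 1
    # penalize mismatching pairs whose code similarity exceeds their matching score
    neg = (sim - s).clamp(min=0)[~match].sum()
    return pos + neg

h_v = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
h_t = torch.tensor([[1.0, 1.0], [1.0, -1.0]])
match = torch.eye(2, dtype=torch.bool)
s = torch.full((2, 2), 0.1)
print(hashing_loss(h_v, h_t, match, s))          # 0: codes already consistent
```

In the toy example the matching pairs have similarity 1 and the mismatching pairs have similarity 0, below their soft target of 0.1, so neither term contributes.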

Inference
After the image-text matching model and the HEI module are trained, we generate the hash codes of all datapoints in the retrieval set using the HEI module. Given a query image x_q (text y_q), we also use the HEI module to map it into a hash code b_q^v (b_q^t), and calculate the Hamming distances between the query code and the codes of all texts (images) in the retrieval set. Then, we sort the texts (images) in the retrieval set in ascending order of Hamming distance, and select a few of the top texts (images) as the candidate set. Finally, we only need to perform the fine-grained matching within the candidate set to find the matching datapoints.
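The two-stage inference described above can be sketched as follows, with a placeholder `fine_score` callback standing in for the expensive cross-attention model; all names here are illustrative:

```python
import numpy as np

def coarse_to_fine_retrieve(q_code, db_codes, fine_score, top=100):
    """Stage 1: rank by Hamming distance; Stage 2: fine-rank only the candidates."""
    # Hamming distance via the inner product: dist = (k - q^T b) / 2 for +/-1 codes
    k = len(q_code)
    dists = (k - db_codes @ q_code) / 2
    candidates = np.argsort(dists)[:top]          # a few closest-by-code datapoints
    # expensive cross-attention scoring runs on the small candidate set only
    scores = np.array([fine_score(i) for i in candidates])
    return candidates[np.argsort(-scores)]        # candidates, fine-ranked best-first

rng = np.random.default_rng(0)
db = rng.choice([-1, 1], size=(5000, 64)).astype(np.float64)
q = db[7].copy()                                  # query matching database item 7
ranked = coarse_to_fine_retrieve(q, db, fine_score=lambda i: 1.0 if i == 7 else 0.0,
                                 top=10)
print(ranked[0])                                  # item 7 survives both stages
```

The fine model is invoked only `top` times per query instead of once per database item, which is where the speed-up comes from.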

Datasets
We evaluate the performance of the proposed HEI module on two widely used public datasets: Flickr30K (Plummer et al., 2015) and MS-COCO (Lin et al., 2014). Specifically, Flickr30K contains 31,783 images collected from the Flickr website, each accompanied by five human-annotated sentence descriptions. Following the setting of previous works, this dataset is split into 29,000 images for training, 1,000 images for validation, and 1,000 images for testing. We report the image-text retrieval performance on the 1,000 testing images. MS-COCO is another large-scale image-caption benchmark, which consists of 123,287 images, each also roughly annotated with five sentence descriptions. Following the widely used split (Karpathy et al., 2014; Chen et al., 2020), we use 113,287 images for training, 1,000 images for validation, and 5,000 images for testing. Moreover, we evaluate our method on both the 5 folds of 1K test images and the full 5K test images for MS-COCO.

Evaluation
Following the setting in (Chen et al., 2020), we evaluate the performance of our proposed approach by reporting Recall@K (K = 1, 5, 10) for the bi-directional matching tasks, i.e., matching texts given an image query (Text Retrieval) and matching images given a text query (Image Retrieval). Recall@K computes the proportion of queries for which a correct image or text is retrieved among the top K results. In addition, we also record the inference time in seconds to evaluate the efficiency of our proposed HEI.
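Recall@K over ranked result lists can be computed as below; this is a minimal illustration with toy data, not the paper's evaluation script:

```python
import numpy as np

def recall_at_k(rankings, gt, k):
    """rankings: (num_queries, num_items) item indices sorted best-first;
    gt[i] is the index of the correct item for query i. Returns a percentage."""
    hits = sum(1 for i, ranked in enumerate(rankings) if gt[i] in ranked[:k])
    return 100.0 * hits / len(rankings)

# toy example: 3 queries over 5 items
rankings = np.array([[0, 1, 2, 3, 4],
                     [3, 2, 4, 0, 1],
                     [4, 0, 1, 2, 3]])
gt = [0, 2, 2]
print(recall_at_k(rankings, gt, 1))   # only query 0 hits at rank 1
```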

Baselines
To evaluate the performance of our proposed HEI, several state-of-the-art attention mechanism based image-text matching methods are selected as our baselines, including BFAN, CAMP, IMRAM (Chen et al., 2020), and GSMN. It should be noted that the proposed HEI focuses on exploring a novel and efficient hashing based inference module that can be universally plugged into existing attention mechanism based image-text matching methods to speed up inference, rather than redesigning a new cross-modal attention mechanism to improve their matching performance.

Implementation Details
All our experiments are implemented in PyTorch and conducted on an NVIDIA Tesla V100 GPU. For the visual modality, the number of regions in each image is m = 36, and the dimensionality d of the fragment-level features and of the GRU hidden states is set as 1024. The length of the hash codes is set as 64.
In the training phase, we first train the base cross-modal attention module for 20 epochs, then train the HEI module jointly. We adopt SGD with a mini-batch size of 128 and a learning rate between 10^-2 and 10^-3 to optimize the HEI module. The optimization algorithm for the base cross-modal attention module is the same as the one defined in the original method; for example, when plugging the HEI module into GSMN, the optimization algorithm for the cross-modal attention module is Adam.

Main results
We conduct extensive experiments on Flickr30K and MS-COCO. The image-text matching results on Flickr30K and on MS-COCO with 1K and 5K test points are shown in Tables 1, 2, and 3, respectively. "Method"-HEI denotes the method using the proposed HEI module; for example, BFAN-HEI means plugging HEI into BFAN to speed up inference. Similarly, "method"-random denotes randomly selecting 50% of the datapoints from the retrieval set as the candidate set to speed up inference for that method.
Based on the results shown in these tables, the following observations can be made: (1) Our proposed HEI module greatly improves the matching efficiency of all four baselines almost without reducing the matching performance, and even slightly improves the performance of some baselines. For example, as shown in Table 1, comparing GSMN-HEI with GSMN, GSMN-HEI achieves an increase of 0.1% on the R@1 metric in the text retrieval task while greatly reducing the inference time from 518.32 seconds to 99.30 seconds. The reason why plugging in the HEI module can slightly improve performance may be that, for some queries, there are false positive datapoints which can mislead the image-text matching model; however, the Hamming distances between the hash codes of these queries and those of the false positive datapoints are large, i.e., the false positive datapoints are not selected as candidate points. Thus, free of the effect of the false positive datapoints, the image-text matching model can find the matching points successfully, improving retrieval performance. (2) The proposed HEI module maps datapoints into hash codes that coarsely preserve the original heterologous matching relationship. For instance, as shown in Tables 1, 2, and 3, all methods with the proposed HEI module achieve not only better performance than the methods with randomly selected candidates, but also lower inference time. This means that the candidate set selected by our proposed HEI module is smaller, yet the probability that it contains the matching datapoint is higher.

Ablation study
To further investigate the impact of the length of the hash codes, we construct three variants of the HEI module with code lengths of 16, 32, and 128 bits, evaluated with two baselines on Flickr30K. The results are shown in Table 4.

Figure 2: R@1 and inference time on MS-COCO (5K) in the text retrieval and image retrieval tasks, respectively. In each figure, the X axis denotes the percentage of points in the retrieval set selected as the candidate set; the left Y axis (red line) is the R@1 value, and the right Y axis (blue line) is the inference time in seconds.

From the results in Table 4, it can be found that: (1) The length of the hash codes rarely influences the inference time: each baseline consumes nearly the same inference time across the different HEI code lengths. This is because the 'XOR' operation between hash codes is far faster than the cross-modal attention mechanism. Thus, it implicitly demonstrates the feasibility of speeding up the inference of the baselines by using the proposed HEI to quickly select the candidate set.
(2) The matching performance first increases as the hash code length varies from 16 to 64, and then tends to be stable as the length varies from 64 to 128. Thus, in the other experiments, the hash code length is set as 64.

Efficiency and performance
We also conduct experiments to further investigate the trade-off between inference efficiency and matching performance.
As shown in Figure 2, with the size of the candidate set increasing, the matching performance of BFAN-HEI (the red lines) increases rapidly and then tends to be stable, while BFAN-HEI consumes more inference time (the blue lines). It can be found that when selecting only 20% of the datapoints in the retrieval set as the candidate set with the proposed HEI module, BFAN-HEI already achieves its best performance while greatly reducing the inference time. This demonstrates the effectiveness of our proposed HEI module.

Figure 3: Comparison of the inference time of BFAN and BFAN-HEI on large retrieval sets, under the condition that BFAN-HEI achieves the same performance as BFAN.

Scalability for the large retrieval set
To further investigate the scalability of the proposed HEI module to large retrieval sets, in the experiments on MS-COCO (1K) with the BFAN baseline, we directly use the training data to expand the volume of the retrieval set. The curves of inference time w.r.t. the volume of the retrieval set are shown in Figure 3. It can be found that as the volume of the retrieval set increases, our proposed HEI module can still speed up inference without reducing the matching performance, which demonstrates the scalability of our proposed HEI module to large retrieval sets.

Conclusion
In this paper, we have proposed a novel Hashing based Efficient Inference module, called HEI, which can be plugged into existing image-text matching methods to speed up the inference step without reducing the matching performance. Extensive experiments on two widely used benchmarks, MS-COCO and Flickr30k, with four baseline methods demonstrate the efficiency and effectiveness of our proposed HEI module.