Exploring Compositional Image Retrieval with Hybrid Compositional Learning and Heuristic Negative Mining



Introduction
In this paper, we explore the task of compositional image retrieval (CIR). As shown in Figure 1, CIR aims at retrieving a target image that differs slightly from a reference image, where the difference is described by a modification text. The key to CIR is to learn the cross-modal composition process from (reference image, modification text) pairs to target images. In the existing CIR models, this is usually realized by fusing and matching image representations obtained from pre-trained vision models with text representations obtained from pre-trained language models (Perez et al., 2018; Vo et al., 2019; Chen and Bazzani, 2020; Dodds et al., 2020; Lee et al., 2021; Kim et al., 2021a; Wen et al., 2021; Anwaar et al., 2021). However, these models are pre-trained on uni-modal data, which implies that the visual concepts embodied in image representations are not aligned with the semantic concepts embodied in text representations. As a result, applying these models to CIR yields limited benefit, which necessitates the application of pre-trained vision-and-language (V&L) models.
CLIP (Radford et al., 2021), a recently proposed V&L model pre-trained on 400M image-text pairs, has exhibited strong zero-shot performance on image classification. Empirical studies (Kim et al., 2021b; Shen et al., 2022) showed that it is suboptimal to apply CLIP in a zero-shot manner to complex V&L tasks requiring cross-modal reasoning, such as visual question answering (VQA), visual entailment, and V&L navigation. Meanwhile, Shen et al. (2022) also showed that state-of-the-art (SOTA) performance can be achieved on these tasks by integrating and fine-tuning CLIP. Due to the requirement for cross-modal compositionality, we believe that CIR is as complex as VQA. Therefore, we use CLIP as the backbone of our proposed CIR model, and fine-tune it together with the rest of the model components. There are indeed other pre-trained V&L models than CLIP, such as ALBEF and BLIP. These models use a single encoder to encode image-text combinations through cross-modal attention mechanisms, which is computationally expensive during retrieval if there are many candidate images. In contrast, CLIP uses two encoders to separately encode images and texts, so that we can encode all candidate images in advance and calculate matching scores as simple dot products, which is applicable to large-scale retrieval tasks.
Based on CLIP, we propose a novel CIR model named HyCoLe-HNM, which features hybrid compositional learning and heuristic negative mining. On the one hand, unlike the existing CIR models, which mostly focus on image compositional learning, we propose a hybrid compositional learning mechanism, which includes both image compositional learning and text compositional learning. Specifically, we not only learn the compositional matching between reference images and target images, but also utilize image-related texts as privileged information to learn the compositional matching between reference texts and target texts. On the other hand, to facilitate the contrastive optimization of hybrid compositional learning, we also propose a heuristic negative mining method to filter negative samples so that only the negative samples relevant to the positive ones are retained. Specifically, we enforce a heuristic rule to identify relevant negative samples, and thereby reduce the space complexity of negative samples from $O(N^3)$ to $O(N^2)$. Compared with hard negative mining methods, the heuristic negative mining method is not only more efficient, but also achieves better performance in the ablation experiments.
For optimal performance, when implementing HyCoLe-HNM, we innovatively integrate some approaches originally aimed at other tasks. Specifically, following the contrastive pre-training method of CLIP, we utilize the above-mentioned privileged information to perform cross-modal representation learning. Besides, we also borrow a gated fusion mechanism from a question answering (QA) model to perform compositional fusion. Experimental results show that by applying these approaches to HyCoLe-HNM, we achieve SOTA performance on multiple CIR datasets.

Model
In this section, we propose our CIR model HyCoLe-HNM. First, we provide a task definition of CIR. Then, we present a cross-modal representation learning method. Next, we propose a hybrid compositional learning mechanism and a heuristic negative mining method, from which HyCoLe-HNM takes its name. Finally, we describe the training and inference of HyCoLe-HNM.

Task Definition
CIR is the task of retrieving a target image $t$ from a set of candidate images according to a reference image $r$ and a modification text $m$, where $m$ describes the change from $r$ to $t$. We assume that a reference text $\bar{r}$ and a target text $\bar{t}$, which separately embody the semantics of $r$ and $t$, are provided as privileged information for training. For example, $\bar{r}$ and $\bar{t}$ can be image-related captions. However, such information is not available for inference.

Cross-Modal Representation Learning
To map images and texts into a joint representation space, we utilize privileged information in the form of image-related texts to jointly train an image encoder and a text encoder through cross-modal representation learning. As shown in Figure 2a, we use an image encoder to encode reference images and target images, and use a text encoder to encode reference texts and target texts. To benefit from V&L pre-training, we use CLIP as the backbone of both encoders. Specifically, we use the image part of CLIP, which is a vision transformer (ViT) (Dosovitskiy et al., 2020), as the backbone of the image encoder, and use the text part of CLIP, which is a GPT-like (Radford et al., 2018) language model, as the backbone of the text encoder. Besides, we also add a linear projection layer after the backbone of each encoder, and apply L2 normalization to the output of each linear projection layer. The output dimensionality of both linear projection layers is d, which is the dimensionality of the joint representation space.
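The projection-and-normalize step after each backbone is what makes dot products between encoder outputs equal cosine similarities. The following PyTorch sketch illustrates it; the class name and dimensionalities are ours, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Linear projection into the joint representation space followed by
    L2 normalization, as added after each CLIP backbone. The class name
    and dimensionalities are illustrative."""

    def __init__(self, backbone_dim: int, d: int):
        super().__init__()
        self.proj = nn.Linear(backbone_dim, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # After normalization, dot products between any two outputs
        # equal their cosine similarities.
        return F.normalize(self.proj(x), p=2, dim=-1)
```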
To learn the joint representation space, we adopt the InfoNCE loss (Oord et al., 2018) used in the contrastive pre-training of CLIP, and apply it to both the reference side and the target side. Specifically, given a mini-batch of $N$ reference image-text pairs $\{(r_1, \bar{r}_1), \dots, (r_N, \bar{r}_N)\}$, we treat them as positive samples, and generate $N^2 - N$ negative samples by replacing the text $\bar{r}_i$ in each positive sample $(r_i, \bar{r}_i)$ separately with the other $N - 1$ texts $\{\bar{r}_1, \dots, \bar{r}_N\} - \{\bar{r}_i\}$. For each of the positive samples and negative samples, we calculate the cosine similarity between the image representation and the text representation, and thereby construct a reference image-text matching (RITM) similarity matrix $S_{RITM} \in \mathbb{R}^{N \times N}$, where the element at the $i$-th row and the $j$-th column corresponds to the sample $(r_i, \bar{r}_j)$. Analogously, given a mini-batch of $N$ target image-text pairs $\{(t_1, \bar{t}_1), \dots, (t_N, \bar{t}_N)\}$, we construct a target image-text matching (TITM) similarity matrix $S_{TITM} \in \mathbb{R}^{N \times N}$. Obviously, the diagonal elements in the two matrices correspond to the positive samples, while the off-diagonal elements correspond to the negative samples. On this basis, we minimize the following RITM loss $\mathcal{L}_{RITM}$ and TITM loss $\mathcal{L}_{TITM}$ so that in the learned joint representation space, an image and a text are close to each other if they are paired, and apart from each other if not:

$$\mathcal{L}_{RITM} = -\frac{1}{2N}\,\mathrm{tr}\Big(\log\mathrm{softmax}\big(S_{RITM}/\tau_{RITM}\big) + \log\mathrm{softmax}\big(S_{RITM}^{\top}/\tau_{RITM}\big)\Big), \quad \mathcal{L}_{TITM} = -\frac{1}{2N}\,\mathrm{tr}\Big(\log\mathrm{softmax}\big(S_{TITM}/\tau_{TITM}\big) + \log\mathrm{softmax}\big(S_{TITM}^{\top}/\tau_{TITM}\big)\Big) \quad (1)$$

where $\tau_{RITM}$ and $\tau_{TITM}$ are trainable temperatures, $\mathrm{tr}(\cdot)$ denotes calculating the matrix trace, and $\mathrm{softmax}(\cdot)$ is calculated along each row.
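This objective can be sketched as a symmetric InfoNCE loss over a similarity matrix whose diagonal holds the positive pairs. The PyTorch snippet below is an illustrative implementation under our reading of the loss; the symmetrization over rows and columns follows CLIP, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def infonce_loss(sim: torch.Tensor, temperature: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over an N x N similarity matrix whose
    diagonal entries are the positive pairs (a sketch of L_RITM / L_TITM):
    cross-entropy with the diagonal as target, over rows and columns."""
    n = sim.size(0)
    logits = sim / temperature
    targets = torch.arange(n, device=sim.device)
    # The row-wise term matches each image to its text; the transposed
    # term matches each text to its image.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```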

Hybrid Compositional Learning and Heuristic Negative Mining
The existing CIR models mostly focus on image compositional learning, which is to learn the compositional matching between reference images and target images conditioned on modification texts. In our proposed CIR model, besides image compositional learning, we also utilize privileged information in the form of image-related texts to perform text compositional learning, which is to analogously learn the compositional matching between reference texts and target texts, and thus name this mechanism hybrid compositional learning. As shown in Figure 2b, based on cross-modal representation learning, we use a fusion module to fuse modification text representations separately into reference image representations and reference text representations, which can be seen as compositional fusion. To implement the fusion module, we borrow the following gated fusion mechanism from Wang et al. (2018), which was originally proposed to address the task of QA, and apply L2 normalization to its output:

$$g = \sigma\big(W_g[x; y] + b_g\big), \quad h = \tanh\big(W_h[x; y] + b_h\big), \quad f(x, y) = \mathrm{norm}\big(g \odot h + (1 - g) \odot x\big)$$

where $W_g$ and $W_h$ are trainable weight matrices, $b_g$ and $b_h$ are trainable bias vectors, $f(x, y)$ denotes fusing $y$ into $x$, $\odot$ denotes element-wise multiplication, $\mathrm{norm}(\cdot)$ denotes L2 normalization, and $[\,;\,]$ denotes vector concatenation.
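The gated fusion module can be sketched in PyTorch as follows. This is our reading of the mechanism from Wang et al. (2018): a sigmoid gate, a tanh candidate, a gated combination with the original input, and a final L2 normalization; the exact gate form and class name are assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Gated fusion f(x, y) in the spirit of Wang et al. (2018): a gate g
    decides, per dimension, how much of the fused candidate h replaces
    the original representation x. The exact form is our assumption."""

    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)       # W_g, b_g
        self.candidate = nn.Linear(2 * d, d)  # W_h, b_h

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        xy = torch.cat([x, y], dim=-1)        # [x; y]
        g = torch.sigmoid(self.gate(xy))
        h = torch.tanh(self.candidate(xy))
        # Gated combination, then L2 normalization of the output.
        return F.normalize(g * h + (1.0 - g) * x, p=2, dim=-1)
```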
As in cross-modal representation learning, we also adopt the InfoNCE loss in hybrid compositional learning. Specifically, in image compositional learning, given a mini-batch of $N$ (reference image, modification text, target image) triples $\{(r_1, m_1, t_1), \dots, (r_N, m_N, t_N)\}$, we treat them as positive samples, and generate $N^3 - N$ negative samples by enumerating the other possible triples in $\{(r_i, m_j, t_k) \mid 1 \le i, j, k \le N\}$. However, most of these negative samples are easy negatives, which are irrelevant to the positive samples and thus have little effect on the contrastive optimization. Therefore, we filter these negative samples to only retain the hard negatives, which are relevant to the positive samples. Instead of applying hard negative mining methods, we propose a more efficient heuristic negative mining method, which is to identify relevant negative samples by enforcing a heuristic rule: a negative sample is relevant if and only if it differs from a positive sample in exactly one of the reference image, the modification text, and the target image. As shown in Figure 3, we implement this rule as the following three operations, which reduce the space complexity of negative samples from $O(N^3)$ to $O(N^2)$:
• Reference-based negative mining. For each positive sample, we select the $N - 1$ negative samples that only differ in the reference image. This operation yields $N^2 - N$ relevant negative samples in total.
• Modification-based negative mining. For each positive sample, we select the $N - 1$ negative samples that only differ in the modification text. This operation yields $N^2 - N$ relevant negative samples in total.
• Target-based negative mining. For each positive sample, we select the $N - 1$ negative samples that only differ in the target image. This operation yields $N^2 - N$ relevant negative samples in total.
For each of the positive samples and relevant negative samples, we first fuse the modification text representation into the reference image representation, and then calculate the cosine similarity between the fusion result and the target image representation. In this way, for the above three operations, we construct three image compositional matching (ICM) similarity matrices $S_{R\text{-}ICM} \in \mathbb{R}^{N \times N}$, $S_{M\text{-}ICM} \in \mathbb{R}^{N \times N}$, and $S_{T\text{-}ICM} \in \mathbb{R}^{N \times N}$, where the elements at the $i$-th row and the $j$-th column separately correspond to the samples $(r_j, m_i, t_i)$, $(r_i, m_j, t_i)$, and $(r_i, m_i, t_j)$. Obviously, the diagonal elements in the three matrices correspond to the positive samples, while the off-diagonal elements correspond to the relevant negative samples. On this basis, we minimize the following ICM loss $\mathcal{L}_{ICM}$ so that the compositional matching between a reference image and a target image conditioned on a modification text is promoted if the modification text reflects the change from the reference image to the target image, and suppressed if not:

$$\mathcal{L}_{ICM} = -\frac{1}{3N}\,\mathrm{tr}\Big(\log\mathrm{softmax}\big(S_{R\text{-}ICM}/\tau_{ICM}\big) + \log\mathrm{softmax}\big(S_{M\text{-}ICM}/\tau_{ICM}\big) + \log\mathrm{softmax}\big(S_{T\text{-}ICM}/\tau_{ICM}\big)\Big) \quad (2)$$

where $\tau_{ICM}$ is a trainable temperature.
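The three similarity matrices produced by heuristic negative mining can be built directly from batched representations. The sketch below assumes L2-normalized reference, modification, and target representations `r`, `m`, `t` of shape (N, d) and an arbitrary fusion function `fuse`; the function name and broadcasting layout are ours.

```python
import torch

def icm_similarity_matrices(fuse, r, m, t):
    """Build S_R-ICM, S_M-ICM, and S_T-ICM for a batch of N triples.
    Element (i, j) corresponds to (r_j, m_i, t_i), (r_i, m_j, t_i), and
    (r_i, m_i, t_j) respectively, so positives lie on the diagonals."""
    n, d = r.shape
    # f(r_j, m_i) at position (i, j), compared with t_i along each row.
    f_rm = fuse(r.unsqueeze(0).expand(n, n, d), m.unsqueeze(1).expand(n, n, d))
    s_r = (f_rm * t.unsqueeze(1)).sum(-1)
    # f(r_i, m_j) at position (i, j), compared with t_i along each row.
    f_mr = fuse(r.unsqueeze(1).expand(n, n, d), m.unsqueeze(0).expand(n, n, d))
    s_m = (f_mr * t.unsqueeze(1)).sum(-1)
    # f(r_i, m_i) compared with every candidate target t_j.
    s_t = fuse(r, m) @ t.t()
    return s_r, s_m, s_t
```

Because all three matrices share the positive triples $(r_i, m_i, t_i)$ on their diagonals, their diagonals coincide.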
Analogously, in text compositional learning, given a mini-batch of $N$ (reference text, modification text, target text) triples $\{(\bar{r}_1, m_1, \bar{t}_1), \dots, (\bar{r}_N, m_N, \bar{t}_N)\}$, we apply the same method as in image compositional learning to construct three text compositional matching (TCM) similarity matrices $S_{R\text{-}TCM} \in \mathbb{R}^{N \times N}$, $S_{M\text{-}TCM} \in \mathbb{R}^{N \times N}$, and $S_{T\text{-}TCM} \in \mathbb{R}^{N \times N}$, and thereby minimize the following TCM loss $\mathcal{L}_{TCM}$:

$$\mathcal{L}_{TCM} = -\frac{1}{3N}\,\mathrm{tr}\Big(\log\mathrm{softmax}\big(S_{R\text{-}TCM}/\tau_{TCM}\big) + \log\mathrm{softmax}\big(S_{M\text{-}TCM}/\tau_{TCM}\big) + \log\mathrm{softmax}\big(S_{T\text{-}TCM}/\tau_{TCM}\big)\Big) \quad (3)$$

where $\tau_{TCM}$ is a trainable temperature.
Since our proposed CIR model features hybrid compositional learning and heuristic negative mining, we name it HyCoLe-HNM.

Training and Inference
To train HyCoLe-HNM, we minimize the following total loss $\mathcal{L}$ through gradient descent:

$$\mathcal{L} = \mathcal{L}_{ICM} + \alpha\,\mathcal{L}_{TCM} + \beta\,\big(\mathcal{L}_{RITM} + \mathcal{L}_{TITM}\big) \quad (4)$$

where the loss scaling factors $\alpha$ and $\beta$ are hyper-parameters. To effectively fine-tune CLIP, we set the backbone learning rate as the product of the global learning rate and a backbone activity ratio $\gamma$, which is another hyper-parameter. From a knowledge perspective, $\gamma$ controls the trade-off between the knowledge transferred from CLIP and that embodied in the training data. For inference, we encode all candidate images in advance and cache the resulting representations. On this basis, for each given (reference image, modification text) pair, we perform compositional fusion, calculate the cosine similarity between the fusion result and each candidate image representation as the matching score, and thereby rank all candidate images according to the resulting matching scores.
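At inference time, retrieval reduces to one compositional fusion and one matrix-vector product against the cached candidate representations. A minimal sketch (function name ours; both inputs assumed L2-normalized):

```python
import torch

def rank_candidates(fused_query: torch.Tensor,
                    cached_candidates: torch.Tensor) -> torch.Tensor:
    """Rank cached candidate-image representations (num_candidates x d)
    for a single fused (reference image, modification text) query (d,).
    With L2-normalized inputs, the dot product is exactly the
    cosine-similarity matching score."""
    scores = cached_candidates @ fused_query       # one score per candidate
    return torch.argsort(scores, descending=True)  # indices, best first
```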
Related Work

Image Retrieval
Given a query, image retrieval methods retrieve the images most similar to the query from an image database. In real-life scenarios, users may use different types of queries to search for an image. Conventional image retrieval methods are based on the assumption that the input query is of a single type or modality. Some examples include queries of type image (Dubey, 2021; Liu et al., 2016), text (Tan et al., 2019; Messina et al., 2021), attribute, and sketch (Sangkloy et al., 2016; Radenovic et al., 2018; Sain et al., 2021).

Compositional Learning
The main idea behind compositional learning is to develop a complex concept by combining multiple primitive concepts (Misra et al., 2017). Compositional learning is widely explored in different cross-modal tasks, such as image captioning (Zhou et al., 2020) and VQA (Antol et al., 2015). Recently, CIR has gained a lot of attention, more specifically for fashion product search. Augmenting an image query with an additional modification text input for image retrieval has been the main line of work in this area. TIRG (Vo et al., 2019) applies compositional learning to image retrieval, using a residual gating mechanism to fuse image and text representations. To compose the vision and language content, VAL (Gu et al., 2021) plugs composite transformers into convolution layers at different depths of the network. MAAF (Dodds et al., 2020) concatenates image and text tokens and passes them into a Transformer-encoder-like architecture. Hosseinzadeh and Wang (2020) apply self-attention to image and text representations independently and use cross-attention fusion between the two representations. To change the image content and style based on the modification text, CoSMo (Lee et al., 2021) applies a content modulator (CM) and a style modulator (SM) to the reference image.
JVSM (Chen and Bazzani, 2020) jointly learns image-text representations as well as compositional representations in a unified embedding space using a multi-task learning framework. Similar to our method, privileged information is used at training time. However, unlike our method, which is based on both cross-modal (image-text) and uni-modal (text-text) compositional learning, JVSM only uses cross-modal compositionality at training time. Although cross-modal compositional learning plays the main role in the performance of the proposed method, we show that language compositionality further improves the results.

Vision-and-Language Pre-Training
The recent success of Transformer-based language model pre-training (Lan et al., 2019;Clark et al., 2019) has inspired vision-and-language (V&L) pretraining in different tasks, such as VQA, image captioning, visual commonsense reasoning and image retrieval (Chen et al., 2020b;Li et al., 2020;Radford et al., 2021). The main objective of V&L pre-training is to construct a cross-modal representation space to help improve the generalizability and sample efficiency of downstream tasks by training on large-scale image-text datasets.
V&L pre-training has also been applied to CIR. CIRPLANT uses the pre-trained V&L model OSCAR (Li et al., 2020) as the composition module. The method achieves SOTA performance on the CIRR dataset created by its authors. However, its performance on FashionIQ is sub-optimal, apparently due to the domain shift between the pre-training dataset and FashionIQ.
Recently, CLIP (Radford et al., 2021) has been proposed to learn visual concepts with language supervision. It follows a late fusion design where image and text representations, encoded by independent image and text encoders, are learned using a contrastive loss. Due to the success of CLIP in different V&L tasks (Shen et al., 2022), we employ pre-trained CLIP as a backbone model for the proposed method. Experimental results show that the proposed CLIP-based text-guided image retrieval method achieves SOTA performance on different datasets.

Learning with Privileged Information
Privileged information refers to information that is available at training time but not at test time. The paradigm of learning with privileged information was first formulated by Vapnik and Vashist (2009). Privileged information has been used in different tasks, such as object detection (Hoffman et al., 2016; Mordan et al., 2018), semantic segmentation (Lee et al., 2018), and image super-resolution (Lee et al., 2020), to train a stronger model. Recently, side information in the form of attributes or image captions has been used to improve the performance of image retrieval methods (Chen and Bazzani, 2020). Similar to these methods, we use the attributes provided for each image as privileged information.

Datasets
To verify the effectiveness of HyCoLe-HNM, we conduct experiments on three CIR datasets, namely FashionIQ, Fashion200K (Han et al., 2017), and MIT-States (Isola et al., 2015). We pre-process these datasets into a unified format, where each data sample consists of a reference image, a reference text, a target image, a target text, and a modification text (refer to Appendix A for the data statistics and examples of each dataset). Besides, we also adopt recall-at-K (R@K) as the unified evaluation metric on these datasets, which is the percentage of data samples whose target image appears in the top-K retrieved images.
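The R@K metric can be sketched as follows (a minimal illustration; the data layout and function name are ours):

```python
def recall_at_k(rankings, targets, k):
    """Recall@K: the percentage of samples whose target image appears
    among the top-K retrieved images. `rankings[i]` is the ranked list
    of candidate ids for sample i, and `targets[i]` is its target id."""
    hits = sum(target in ranking[:k]
               for ranking, target in zip(rankings, targets))
    return 100.0 * hits / len(targets)
```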

Implementation Details
We use PyTorch (Paszke et al., 2019) to implement HyCoLe-HNM, use Ray's Tune (Liaw et al., 2018) to perform hyper-parameter optimization, and use HuggingFace's Transformers (Wolf et al., 2019) to load CLIP. We construct HyCoLe-HNM separately with three versions of CLIP, namely CLIP-ViT-B/32, CLIP-ViT-B/16, and CLIP-ViT-L/14. For optimization, we apply an AdamW optimizer (Loshchilov and Hutter, 2019) with an initial learning rate of 0.0001, a weight decay factor of 0.01, and a mini-batch size of 64. The trainable temperatures $\tau_{RITM}$, $\tau_{TITM}$, $\tau_{ICM}$, and $\tau_{TCM}$ are initialized to $e^{-1}$, the loss scaling factors $\alpha$ and $\beta$ are separately set to 0.4 and 0.1, and the backbone activity ratio $\gamma$ is set to 0.001. We optimize the model for 64 epochs on a single NVIDIA V100 GPU, where a cosine schedule is used to anneal the learning rate after 6 warm-up epochs. Besides, to improve efficiency, we also apply mixed precision training and gradient checkpointing. For evaluation, we follow Vo et al. (2019) to group candidate images, and thereby treat candidate images in the same group as identical.
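The reduced backbone learning rate can be implemented with optimizer parameter groups; the sketch below uses the hyper-parameters stated above (the helper name is ours):

```python
import torch

def build_optimizer(backbone_params, new_params,
                    lr=1e-4, gamma=1e-3, weight_decay=0.01):
    """AdamW with the CLIP backbone trained at lr * gamma, where gamma
    is the backbone activity ratio, and the newly added components
    trained at the full global learning rate."""
    return torch.optim.AdamW(
        [{"params": backbone_params, "lr": lr * gamma},  # CLIP backbone
         {"params": new_params, "lr": lr}],              # new components
        lr=lr, weight_decay=weight_decay)
```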
FashionIQ

FashionIQ is a dataset of fashion images, where each (reference image, target image) pair comes with two human-written relative captions, and each image comes with an attribute set. For each pair, we denote the relative captions by $z_1$ and $z_2$, denote the attribute set of the reference image by $\{u_1, \dots, u_p\}$, and denote that of the target image by $\{v_1, \dots, v_q\}$. We generate "$z_1$, $z_2$." as the modification text, generate "is $u_1$, \dots, $u_p$." as the reference text, and generate "is $v_1$, \dots, $v_q$." as the target text. We optimize HyCoLe-HNM on all training samples, and evaluate it on the test samples of each category. As shown in Table 1, HyCoLe-HNM outperforms the existing CIR methods by a large margin. Some retrieval examples are shown in Figure 4a.
Fashion200K

Fashion200K is another dataset of fashion images, which fall into five categories, namely pants, skirts, dresses, tops, and jackets. Similar to FashionIQ, each image in this dataset comes with an attribute set. Following Vo et al. (2019), we traverse all possible image pairs in each category to select (reference image, target image) pairs. Specifically, we select an image pair $(i_1, i_2)$ if the attribute set of $i_1$ differs from that of $i_2$ in only one attribute. In this case, we denote the differing attribute of $i_1$ by $u$, and denote that of $i_2$ by $v$. We generate "is not $u$, is $v$." as the modification text, and generate the reference text and the target text in the same way as in FashionIQ. We optimize HyCoLe-HNM on all training samples, and evaluate it on the test samples provided by Vo et al. (2019). As shown in Table 2, for R@10 and R@50, HyCoLe-HNM outperforms the existing CIR models by a large margin. For R@1, HyCoLe-HNM is comparable with the SOTA CIR models when using the base CLIPs (CLIP-ViT-B/32 and CLIP-ViT-B/16), but much better when using the large CLIP (CLIP-ViT-L/14).
Some retrieval examples are shown in Figure 4b.
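The one-attribute-difference pair selection can be sketched as follows (our reading of the pre-processing; data layout and function name are illustrative):

```python
def select_pairs(attribute_sets):
    """Select (reference, target) index pairs whose attribute sets differ
    in exactly one attribute, and build the "is not u, is v." modification
    text from the differing attributes u (reference) and v (target)."""
    pairs = []
    for i, ref_attrs in enumerate(attribute_sets):
        for j, tgt_attrs in enumerate(attribute_sets):
            if i == j:
                continue
            only_ref = ref_attrs - tgt_attrs  # attributes unique to i_1
            only_tgt = tgt_attrs - ref_attrs  # attributes unique to i_2
            if len(only_ref) == 1 and len(only_tgt) == 1:
                u, v = next(iter(only_ref)), next(iter(only_tgt))
                pairs.append((i, j, f"is not {u}, is {v}."))
    return pairs
```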

MIT-States
MIT-States is a dataset of object images, where each image comes with a noun specifying the object name and an adjective describing the object state. Following Vo et al. (2019), we traverse all possible image pairs to select (reference image, target image) pairs. Specifically, we select an image pair $(i_1, i_2)$ if $i_1$ and $i_2$ have the same noun but different adjectives. In this case, we denote the noun by $o$, denote the adjective of $i_1$ by $u$, and denote that of $i_2$ by $v$. We generate "is not $u$, is $v$." as the modification text, generate "$u$ $o$." as the reference text, and generate "$v$ $o$." as the target text. As shown in Table 3, HyCoLe-HNM outperforms the existing CIR models, where the advantage is more significant when using the large CLIP than when using the base CLIPs. Some retrieval examples are shown in Figure 4c.

Ablation Study

To probe the performance contribution from each design point of HyCoLe-HNM, we conduct the following five ablation experiments. As shown in Table 4, in each ablation experiment, we change the corresponding design point, and report the resulting overall performance on each dataset, which is the average value of the required R@Ks on the test samples of that dataset.
• For the hybrid compositional learning mechanism, which includes both image compositional learning and text compositional learning, we disable text compositional learning by setting the loss scaling factor $\alpha$, which is applied to the TCM loss $\mathcal{L}_{TCM}$, to 0. As a result, we observe a slight performance drop on all datasets.
• For the heuristic negative mining method, which is based on heuristic rules and thus more efficient than hard negative mining methods, we replace it with a hard negative mining method. As a result, we observe a significant performance drop on FashionIQ and Fashion200K, and a slight one on MIT-States.
• For the gated fusion mechanism, which is borrowed from a QA model to implement the fusion module, we replace it with a simple addition operation. As a result, we observe a significant performance drop on all datasets.
• For privileged information, which is in the form of image-related texts and applied to cross-modal representation learning and text compositional learning, we disable its application by setting the loss scaling factors $\alpha$ and $\beta$ to 0, which are separately applied to the TCM loss $\mathcal{L}_{TCM}$ and the sum of the RITM loss $\mathcal{L}_{RITM}$ and the TITM loss $\mathcal{L}_{TITM}$. As a result, we observe a slight performance drop on all datasets.
• For the fine-tuning of CLIP, which is controlled by the backbone activity ratio $\gamma$, we examine two extreme cases. On the one hand, we freeze CLIP by setting $\gamma$ to 0. On the other hand, we make CLIP fully trainable by setting $\gamma$ to 1. As a result, we observe a significant performance drop on all datasets in both cases.

Conclusion

In this paper, we proposed a novel CIR model named HyCoLe-HNM, which features hybrid compositional learning and heuristic negative mining. Experimental results show that HyCoLe-HNM achieves SOTA performance on three CIR datasets, namely FashionIQ, Fashion200K, and MIT-States. In the future, we plan to re-rank the top few candidate images retrieved by HyCoLe-HNM through certain cross-modal attention mechanisms, which we believe can further improve performance.

Limitations
Besides conducting experiments on FashionIQ, Fashion200K, and MIT-States, which all consist of natural images, we also conduct experiments on another CIR dataset, CSS (Vo et al., 2019), which consists of synthetic images. However, on CSS, the performance of HyCoLe-HNM is inferior to that of TIRG (R@1: 67.3% vs. 73.7%). Since the backbone of HyCoLe-HNM is CLIP, which is pre-trained on natural image-text pairs, we conjecture that the reason behind this under-performance is the domain shift between the natural images used to pre-train CLIP and the synthetic images in CSS used to train HyCoLe-HNM.