Learning Relation Alignment for Calibrated Cross-modal Retrieval

Despite the achievements of large-scale multimodal pre-training approaches, cross-modal retrieval, e.g., image-text retrieval, remains a challenging task. To bridge the semantic gap between the two modalities, previous studies mainly focus on word-region alignment at the object level, lacking the matching between the linguistic relations among the words and the visual relations among the regions. The neglect of such relation consistency impairs the contextualized representations of image-text pairs and hinders model performance and interpretability. In this paper, we first propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify relation consistency by measuring the semantic distance between linguistic and visual relations. In response, we present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method that optimizes the ISD and calibrates intra-modal self-attentions from the two modalities mutually via inter-modal alignment. The IAIS regularizer boosts the performance of prevailing models on the Flickr30k and MS COCO datasets by a considerable margin, which demonstrates the superiority of our approach.


Introduction
Cross-modal retrieval, including image-text retrieval, video-text retrieval, etc., has long been an important downstream task in cross-modal representation learning. Image-Text Retrieval (ITR) aims at modeling the similarity of image-text pairs and recalling the most relevant one. It remains quite challenging due to the heterogeneity of the data and the semantic gap between two different modalities. To bridge this gap, neural networks are responsible for learning global representations of images and texts in a joint semantic space and aligning the images and texts with the same semantics (Faghri et al., 2018; Kiros et al., 2014). A straightforward way to enhance the alignment is to enforce local matching between the object-oriented words and the corresponding image regions, and then leverage the object co-occurrence statistics (Zhang et al., 2020a) in the pairs for inference. Previous studies incorporate auxiliary knowledge sources like scene graphs or object tags to explicitly indicate the cross-modal mapping. Other studies try to establish fine-grained interaction on cross-modal attention to reinforce the focus from words to their most relevant regions, and vice versa (Wang et al., 2019; Messina et al., 2020; Lee et al., 2018; Zhang et al., 2020b).

* Corresponding Author
1 Our code is available at https://github.com/lancopku/IAIS

Figure 1: The upper part shows a comparison of previous object-level alignment and our relation-level alignment (↔ denotes alignment). The lower panel gives a bad case of intra-modal self-attention disagreement: the region of "a red shirt" pays considerable attention to the region of the dog, which does not benefit the matching and is inconsistent with the self-attention of the corresponding phrase.
However, such word-region alignment at object level serves only as the basis because it mainly focuses on the local semantics but lacks the matching of global features like the intra-modal relation. The intra-modal relation refers to the correlation of items within a textual or visual sequence. More specifically, given a sentence and an image that describe the same scene and are highly matched, the correlation of the items in the textual sequence should also agree with the correlation of the corresponding items in the visual sequence. But such constraint of relation consistency is neglected in previous works, which hinders performance and interpretability of the models. To corroborate this, we conduct a case study on Flickr30k Entities dataset (Plummer et al., 2015) to probe the agreement of relation-level semantics in pre-trained models like UNITER . We utilize the self-attention distribution as a representation of the intra-modal relations (Clark et al., 2019;Htut et al., 2019;Kovaleva et al., 2019).
As shown in Figure 1, the attention distributions grouped by the annotated objects of the given text and image disagree with each other. Specifically, the attention distribution in the linguistic modality is reasonable. However, in the visual modality, the region of "a red shirt" pays inappropriate attention to the region of the dog, which does not appear in the text; this impairs the representation of the visual item "a red shirt" under the condition of the corresponding text. Such mismatched attention distributions suggest that the model represents the same concept with inconsistent semantics, which misleads the model into reducing the estimated similarity of positive pairs and further leads to the wrong prediction that they are unmatched. Worse still, in practice, the input regions of existing methods are extracted by a pre-trained object detector like Faster R-CNN (Ren et al., 2015). The visual features are much noisier due to over-sampling (Anderson et al., 2018), which necessitates a stronger regularizer to guide the alignment of the intra-modal relations.
Motivated by the above observations, we promote the semantic alignment from the object level to the relation level. We leverage the self-attention matrix to characterize the relation of items within one modality, and design Intra-modal Self-attention Distance (ISD), a novel metric that measures the consistency between textual and visual relations. Our empirical analysis illustrates that ISD and the model performance on image-text retrieval are highly correlated, which verifies our hypothesis and inspires us to minimize the semantic distance between intra-modal self-attentions during training. Accordingly, we propose a new regularized training method called Inter-modal Alignment on Intra-modal Self-attentions (IAIS) to calibrate two intra-modal attention distributions mutually via inter-modal alignment, which helps learn better contextualized representations for image-text pairs. The model performance of image-text retrieval on the Flickr30k and MS COCO datasets is improved by a considerable margin with IAIS, which demonstrates the superiority of our proposal.

Measuring Semantic Distance between Intra-modal Relations
In this section, we present a formal definition of intra-modal relation alignment (Section 2.1). Such alignment requires extracting the visual and linguistic items corresponding to all objects and sorting them in the same order to make their self-attention distributions comparable. We first introduce the mechanism for multimodal attention calculation, and then present the method of attention weight extraction for constructing comparable intra-modal self-attentions (Section 2.2). Finally, we propose a metric named Intra-modal Self-attention Distance (ISD) to quantify the relation consistency. We conduct an empirical analysis on prevailing models to verify the correlation of the model performance and our metric (Section 2.3).

From Intra-modal Relation to Self-attention
Given a sequence O = [o_1, · · · , o_N] of N objects that appear in an image-text pair, the linguistic and visual representations of the object sequence can be written as L = [l_1, · · · , l_N] and V = [v_1, · · · , v_N], respectively. Each pair of items l_i, v_i with the same index refers to the same object o_i.2 For every object, its relation to the others is depicted in both the linguistic and the visual modality. From a linguistic view, we regard the following textual self-attention distribution as the relation R^l_i stemming from l_i:

R^l_i = [a_{l_i→l_1}, · · · , a_{l_i→l_N}],

where a_{l_i→l_j} is the attention weight from l_i to l_j. Similarly, the relation R^v_i from the view of the visual modality can be written as

R^v_i = [a_{v_i→v_1}, · · · , a_{v_i→v_N}].

Consequently, we can achieve relation-level alignment by narrowing the semantic distance, e.g., the Kullback-Leibler divergence, between the linguistic and visual self-attention distributions for all objects from i = 1 to N:

ISD = Σ_{i=1}^{N} [KL(R^l_i ∥ R^v_i) + KL(R^v_i ∥ R^l_i)].

In the original self-attention matrix, however, the attention weights of specific objects are scattered and disordered. We need to extract the target weights and reorder them to construct comparable attention distributions R^l_i and R^v_i.
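To make the distance concrete, here is a minimal NumPy sketch with toy attention weights of our own choosing (not from any real model): each object's linguistic and visual relation rows are compared with a symmetric KL divergence and summed over objects.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sym_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two distributions."""
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return kl(p, q) + kl(q, p)

# Toy example with N = 3 objects: R_l[i] / R_v[i] are the relation
# distributions stemming from the i-th linguistic / visual item.
R_l = np.stack([softmax(np.array(r)) for r in ([2., 1., 0.], [0., 2., 1.], [1., 0., 2.])])
R_v = np.stack([softmax(np.array(r)) for r in ([2., 1., 0.], [0., 2., 1.], [0., 1., 2.])])

# ISD: sum the per-object symmetric KL distances.
isd = sum(sym_kl(R_l[i], R_v[i]) for i in range(3))
```

The first two objects have identical relations in both modalities, so only the third contributes to the distance.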

Intra-modal Self-attention Reconstruction
In this subsection, we first introduce the vanilla multimodal attention mechanism and then present a specific way of attention weight extraction. Consider models with a single-stream Transformer-based architecture like UNITER. The model consists of a stack of Transformer layers with the attention mechanism (Vaswani et al., 2017) and is responsible for encoding image-text pairs into feature representations. Given Q, K, V ∈ R^{N×d}, the matrices of N query, key, and value vectors with dimension d, respectively, the attention function Att(Q, K, V) is defined as:

Att(Q, K, V) = σ(S)V,   S = QK^⊤.

Here, σ is a row-wise, scaled softmax (softmax applied to S/√d), and S is a matrix of attention scores that measure the similarity between every pair of query and key vectors. Let L and V denote the linguistic and the visual modality, respectively. Given a textual sequence X_L of N_L tokens and a visual sequence X_V of N_V regions, the input X = [X_L X_V] in the single-stream architecture is a concatenation of the two sequences with length N = N_L + N_V. Accordingly, the query and key matrices3 can be written as

Q = XW_Q,   K = XW_K,

where W_Q and W_K are learnable parameters. Furthermore, the attention score matrix S ∈ R^{N×N} can be organized into four submatrices (Bugliarello et al., 2020):

S = [ S_LL  S_LV ]
    [ S_VL  S_VV ].

The matrices S_LL and S_VV on the diagonal represent the linguistic and the visual intra-modal self-attention, respectively. S_LV and S_VL on the back-diagonal represent the inter-modal attention scores from text to image, and the opposite.

Figure 2: An example of calculating Intra-modal Self-attention Distance (ISDa) for a matched image-text pair. Two inputs in the pair both contain the objects "two surfers" and "the waves". For the self-attention matrices S_LL and S_VV from each modality, we extract object-oriented patches according to the annotations and summarize them with the Cps operation (Eq. (7)) to synthesize new matrices S̃_LL and S̃_VV. Finally, we use our ISDa metric to measure their semantic distance.
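Assuming the convention that text tokens precede regions in the concatenated sequence, the four submatrices can be recovered from the joint score matrix by simple slicing:

```python
import numpy as np

def split_attention_scores(S, n_l):
    """Split the joint attention score matrix S (N x N, N = n_l + n_v)
    into the four blocks [[S_LL, S_LV], [S_VL, S_VV]], where the first
    n_l positions are assumed to be text tokens."""
    S_LL, S_LV = S[:n_l, :n_l], S[:n_l, n_l:]
    S_VL, S_VV = S[n_l:, :n_l], S[n_l:, n_l:]
    return S_LL, S_LV, S_VL, S_VV
```

The diagonal blocks are the two intra-modal self-attentions; the off-diagonal blocks carry the inter-modal scores used later for alignment.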
We regard the self-attentions σ(S_LL) and σ(S_VV) as depictions of the intra-modal relations. Each row of the matrix represents the relation stemming from one linguistic or visual item to the others within the same modality.
To construct the comparable intra-modal self-attention matrices, we leverage the object annotations in the Flickr30k Entities dataset (Plummer et al., 2015) to extract the tokens, regions, and attention weights with respect to the target objects. As shown in Figure 2, the text and the image both contain the annotated objects "two surfers" and "the waves". The linguistic object sequence can be written as L = [l_1, l_2] = ["two surfers", "the waves"]. These two objects derive four intrinsic relations, which can be described by four patches in the original linguistic self-attention matrix S_LL. For clarity, we define an operation Ext(S, o_i, o_j) that extracts the patch of attention scores in matrix S from object o_i to object o_j. Accordingly, the relation from "two surfers" to "the waves" can be denoted as Ext(S_LL, l_1, l_2). To describe a relation with a single value instead of a sub-matrix, we further construct an operation Cps(·) that summarizes an attention patch S ∈ R^{M×N} to a scalar via column-wise sum and row-wise average:

Cps(S) = (1/M) Σ_{i=1}^{M} Σ_{j=1}^{N} S[i, j].    (7)

After the above processing, we complete the extraction of the linguistic self-attention S_LL by grouping the items by annotated object, yielding a condensed matrix S̃_LL. The extraction of the visual self-attention S_VV is similar, and the final result is denoted as S̃_VV. Given S̃_LL and S̃_VV, we propose a metric called Intra-modal Self-attention Distance with annotation (ISDa) to quantify their semantic gap at the relation level. We define the following symmetric matrix-based Kullback-Leibler divergence (m-KL) for measuring the distance between two matrices A and B:

m-KL(A, B) = Σ_i [KL(A_i ∥ B_i) + KL(B_i ∥ A_i)],

where (·)_i stands for the i-th row vector of the matrix and KL denotes the Kullback-Leibler divergence. Accordingly, the final ISDa metric is

ISDa = m-KL(σ(S̃_LL), σ(S̃_VV)).

We present our algorithm for the calculation of ISDa in Algorithm 1.
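The Ext and Cps operators and the resulting ISDa can be sketched in NumPy as follows. The object spans (contiguous token/region index ranges, one per annotated object, in the same object order for both modalities) are a hypothetical stand-in for the Flickr30k Entities annotations, and we read "column-wise sum and row-wise average" as summing each row over its columns and then averaging over rows.

```python
import numpy as np

def softmax(M):
    e = np.exp(M - M.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ext(S, span_i, span_j):
    """Ext(S, o_i, o_j): patch of attention scores from object i to object j."""
    (a, b), (c, d) = span_i, span_j
    return S[a:b, c:d]

def cps(patch):
    """Cps: compress an M x N patch to a scalar (sum columns, average rows)."""
    return float(patch.sum(axis=1).mean())

def m_kl(A, B, eps=1e-12):
    """Symmetric matrix-based KL over row distributions."""
    kl = lambda P, Q: float(np.sum(P * np.log((P + eps) / (Q + eps))))
    return kl(A, B) + kl(B, A)

def isda(S_LL, S_VV, spans_l, spans_v):
    """Condense both self-attention matrices by annotated object, then compare."""
    def condense(S, spans):
        n = len(spans)
        out = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                out[i, j] = cps(ext(S, spans[i], spans[j]))
        return out
    return m_kl(softmax(condense(S_LL, spans_l)),
                softmax(condense(S_VV, spans_v)))
```

Identical self-attention matrices with identical spans yield an ISDa of zero, which is the calibration target.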
Algorithm 1: Intra-modal Self-attention Distance with Annotation (ISDa)
Input: Intra-modal self-attention matrices S_LL, S_VV
Input: Linguistic object sequence L
Input: Visual object sequence V
for linguistic object l_i in L do
    for linguistic object l_j in L do
        S̃_LL[i, j] ← Cps(Ext(S_LL, l_i, l_j))
for visual object v_i in V do
    for visual object v_j in V do
        S̃_VV[i, j] ← Cps(Ext(S_VV, v_i, v_j))
return ISDa = m-KL(σ(S̃_LL), σ(S̃_VV))

To study the correlation between the ISDa metric and the model performance,4 we conduct an empirical analysis on UNITER. As shown in Figure 3, ISDa decreases during the training phase while the model performance continues to increase. The two are strongly correlated, with a Pearson correlation coefficient of -0.60. After the middle stage of training, the curves of the model performance and ISDa flatten, suggesting that merely optimizing the task-oriented loss function while neglecting the constraint of relation consistency hinders the model from achieving better performance. To eliminate this bottleneck, we can minimize the ISD during training as a regularization to induce further improvement on the ITR task and better model interpretability.


Inter-modal Alignment on Intra-modal Self-attentions (IAIS)

In this section, we propose a new regularized training method, Inter-modal Alignment on Intra-modal Self-attentions (IAIS), for image-text retrieval. Our goal is to enhance the semantic alignment of relations by minimizing the distance between the two intra-modal self-attentions (ISD). In practice, given the original visual and linguistic input sequences V = [v_1, · · · , v_{N_V}] and L = [l_1, · · · , l_{N_L}] with scattered items,5 there are no object annotations, and the region features extracted by Faster R-CNN are much noisier (Anderson et al., 2018), which makes it difficult to group the attention weights by ground-truth object. ISDa thus cannot be used directly as the objective function to minimize.
To tackle this problem, we regard the input sequence from one modality (e.g., the visual sequence V) as an anchor. For every item in the anchor sequence, we extract its corresponding representation from the other modality (e.g., one item or a collection of items in the linguistic sequence L) to reconstruct a mirrored sequence. After that, the items and their relations within the anchor sequence have a one-to-one correspondence with the items and relations within the mirrored sequence, which makes the intra-modal self-attentions derived from the two sequences comparable. In the next two subsections, we propose two methods, singular alignment and distributed alignment, to accomplish the attention extraction and reconstruction. The former establishes a one-to-one mapping between linguistic and visual attention weights, while the latter establishes a distributed mapping. Besides, we design two losses L_IAIS as surrogates of the ISDa to measure the semantic distance between intra-modal self-attention matrices. Finally, we incorporate the surrogate loss minimization as a regularization to calibrate the intra-modal self-attentions mutually and achieve relation-level alignment.

Singular Alignment
For every item in the anchor sequence, singular alignment utilizes the inter-modal attention to find its most relevant item from the opposite modality. As the inter-modal attention score quantifies the similarity between items from the two modalities, the visual and linguistic items with the highest score can be aligned with each other. For example, given the i-th visual item v_i and the inter-modal attention matrix S_VL, the similarities between v_i and all the linguistic items are depicted in S_VL[i, :], i.e., the i-th row of the matrix. Hence the most relevant linguistic item for v_i can be denoted as l_{i*}, where i* = arg max S_VL[i, :]. Accordingly, for every weight a_{v_i→v_j} in the original visual self-attention matrix S_VV, its corresponding weight in the linguistic self-attention matrix S_LL can be extracted by the following operation:6

a_{l_{i*}→l_{j*}} = Ext(S_LL, l_{i*}, l_{j*}),

which we refer to as a singular alignment. After all the extractions, we reconstruct a mirrored matrix S̃_LL with S̃_LL[i, j] = a_{l_{i*}→l_{j*}}, together with its counterpart S̃_VV built with the linguistic sequence as the anchor, and define the singular IAIS loss as

L^(s)_IAIS = m-KL(σ(S_VV), σ(S̃_LL)) + m-KL(σ(S_LL), σ(S̃_VV)).    (11)
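A minimal sketch of singular alignment, treating each linguistic and visual item as a single index (real inputs are sub-word tokens and detected regions):

```python
import numpy as np

def mirror_singular(S_LL, S_VL):
    """For each visual index i, pick its most relevant linguistic item
    i* = argmax_j S_VL[i, j], then mirror the linguistic self-attention
    onto the visual index space: mirrored[i, j] = a_{l_{i*} -> l_{j*}}."""
    star = S_VL.argmax(axis=1)          # i -> i*
    return S_LL[np.ix_(star, star)]     # gather rows and columns at once
```

With the mirrored matrix in hand, the singular loss is just the symmetric row-wise KL between the (softmaxed) visual self-attention and this mirror, plus the same term with the roles of the modalities swapped.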

Distributed Alignment
As singular items from different modalities may not fully represent each other, we further propose distributed alignment, which utilizes a collection of linguistic items as a representation of a visual item, and vice versa. Specifically, given two visual items v_i and v_j, we regard the inter-modal attentions σ(S_VL[i, :])7 from v_i to all linguistic items and σ(S_LV[:, j])8 from all linguistic items to v_j as features. Hence the original similarity S_VV[i, j] = a_{v_i→v_j} between v_i and v_j can also be modeled as a dot-product of their distributed attention features from the cross-modal view: σ(S_VL[i, :]) · σ(S_LV[:, j]). Such distributed alignment leverages language as a bridge to draw implicit connections within the visual modality, which can be intuitively regarded as a multimodal form of back-translation (Sennrich et al., 2016). As shown in Figure 4, the distributed version of the mirrored self-attention matrix can be constructed by a matrix multiplication of two inter-modal attention matrices:

S̃^(d)_VV = σ(S_VL) σ(S_LV).

Similar to the singular alignment version, the distributed IAIS loss can be written as

L^(d)_IAIS = m-KL(σ(S_VV), S̃^(d)_VV) + m-KL(σ(S_LL), S̃^(d)_LL).    (13)

7 The i-th row of S_VL.
8 The j-th column of S_LV.
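A sketch of distributed alignment. Note that the product of two row-stochastic matrices is itself row-stochastic, so the mirrored matrix already forms valid attention distributions without extra normalization:

```python
import numpy as np

def softmax(M, axis=-1):
    e = np.exp(M - M.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mirror_distributed(S_VL, S_LV):
    """Mirrored visual self-attention via the linguistic bridge:
    row i of softmax(S_VL) spreads v_i over all linguistic items, and
    column j of softmax(S_LV) collects all linguistic items' attention
    to v_j; their dot products form an N_V x N_V relation matrix."""
    return softmax(S_VL) @ softmax(S_LV)
```

The symmetric construction with the linguistic sequence as the anchor is `softmax(S_LV) @ softmax(S_VL)`.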

Relation Alignment as Regularizer
With the IAIS loss, the surrogate of the semantic distance between two intra-modal self-attentions, we present a new regularized training method to enhance relation alignment for image-text retrieval. Our final loss is two-fold. The first is the task-orientated margin loss

L_margin = Σ_{i=1}^{N_p} Σ_{j=1}^{N_n} [α − S_i + S_j]_+ ,    (14)

where [x]_+ = max(0, x) and α is a preset margin. N_p and N_n denote the numbers of positive and negative pairs, and S_i and S_j are the similarity scores of a positive and a negative image-text pair, respectively. The second is the IAIS loss for all positive pairs, which quantifies their relation distance. The IAIS loss is computed based on the attentions from the last Transformer layer, and it can be either the singular alignment version (Eq. (11)) or the distributed alignment version (Eq. (13)). To summarize, our final loss can be formalized as

L = L_margin + λ_t · L_IAIS ,    (15)

where λ_t is a hyper-parameter w.r.t. the training step t that balances the two loss terms. Since our relation-level alignment is based on mappings between linguistic and visual items, it is beneficial to focus on the item-level alignment in the earlier training stage via the task-orientated loss. Accordingly, we utilize Training Signal Annealing to gradually incorporate the signal of the IAIS loss and design the following exponential schedule:

λ_t = exp((t/T − 1) × γ).    (16)

Table 1: Results of image and text retrieval on Flickr30k and MS COCO. R@K corresponds to whether the ground truth is recalled among the top K results. * denotes the results of UNITER taken from the original paper and † denotes our reproduction. IAIS-singular and IAIS-distributed denote the singular and distributed versions of the proposed relation-level alignment, respectively.
Here, T is the total number of training steps during the fine-tuning phase and t is the current step. As a pluggable regularizer, our IAIS method does NOT incorporate any extra parameters or additional data collection, yet it empowers the models to capture the higher-level semantics of relation consistency efficiently.
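The two terms of the final objective can be sketched as follows; the hard-negative sampling and the IAIS computation itself are out of scope here, so the similarity scores and the IAIS value enter as plain inputs (a sketch, not the training code):

```python
import numpy as np

def margin_loss(pos, neg, alpha=0.2):
    """Task-oriented margin loss (Eq. (14)): penalize every negative pair
    whose similarity comes within alpha of a positive pair's similarity."""
    p = np.asarray(pos)[:, None]   # N_p positive similarity scores
    n = np.asarray(neg)[None, :]   # N_n negative similarity scores
    return float(np.maximum(0.0, alpha - p + n).sum())

def final_loss(pos, neg, iais_loss, lam_t, alpha=0.2):
    """Eq. (15): task loss plus the annealed IAIS regularizer, where
    lam_t is the schedule weight at the current training step."""
    return margin_loss(pos, neg, alpha) + lam_t * iais_loss
```

With a margin of 0.2 (the value used in fine-tuning), a positive score of 1.0 against a negative score of 0.5 incurs no task loss, so only the weighted IAIS term remains.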

Benchmark Datasets
We conduct experiments on the Flickr30k (Young et al., 2014) and MS COCO (Lin et al., 2014) datasets. Flickr30k contains 31K images collected from the Flickr website, with five textual descriptions per image. We follow Karpathy and Li (2015) to split the data into 30K/1K/1K training/validation/test images. MS COCO consists of 123K images, each accompanied by five human-written captions. Following Karpathy and Li (2015), the data is divided into 82K/5K/5K training/validation/test images.

Fine-tuning Settings
Due to the limitation of computing resources, we only incorporate the IAIS regularization in the fine-tuning phase instead of pre-training. We use the base (12 layers) and the large (24 layers) versions of UNITER, one of the most prevailing large-scale pre-trained models, as our baseline and backbone for IAIS. We follow the fine-tuning settings and hyper-parameter configuration of the original paper.9 The margin α in Eq. (14) is 0.2. For each positive instance, 31 hard negative instances are sampled on the text and image sides, respectively, and as each batch contains 8 different positive instances, the batch size is 512. The learning rate is 5e-5 and the number of training steps is 5000 for both base and large models. All experiments are run on 8 NVIDIA V100 GPUs.

9 https://github.com/ChenRocks/UNITER

Main Results
The main results of UNITER with and without our IAIS regularization are reported in Table 1. Both the singular and the distributed versions of our method surpass the baseline by a considerable margin. The average improvement over all datasets and models is 4.49.
There are also some interesting findings: (1) Compared with image retrieval, the model performance on text retrieval is boosted by IAIS more remarkably with an average improvement of 3.50. Note that each image in both datasets is paired with five ground-truth sentences, and our IAIS regularizer helps the model capture the common relations for the image and the corresponding texts so that more ground-truth texts can be successfully retrieved.
(2) The improvement on UNITER-base is 17.2% higher than that on UNITER-large. A consistent result can be found in Table 2, which reports various relation distance metrics of the fine-tuned models. The ISDa of UNITER-large is smaller than that of UNITER-base, indicating that UNITER-large learns more about relation consistency due to its larger capacity, while there is still room to improve the relation alignment with our IAIS method. (3) The relative improvement brought by the singular version of IAIS is 7.0%, higher than that of the distributed version. The ISDa and L^(s)_IAIS have a Pearson correlation coefficient of 0.779, which is also higher than that of L^(d)_IAIS with 0.774. Besides, our empirical analysis in Figure 5 shows that L^(s)_IAIS is slightly easier to optimize, indicating that it is a better surrogate of ISDa.

Effect of Anchor Modality
In Section 3.3, we leverage both the linguistic and the visual input as the anchor sequence to reconstruct a mirrored sequence from the opposite modality. To study the impact of the anchor modality, we conduct an ablation study; the results are listed in Table 3. Compared to using language as the anchor modality, i.e., incorporating only L_IAIS-L, the overall model performance is 2.1 higher when vision is taken as the anchor. An explanation is that the descriptive capability of visual regions is more concrete and powerful. However, introducing both L_IAIS-V and L_IAIS-L into the final loss achieves a further improvement of 2.22, which indicates the necessity of such a combination.

Effect of Annealing Schedule
Besides the exp schedule in Eq. (16) for training signal annealing, we also try other schedules:
• log schedule: λ_t = 1 − exp(−t/T × γ);
• linear schedule: λ_t = t/T;
where γ is chosen from {5, 10}. All the schedules are shown in Figure 6.

Table 3: Ablation study on the Flickr30k dataset. "-L" denotes that only L_IAIS-L is incorporated, which regards language as the anchor modality; similarly for "-V".

We compare the results of the five schedules for IAIS signal annealing. The results in Figure 8 show that the exp schedule with scale γ = 5 achieves the best performance.
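The schedule families can be sketched together; the exact exponential form is our assumption (chosen so that λ_t rises to 1 at t = T, mirroring the log and linear schedules), while the log and linear forms follow the definitions above:

```python
import math

def annealing_weight(schedule, t, T, gamma=5.0):
    """IAIS signal-annealing weight at training step t of T.
    'exp' is an assumed form, exp((t/T - 1) * gamma); 'log' and
    'linear' are as defined in the text."""
    r = t / T
    if schedule == "exp":
        return math.exp((r - 1.0) * gamma)
    if schedule == "log":
        return 1.0 - math.exp(-r * gamma)
    if schedule == "linear":
        return r
    raise ValueError(f"unknown schedule: {schedule}")
```

The exp schedule keeps the IAIS signal weak for most of training and ramps it up late, which matches the intuition that item-level alignment should be learned before relation-level alignment.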

Effect of Layer to Apply IAIS
We also apply IAIS on different layers of UNITER-base. As illustrated in Figure 9, the optimal choice is to apply IAIS on the last layer. We speculate that it is more important to learn relation alignment in the deeper layers because the attention in the deeper layers has a bigger impact on the final output, while the effect of the attention in shallow layers might fade away due to the normalization.

Case Study
We further discuss the advantage of our proposed relation-level alignment. Figure 7 shows two visualization examples of the intra-modal selfattentions from the Flickr30k Entities dataset. With IAIS regularization, the model is instructed to concentrate on the common relations within the linguistic and visual sequence, yielding more calibrated and consistent self-attention distributions.

Related Work
In this section, we introduce the task of image-text retrieval and review the representative studies of  large-scale multimodal pre-trained models.
Image-Text Retrieval  Image-Text Retrieval (ITR; Barnard et al., 2003; Barnard and Forsyth, 2001), also known as Image-Text Matching, is one of the popular and challenging Language-and-Vision (V+L) tasks. Given image-text pairs, the prevailing approaches project them into a joint representation space, on which cosine or dot-product similarities are defined, and recall the most relevant one according to the similarity.
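The similarity-and-recall step amounts to a nearest-neighbor search in the joint space; a toy cosine-similarity sketch with made-up embeddings:

```python
import numpy as np

def retrieve(query_emb, candidate_embs):
    """Recall the most relevant candidate by cosine similarity in a
    joint embedding space (toy sketch with arbitrary vectors)."""
    q = query_emb / np.linalg.norm(query_emb)
    C = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = C @ q                      # cosine similarity per candidate
    return int(sims.argmax()), sims   # index of the best match, all scores
```

In practice the query is a text (or image) embedding and the candidates are the embeddings of all images (or texts) in the gallery; R@K then checks whether the ground truth appears among the K highest-scoring candidates.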
Multimodal Pre-trained Models  The development of the Transformer-based large-scale pre-training paradigm sweeps across the area of multimodal learning and achieves many state-of-the-art results on V+L tasks like Image Captioning, Visual Question Answering, Visual Commonsense Reasoning, etc. Recent prevailing multimodal pre-trained models can be categorized into single-stream (Lin et al., 2020; Su et al., 2020; Lin et al., 2021) and two-stream (Tan and Bansal, 2019) models. Given a piece of text and an image, the former architecture concatenates the features of tokens and regions and learns their joint representations with one Transformer model, while the latter embeds the textual and the visual input separately with two independent intra-modal Transformers and then utilizes an inter-modal Transformer to reinforce cross-modal interactions via cross-modal attention modules.

Conclusion
In this paper, we promote the semantic alignment for cross-modal retrieval from the object level to the relation level. We propose a surrogate metric to quantify relation consistency by measuring the semantic distance between linguistic and visual relations. Furthermore, we present a regularized training method, IAIS, to calibrate intra-modal self-attentions mutually by minimizing the ISD metric. Our method improves both the performance and the interpretability of large-scale pre-trained models. Note that, without object annotations in practice, the singular and distributed versions of the IAIS loss provide only a coarse-grained attention distribution alignment. We leave the elaborate design of an ISDa proxy function for future work.