Normalized Contrastive Learning for Text-Video Retrieval

Cross-modal contrastive learning has led the recent advances in multimodal retrieval with its simplicity and effectiveness. In this work, however, we reveal that cross-modal contrastive learning suffers from incorrect normalization of the sum retrieval probabilities of each text or video instance. Specifically, we show that many test instances are either over- or under-represented during retrieval, significantly hurting the retrieval performance. To address this problem, we propose Normalized Contrastive Learning (NCL), which utilizes the Sinkhorn-Knopp algorithm to compute the instance-wise biases that properly normalize the sum retrieval probabilities of each instance, so that every text and video instance is fairly represented during cross-modal retrieval. Empirical studies show that NCL brings consistent and significant gains in text-video retrieval on different model architectures, achieving new state-of-the-art multimodal retrieval metrics on the ActivityNet, MSVD, and MSR-VTT datasets without any architecture engineering.


Introduction
With the advent of large-scale multimodal data and transformer-based architectures, cross-modal contrastive learning has contributed to the recent advances in multimodal representation learning (Luo et al., 2021; Radford et al., 2021). Cross-modal contrastive learning provides a simple yet highly effective approach for learning representations from multimodal data without supervision. In particular, CLIP (Contrastive Language-Image Pretraining) (Radford et al., 2021) learns image-text representations using vision and text transformer encoders on millions of image-text pairs from the Web and demonstrates that the learned vision and text encoders can perform zero-shot transfer on various vision tasks. More recently, CLIP4Clip (Luo et al., 2021) extends the pretrained CLIP model and finetunes it on text-video datasets for embedding-based text-video retrieval, achieving state-of-the-art performance.
In embedding-based retrieval, the retrieval probabilities of each text or video instance are defined by their embedding similarity to their cross-modal queries. When there is one-to-one correspondence between text and video instances, as in many standard benchmarks, one would expect that the retrieval probabilities of a video summed over the text queries should be normalized to 1 and vice versa, so that overall, all text and video instances are equally represented during retrieval. However, we show that in practice, cross-modal contrastive learning suffers from significant normalization errors of the sum retrieval probabilities of each instance (Fig. 2). This suggests that many test instances are either over- or under-represented during retrieval, leading to high false positive and false negative rates (Fig. 3) and consequently harming the text-video retrieval performance.
To address this problem, we propose Normalized Contrastive Learning (NCL), which computes instance-wise biases using the Sinkhorn-Knopp algorithm (Cuturi, 2013) and adjusts the cross-modal embedding similarity scores so that the sum retrieval probabilities of each instance are properly normalized to 1 (Fig. 1). At test time, when the test queries are not known a priori, we show that we can approximate the test query distribution by storing a subset of train queries in a queue during training. We show that this approach consistently reduces the normalization errors (Fig. 2) and thereby improves the text-video retrieval performance significantly.
We evaluate NCL on text-video retrieval on the ActivityNet, MSVD, and MSR-VTT datasets. Empirical results show that NCL consistently improves both text-to-video and video-to-text retrieval across all datasets on different base architectures, including those of CLIP4Clip (Luo et al., 2021) and SSB (Patrick et al., 2021), and further advances state-of-the-art text-video retrieval performance without any architecture engineering. In summary, our contributions are: 1. Revealing that cross-modal contrastive learning suffers from incorrect normalization of the sum retrieval probabilities of each instance and showing how its text-video retrieval performance is impaired by this problem (Sec. 3.2).
2. Proposing the novel approach of Normalized Contrastive Learning (NCL) to address the normalization errors in cross-modal contrastive learning. NCL computes instance-wise biases using the Sinkhorn-Knopp algorithm (Cuturi, 2013) to adjust the cross-modal similarity scores so that every instance is fairly represented during retrieval (Secs. 3.3 and 3.4).
3. Establishing new state-of-the-art results on text-video retrieval tasks on multiple benchmark datasets (ActivityNet, MSVD, and MSR-VTT) without any architecture engineering. Moreover, NCL brings consistent and significant gains across different base model architectures, including CLIP4Clip (Luo et al., 2021) and SSB (Patrick et al., 2021) (Sec. 4).

Related Work
Contrastive representation learning InfoNCE (Noise Contrastive Estimation of mutual Information) was developed for learning unsupervised representations of natural images (Wu et al., 2018; Oord et al., 2018; Tian et al., 2019; Chen et al., 2020). It has quickly gained popularity due to its simplicity and effectiveness, leading the state-of-the-art advances in visual representation learning. Specifically, it first samples two different views of an image using random data augmentations. The views of the same image constitute a positive pair, while negative pairs consist of views of different images in a mini-batch. The encoder is then trained to minimize the cross-entropy loss of classifying the positive pair against the set of negative pairs, based on the embedding similarity between the representations.
Recently, the scope of contrastive learning has been extended to multimodal data (Zhang et al., 2020; Miech et al., 2020; Radford et al., 2021; Amrani et al., 2021; Luo et al., 2021). In cross-modal contrastive learning, different modalities of the data are mapped to a shared embedding space, using a separate encoder for each modality, where their cross-modal similarity is defined. In particular, CLIP (Radford et al., 2021) trains its image and text encoders on millions of image-text pairs using cross-modal contrastive learning and demonstrates that the learned representations can perform zero-shot transfer on various vision tasks. Li et al. (2021, 2022) further improve vision-language contrastive learning by incorporating additional language modeling and image-text matching losses.
Aside from InfoNCE, where each example is assigned its own class label, a recent line of work (Asano et al., 2020a,b; Caron et al., 2020) assumes that the data can be clustered into K latent classes.
These works generate pseudo-labels for the data by solving an optimal transport problem with entropy regularization, using the Sinkhorn-Knopp algorithm (Cuturi, 2013). The encoder is then trained to predict the pseudo-labels of the data. In contrast to previous work that used the Sinkhorn-Knopp algorithm for self-labeling, in this work we use the algorithm to compute the instance-wise biases that properly normalize the sum retrieval probabilities of each text or video instance, so that all instances are fairly represented during retrieval.
Embedding-based text-video retrieval maps video and text data into a shared multimodal embedding space where the similarity between a text-video pair is defined as the cosine similarity of their embeddings. Many recent works build upon pretrained vision and text encoders (Liu et al., 2019; Miech et al., 2019, 2020; Liu et al., 2021; Amrani et al., 2021; Patrick et al., 2021; Croitoru et al., 2021; Luo et al., 2021) and finetune their models on downstream text-video datasets. For example, Liu et al. (2019) aggregates multiple expert features, including objects, motion, appearance, and audio, using a collaborative gating mechanism and minimizes a ranking loss to align video and text embeddings for retrieval. Similarly, Gabeur et al. (2020) applies self-attention to video expert features to get video-level representations. On the other hand, Croitoru et al. (2021) trains multiple teacher models using different pretrained text encoders and distills their knowledge to a student model. Patrick et al. (2021) introduces an auxiliary task of reconstructing the caption of a video from other similar videos in a mini-batch to improve the learning of multimodal representations. Bain et al. (2021) tailors a vision transformer architecture to train the model on images and videos together, and applies curriculum learning by gradually increasing the number of frames the vision encoder takes. Liu et al. (2021) builds hierarchical transformers and performs hierarchical cross-modal contrastive matching at both the feature and semantic levels. In particular, CLIP4Clip (Luo et al., 2021) achieves state-of-the-art performance on text-video retrieval by loading the pretrained CLIP model (Radford et al., 2021) and finetuning it on text-video datasets using cross-modal contrastive learning. However, we show that CLIP4Clip suffers significantly from incorrect normalization of the sum retrieval probabilities of each instance, and propose Normalized Contrastive Learning to address this problem.

Cross-modal Contrastive Learning
We start with a brief introduction of cross-modal contrastive learning. Given a batch of $B$ ground-truth text-video pairs, video and text encoders map the input to embedding vector pairs $\{(t_1, v_1), \ldots, (t_B, v_B)\}$, where each embedding lies on the unit hypersphere $S^D$. Cross-modal similarity between a text-video pair is defined as the cosine similarity of their embeddings. As the embeddings have unit norm, the cosine similarity is simply their inner product $\langle t_i, v_j \rangle$. Text-to-video (t2v) and video-to-text (v2t) retrieval distributions are defined based on the cross-modal similarity as

$P_{\mathrm{t2v}}(v_j \mid t_i) = \dfrac{\exp(\langle t_i, v_j \rangle / \gamma)}{\sum_k \exp(\langle t_i, v_k \rangle / \gamma)}, \quad (1)$

$P_{\mathrm{v2t}}(t_j \mid v_i) = \dfrac{\exp(\langle v_i, t_j \rangle / \gamma)}{\sum_k \exp(\langle v_i, t_k \rangle / \gamma)}, \quad (2)$

where $\gamma$ is a temperature parameter that controls the concentration of the distributions. In text-to-video retrieval, text embedding $t_i$ serves as a query for video embedding $v_i$, and vice versa. Cross-modal contrastive learning (Radford et al., 2021; Luo et al., 2021) minimizes the sum of text-to-video and video-to-text cross-entropy losses:

$\mathcal{L} = -\dfrac{1}{B} \sum_i \big[ \log P_{\mathrm{t2v}}(v_i \mid t_i) + \log P_{\mathrm{v2t}}(t_i \mid v_i) \big]. \quad (3)$

In particular, CLIP (Radford et al., 2021) is trained on millions of image-text pairs using cross-modal contrastive learning. CLIP4Clip (Luo et al., 2021) finetunes the pretrained CLIP model for text-video retrieval, achieving state-of-the-art performance.
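For concreteness, the retrieval distributions and the symmetric cross-entropy loss above can be sketched in a few lines of NumPy (a minimal illustration, not the authors' implementation; embeddings are assumed L2-normalized):

```python
import numpy as np

def contrastive_loss(T, V, gamma=0.07):
    """Symmetric cross-modal contrastive loss for B L2-normalized
    text embeddings T (B x D) and video embeddings V (B x D)."""
    S = T @ V.T / gamma                          # cosine similarity / temperature
    # P_t2v(v_j | t_i): softmax over videos for each text query
    P_t2v = np.exp(S - S.max(axis=1, keepdims=True))
    P_t2v /= P_t2v.sum(axis=1, keepdims=True)
    # P_v2t(t_j | v_i): softmax over texts for each video query
    P_v2t = np.exp(S.T - S.T.max(axis=1, keepdims=True))
    P_v2t /= P_v2t.sum(axis=1, keepdims=True)
    idx = np.arange(len(T))                      # ground-truth pairs on the diagonal
    return -(np.log(P_t2v[idx, idx]).mean() + np.log(P_v2t[idx, idx]).mean())
```

The loss decreases as the matched pairs on the diagonal become more similar than the mismatched pairs.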

Normalization Errors in Retrieval
When there is one-to-one correspondence between the video and text instances, as in many standard benchmarks, one would expect the retrieval probabilities for a video $v_j$ summed over the text queries to be 1, i.e., $\sum_i P_{\mathrm{t2v}}(v_j \mid t_i) = 1, \forall j$, and vice versa: $\sum_i P_{\mathrm{v2t}}(t_j \mid v_i) = 1, \forall j$, so that all instances are equally represented during retrieval.
However, Fig. 2 shows that in practice, CLIP4Clip (Luo et al., 2021) trained using cross-modal contrastive learning suffers from significant normalization errors of the sum retrieval probabilities, where we define the text-to-video normalization error as the average absolute deviation of the sum of video retrieval probabilities from 1:

$E_{\mathrm{t2v}} = \dfrac{1}{N} \sum_{j=1}^{N} \Big| \sum_i P_{\mathrm{t2v}}(v_j \mid t_i) - 1 \Big|, \quad (5)$

where $N$ is the number of test text queries. The video-to-text normalization error is defined in a symmetrical manner. Incorrect normalization of the sum retrieval probabilities compromises the retrieval performance. If a video $v_j$ is under-represented relative to the average, it has a higher chance of not being retrieved by its true query $t_j$ (false negative); if it is over-represented, it has a higher chance of being wrongly retrieved by irrelevant queries $t_k$ for $k \neq j$ (false positive).
Figure 3 illustrates these phenomena on ActivityNet, demonstrating how the normalization error is correlated with false negative and false positive rates in retrieval. Given that cross-modal contrastive learning suffers from significant normalization errors (Fig. 2), this suggests that its retrieval performance is substantially impaired by incorrect normalization of the sum retrieval probabilities.

Figure 3: False negative/positive rates vs. the sum of retrieval probabilities $\sum_i P_{\mathrm{t2v}}(v_j \mid t_i)$ for a video in text-to-video retrieval on ActivityNet. The false negative/positive rates rapidly increase as the sum of retrieval probabilities of a video deviates from 1.
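The normalization error defined above is easy to measure from a similarity matrix; a small sketch under the one-to-one assumption (N text queries and N videos; illustrative code, not the paper's):

```python
import numpy as np

def t2v_normalization_error(S):
    """Average absolute deviation from 1 of each video's sum retrieval
    probability, for an N x N similarity matrix S (rows: text queries,
    columns: videos)."""
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)    # rows are P_t2v(. | t_i), each sums to 1
    sum_probs = P.sum(axis=0)            # sum over text queries for each video
    return np.abs(sum_probs - 1.0).mean()
```

A perfectly normalized model has error 0; a model whose queries all collapse onto a few videos has a large error.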

Normalized Contrastive Learning
To address this problem, we propose Normalized Contrastive Learning (NCL), which normalizes the sum retrieval probabilities of each instance so that all instances are equally represented during retrieval. First, we introduce instance-wise biases and define the adjusted cross-modal similarity score as

$\tilde{s}_{ij} = \langle t_i, v_j \rangle / \gamma + a_i + b_j, \quad (6)$

where $a_i$ is a text bias for text $i$ and $b_j$ is a video bias for video $j$. These instance-wise biases adjust the overall weights of the instances during retrieval. For example, a positive video bias $b_j$ will increase the overall retrieval probabilities for $v_j$, while a negative video bias will decrease them. Therefore, by setting the instance-wise biases $a_i$, $b_j$ to appropriate values, we can properly normalize the retrieval probabilities of the text and video instances. NCL utilizes the Sinkhorn-Knopp algorithm (Cuturi, 2013) to compute the optimal values of the biases for texts, $\{a^*_1, \ldots, a^*_B\}$, and videos, $\{b^*_1, \ldots, b^*_B\}$. Specifically, given a non-negative matrix $M \in \mathbb{R}^{m \times n}_+$, the Sinkhorn-Knopp algorithm computes normalization vectors $\alpha \in \mathbb{R}^m_+$, $\beta \in \mathbb{R}^n_+$ using fixed-point iterations such that

$P = \mathrm{diag}(\alpha)\, M\, \mathrm{diag}(\beta) \quad (7)$

lies in a valid transportation polytope, i.e.,

$\textstyle\sum_j p_{ij} = 1/m, \; \forall i, \quad (8)$

$\textstyle\sum_i p_{ij} = 1/n, \; \forall j. \quad (9)$

In other words, $P$ represents a joint probability distribution for two random variables $X, Y$ such that $P(X = i, Y = j) = p_{ij}$, with uniform marginal constraints $P(X = i) = 1/m$, $P(Y = j) = 1/n$. For retrieval, this means that all instances are equally represented given the set of queries. In this work, we focus on the standard setting where there is one-to-one correspondence between the videos and text captions; hence we assume uniform marginal priors for the instances. However, note that the Sinkhorn-Knopp algorithm generalizes to arbitrary prior distributions $P(X)$, $P(Y)$ (e.g., see Asano et al.
(2020a)), and if a video is matched to multiple text captions or vice versa, we can easily modify the priors accordingly so that its sum of retrieval probabilities scales proportionally to the number of matching queries. Specifically, the right-hand sides of Eqs. (8) and (9) will be replaced with simplex vectors $r$ and $c$ that represent the marginal distributions of the text and video instances. The Sinkhorn-Knopp algorithm will then normalize the retrieval distributions so that the sum retrieval probabilities of each instance equal its marginal weight.
Given the text-video similarity matrix $S = \{s_{ij}\} \in \mathbb{R}^{B \times B}$, $s_{ij} = \langle t_i, v_j \rangle / \gamma$, we define the non-negative matrix as $M = \exp(S)$. After computing the normalization vectors $\alpha, \beta$ for $M$ using the Sinkhorn-Knopp algorithm, the optimal values of the text and video biases are derived as

$a^*_i = \log \alpha_i, \quad (10)$

$b^*_j = \log \beta_j. \quad (11)$

Figure 4 gives a PyTorch-style implementation of the algorithm for computing the instance-wise biases using fixed-point iterations. We set the number of fixed-point iterations to 4 and find that the residuals are sufficiently small for our experiments. NCL uses the computed instance-wise biases to adjust the cross-modal similarity score for retrieval:

$\tilde{s}_{ij} = \langle t_i, v_j \rangle / \gamma + a^*_i + b^*_j. \quad (12)$

We can easily verify that the adjusted similarity score of Eq. (12) defines properly normalized retrieval distributions such that

$\textstyle\sum_i \tilde{P}_{\mathrm{t2v}}(v_j \mid t_i) = 1, \; \forall j, \quad \sum_i \tilde{P}_{\mathrm{v2t}}(t_j \mid v_i) = 1, \; \forall j. \quad (13)$

The proof is in Appendix A.
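The bias computation of Fig. 4 can be sketched in NumPy as follows (an equivalent re-implementation under the uniform-marginal assumption, not the authors' exact code; a max-shift is added for numerical stability, which only offsets the biases by a constant and leaves the induced retrieval distributions unchanged):

```python
import numpy as np

def compute_ncl_biases(S, n_iters=4):
    """Sinkhorn-Knopp fixed-point iterations for an m x n similarity
    matrix S. Returns text biases a (m,) and video biases b (n,).
    For a square batch, the row-wise softmax of the adjusted scores
    S + a[:, None] + b[None, :] then has column sums ~1, so every
    video is equally represented."""
    m, n = S.shape
    M = np.exp(S - S.max())              # non-negative matrix, shifted for stability
    alpha = np.full(m, 1.0 / m)
    beta = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        alpha = (1.0 / m) / (M @ beta)   # enforce row marginals 1/m
        beta = (1.0 / n) / (M.T @ alpha) # enforce column marginals 1/n
    return np.log(alpha), np.log(beta)
```

With more iterations the residuals shrink further; the paper reports that 4 iterations already suffice in practice.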
During training, NCL computes the optimal biases for the current batch of examples and uses the adjusted similarity score (Eq. (12)) for learning.

Normalization at Test Time
If the test query distribution is known a priori, we can readily construct the test similarity matrix $S \in \mathbb{R}^{N \times N}$, where $N$ is the number of test instances. In this case, we can directly compute the optimal biases $a^*, b^*$ that normalize the retrieval weights of the test instances using the algorithm of Fig. 4, and the normalization error will be exactly zero.
In general, however, the test queries may not be known in advance. In such cases, we propose to approximate the unseen test query distribution with a subset of train queries. For this purpose, we introduce two query queues that store the last $K$ train queries during training, one each for text and video queries. These query queues can be easily integrated into existing training loops with negligible computational overhead. The stored queries are used only at test time, as plug-in approximations to the unknown test query distributions. At test time, we normalize the retrieval probabilities of the test instances using the subset of train queries stored in the query queues. For text-to-video retrieval, for example, we first construct a pseudo similarity matrix $\tilde{S} \in \mathbb{R}^{K \times N}$ from the $K$ train text queries in the queue and the $N$ test video instances. We then apply our normalization algorithm (Fig. 4) to compute the video biases $\tilde{b}_1, \ldots, \tilde{b}_N$ for the test videos. Video-to-text retrieval is handled in a symmetrical manner. We use the computed biases $\tilde{a}_1, \ldots, \tilde{a}_N$ and $\tilde{b}_1, \ldots, \tilde{b}_N$ to adjust the cross-modal similarity scores (Eq. (6)) between the test queries and test instances during retrieval. Due to the approximation error from using a subset of train queries as a substitute for the actual test queries, the normalization error will be nonzero. However, Fig. 2 shows that the proposed approach still consistently reduces the normalization errors on all datasets.
The computational complexity of the test-time normalization is $O(KN)$, where $K$ is the size of the train query queue and $N$ is the number of test instances, scaling linearly with the test set size. Given that embedding-based retrieval already has $O(N^2)$ complexity, NCL does not increase the overall test-time complexity as long as $K = O(N)$. We study the effects of our approximation and of the query queue size on retrieval performance in Sec. 4.2.

Experiments
We evaluate Normalized Contrastive Learning (NCL) on multimodal retrieval on popular text-video datasets including ActivityNet, MSVD, and MSR-VTT, and report recall-at-K (R@K) metrics (higher is better) and median and mean rank (lower is better). The main goals of the experiments are: 1. comparing NCL to state-of-the-art models in text-video retrieval using different model architectures (Sec. 4.1); 2. analyzing the effects of the proposed test-time normalization method and the size of the query queue on NCL's text-video retrieval performance (Sec. 4.2). For a fair comparison, we assume that the test query distribution is not known in advance and use train query queues for test-time normalization.
Datasets ActivityNet (Krishna et al., 2017; Fabian Caba Heilbron and Niebles, 2015) is a collection of 20,000 YouTube videos. Following Zhang et al. (2018) and Gabeur et al. (2020), we concatenate the text descriptions of a video into one paragraph and perform video-paragraph retrieval on ActivityNet. We use the 'val1' split, which contains 5K videos and paragraphs, for evaluation; the train split has 10K video-paragraph pairs. MSVD (Chen and Dolan, 2011) has 1,970 videos, each with approximately 40 captions. The train split has 1,200 videos, with 100 videos in the validation and 670 videos in the test split. In MSVD, each video in the test set has multiple captions associated with it. MSR-VTT (Xu et al., 2016) consists of 10,000 videos with 20 text captions per video. We use the 9K train split with 180K captions and the 1K test split following Yu et al. (2018). The test split contains only one caption per video.
Architecture We use the state-of-the-art architecture of CLIP4Clip (Luo et al., 2021) with the code released by the authors. CLIP4Clip is based on the pretrained CLIP model (Radford et al., 2021), which consists of transformer-based vision and text encoders trained on large-scale image-text data.
For the experiments, we adopt the mean pooling (meanP) architecture for aggregating the frame features, as it was shown to deliver the most consistent performance on both text-to-video and video-to-text retrieval across different datasets (Luo et al., 2021). We do not add any trainable model components and use the CLIP4Clip architecture as-is.
Implementation details are described in Appendix B.

Comparison to State of the Art
We compare NCL with state-of-the-art multimodal retrieval models (Croitoru et al., 2021; Patrick et al., 2021; Liu et al., 2021; Luo et al., 2021) on ActivityNet, MSVD, and MSR-VTT. NCL uses the same architecture as CLIP4Clip (Luo et al., 2021). Tables 1 to 3 summarize the results. On all datasets, NCL brings significant improvements on state-of-the-art recall metrics in both text-to-video and video-to-text retrieval. On ActivityNet, NCL gives 13% and 10% relative gains on text-to-video and video-to-text R@1, respectively, compared to the previous state of the art, CLIP4Clip (Luo et al., 2021). Notably, the gain on MSVD video-to-text retrieval is substantial, with a more than 23% relative boost in R@1 and the mean rank reduced by more than half. This may be attributed to the significant imbalance between the numbers of video queries and text captions in the MSVD test set (670 videos and 28K captions), which makes proper normalization of the caption retrieval probabilities more crucial. In addition, NCL improves most of the recall metrics on MSR-VTT. We emphasize that these results are achieved without any architecture engineering, and the additional computational overhead introduced by NCL is negligible.

Analysis of NCL
Tab. 5 studies the effect of the approximation proposed in Sec. 3.4. Using the test queries directly for normalization gives the best performance (last row). This is the oracle setting, where the normalization errors become zero. In general, however, the test queries may not be known in advance, and we approximate the unknown test query distribution using a subset of train queries stored in the query queues. We observe that this approximation reduces the test normalization errors (Fig. 2) and brings significant gains compared to the CLIP4Clip baseline (Luo et al., 2021), even without prior knowledge of the test queries. Still, there is a considerable gap compared to the oracle.
Figure 5 presents the effect of the train query queue size on text-video retrieval performance on ActivityNet.The performance improves with growing query queue size and plateaus at about 16K queries.Based on this result, we set the query queue size to 16384 in our experiments.

Conclusion
We presented Normalized Contrastive Learning (NCL) to improve cross-modal contrastive learning for retrieval. NCL applies the Sinkhorn-Knopp algorithm to normalize the retrieval probabilities of the text and video instances so that each instance is fairly represented during retrieval. When the test queries are not known a priori, NCL approximates the test query distribution with a subset of train queries stored during training. Empirical studies show that NCL consistently reduces the normalization errors and brings significant gains in state-of-the-art text-video retrieval performance without any architecture engineering. Moreover, the gains are consistent across different model architectures.
For future work, it will be worthwhile to explore whether NCL can help retrieval tasks in other domains, such as image and text, and whether it can improve general representation learning for downstream tasks such as classification.

Limitations
The proposed approach of Normalized Contrastive Learning (NCL) is broadly applicable to general embedding-based retrieval tasks in unimodal or multimodal domains. However, the scope of the

Architecture. SSB uses the pretrained T5-base model (Raffel et al., 2020) for its text encoder. For vision, it first extracts motion and appearance features using the 34-layer R(2+1)D model (Tran et al., 2018) pretrained on IG65M (Ghadiyaram et al., 2019) and ResNet-152 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009), respectively. It then concatenates the motion and appearance features to form its visual input. SSB applies CNN and RNN networks to its text and visual features, respectively, followed by transformer pooling layers. For more details, we refer to the original paper (Patrick et al., 2021).
Implementation details. In our experiments, we use the SSB model (Patrick et al., 2021) without additional pretraining on HowTo100M (Miech et al., 2019). SSB employs a max-margin triplet ranking loss with hard negative mining (Faghri et al., 2018), and we replace the margin loss with the NCL loss (Eqs. 31 to 33 in the paper). The temperature parameter for NCL is set to 0.07 following Chen et al. (2020). NCL uses query queues of size 16,384 to store the queries during training. The scale of the NCL loss is multiplied by 15 in order to approximately match the scale of the original margin loss. We apply dropout with ratio 0.1 for MSVD and MSR-VTT, and 0.0 for ActivityNet. For the rest of the hyperparameters, we follow the settings used in Patrick et al. (2021).

D Multimodal Embedding Space of Cross-modal Contrastive Learning
Many recent works (Radford et al., 2021; Luo et al., 2021; Xu et al., 2021; Miech et al., 2020; Bain et al., 2021) have demonstrated the promise of cross-modal contrastive learning for multimodal data. However, its behavior and properties in multimodal environments have not been well understood until now. In this section, we study the multimodal embedding space of the CLIP4Clip (Luo et al., 2021) model trained using cross-modal contrastive learning.
Figure 7 visualizes the video-text embedding space of the CLIP4Clip model at initialization on the MSR-VTT dataset. The video and text embeddings do not overlap with each other, being highly clustered by modality in the embedding space. This is surprising, as the success of contrastive learning in the unimodal setting of natural images has previously been attributed to the alignment and uniformity of the embeddings (Wang and Isola, 2020). The figure shows that the video-text embeddings are neither well-aligned with each other nor evenly distributed on the unit hypersphere. This can also be confirmed from the average embedding similarities between the modalities in Fig. 8, which shows that within-modal embedding similarities remain high even after finetuning on MSR-VTT. In addition, the relatively low average similarity between text and video suggests that the embeddings of the different modalities do not overlap, as depicted in Fig. 7. This problem may be due to the distribution shift caused by finetuning the pretrained CLIP model on MSR-VTT. While finetuning the model longer might alleviate this issue, we find that longer finetuning harms the retrieval performance due to overfitting.
To analyze the implications of this phenomenon for cross-modal contrastive learning, we decompose the embeddings as

$v_i = \mu_v + \bar{v}_i, \quad (19)$

$t_i = \mu_t + \bar{t}_i, \quad (20)$

where $\mu_v$, $\mu_t$ denote the modal means of the video and text embeddings and $\bar{v}_i$, $\bar{t}_i$ the displacements from their respective modal means.

Figure 1 :
Figure 1: Overview of the proposed approach. (a) Cross-modal contrastive learning suffers from incorrect normalization of the sum retrieval probabilities of each text/video instance. (b) Normalized contrastive learning computes the instance-wise biases that normalize the sum retrieval probabilities so that all instances are fairly represented.

Figure 2 :
Figure 2: Normalization errors (e.g., Eq. (5)) of the sum retrieval probabilities at test time. The baseline is CLIP4Clip (Luo et al., 2021) using cross-modal contrastive learning. Normalized Contrastive Learning (NCL) consistently reduces the normalization errors across the datasets. The remaining errors of NCL come from approximating the unknown test query distribution with a subset of train queries.

Figure 4 :
Figure 4: PyTorch-style implementation of the Sinkhorn-Knopp algorithm for computing the optimal biases a, b for a similarity matrix S. The computed bias vectors are used to adjust the cross-modal similarity scores (Eq. (12)).

Figure 5 :
Figure 5: Query queue size vs. text-to-video R@1 on ActivityNet. The performance improves as the size of the query queue increases and plateaus at about 16K.

Figure 7 :
Figure 7: t-SNE visualization of the multimodal embedding space of CLIP4Clip (Luo et al., 2021) on MSR-VTT. The figure perceptually illustrates the modal mean decomposition (Eqs. (19) to (20)). $\mu_v$, $\mu_t$ denote the modal means of the video and text embeddings and $\bar{v}_i$, $\bar{t}_i$ represent the displacements from their respective modal means. The text and video embeddings are highly clustered by modality and do not overlap with each other.

Table 1 :
Multimodal retrieval results on ActivityNet evaluated on the val1 split (Fabian Caba Heilbron and Niebles, 2015)

Table 3 :
Multimodal retrieval results on MSR-VTT evaluated on the 1K test split

Table 4 :
Multimodal retrieval results using the Support Set Bottleneck (SSB) (Patrick et al., 2021) architecture on ActivityNet, MSVD and MSR-VTT. We report the results without additional pretraining on the HowTo100M dataset. The results show that NCL brings consistent gains across all datasets regardless of the base architecture