RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval

Video-language pre-training methods have mainly adopted sparse sampling techniques to alleviate the temporal redundancy of videos. Though effective, sparse sampling still suffers from inter-modal redundancy: visual redundancy and textual redundancy. Compared with highly generalized text, sparsely sampled frames usually contain text-independent portions, called visual redundancy. Sparse sampling is also likely to miss important frames corresponding to some text portions, resulting in textual redundancy. Inter-modal redundancy leads to a mismatch of video and text information, hindering the model from better learning the shared semantics across modalities. To alleviate it, we propose Redundancy-aware Video-language Pre-training. We design a redundancy measurement of video patches and text tokens by calculating the cross-modal minimum dis-similarity. Then, we penalize the high-redundant video patches and text tokens through a proposed redundancy-aware contrastive learning. We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC, achieving a significant improvement over the previous state-of-the-art results. Our code is available at https://github.com/caskcsg/VLP/tree/main/RaP.


Introduction
Text-video retrieval computes the semantic similarity between a text query and candidate videos, ranking more similar videos higher. Video-language pre-training can jointly learn the representation of video and text, allowing cross-modal similarity computation to be more effective and efficient, so it has been widely explored in text-video retrieval (Bain et al., 2021; Li et al., 2022a; Lei et al., 2021; Gorti et al., 2022). Videos are composed of dozens or hundreds of consecutive frames and usually contain much redundant information, known as temporal redundancy. Lei et al. (2021) propose to sparsely sample frames from videos to alleviate temporal redundancy without any drop in performance, an approach followed by many works (Bain et al., 2021; Li et al., 2022a; Gorti et al., 2022).

* The first two authors contribute equally. † Corresponding author.
In addition to intra-modal redundancy, i.e., temporal redundancy, there is inter-modal redundancy between video and text. Some previous works (Zhu and Yang, 2020; Chen et al., 2020; Wang et al., 2022; Li et al., 2022a) focus on modeling fine-grained alignment, which can alleviate inter-modal redundancy to some extent. But they have not categorized and analyzed inter-modal redundancy in detail. We summarize inter-modal redundancy into two categories: visual redundancy and textual redundancy, as shown in the example in Figure 1. Visual redundancy refers to the redundant information beyond textual semantics that exists in sparsely sampled frames. In contrast to highly generalized text, multiple video frames tend to contain portions that are semantically irrelevant to the text. Textual redundancy refers to the redundant portions in the text that are irrelevant to sparsely sampled frames. Sparse sampling from the video will probably miss important frames associated with some text portions.
Figure 2: Redundancy-aware video-language pre-training method. Sparsely sampled frames are mapped into a video embedding and patch features through multiple layers. Similarly, the text is mapped into a text embedding and token features. The dis-similarity matrix between the patch and token features is used to calculate the redundancy. We take the minimum value by row/column as the redundancy of each patch/token, respectively. Patch/token redundancy is then used for weighted patch-to-text/token-to-video contrastive learning to reduce the impact of high-redundancy patches/tokens.
Inter-modal redundancy will lead to a mismatch of video and text semantics, preventing the model from better learning the shared semantics across modalities. Visually redundant pixels are encoded into the video embedding, and pre-training aligns the text embedding with the redundant video embedding, pushing the text embedding away from the correct text semantics. Similarly, pre-training aligns the video embedding with the redundant text embedding, pushing the video embedding away from the correct video semantics. Methods that alleviate redundancy through fine-grained alignment (Zhu and Yang, 2020; Chen et al., 2020; Wang et al., 2022) mainly rely on offline object detectors to extract objects or tags from sampled frames. These methods assume that the objects extracted from the sampled frames are related to the text description, or that the tags extracted in one frame relate to other frames. However, there is uncertainty in the correlation between multiple frames, especially under sparse sampling. In addition, object detectors suffer from inaccurate detection, a limited number of categories, and the inability to be optimized end-to-end in video-language pre-training.
To better alleviate the problem of inter-modal redundancy, we first propose a redundancy measurement, as shown in Figure 2. In video-language pre-training, each video frame is split into patches, and a text is tokenized into tokens. Take Figure 1-(a) as an example. The green patch is low-redundant because it relates to the tokens "two", "men", and "fighting". In contrast, the red patch is high-redundant because it corresponds to no token. Therefore, the redundancy of a patch depends on how well it corresponds to the tokens in the text. In other words, a patch is high-redundant if it has low semantic similarity to all tokens. So we use the minimum dis-similarity between a patch and all tokens as its visual redundancy. Symmetrically, the redundancy of a token depends on how well it corresponds to the patches in the video, and we use the minimum dis-similarity between a token and all patches as its textual redundancy.
To reduce the impact of high-redundant patches on learning text embeddings, and of high-redundant tokens on learning video embeddings, we then propose redundancy-aware contrastive learning. We take token-video pairs as additional positives in video-to-text contrastive learning and assign smaller weights to pairs with high-redundant tokens. Similarly, we take patch-text pairs as additional positives in text-to-video contrastive learning and assign smaller weights to those with high-redundant patches. Specifically, the weight equals (1 − redundancy) in the calculation.
Combining the above two points, we propose the Redundancy-aware Video-language Pre-training (RaP) method, which is end-to-end trainable without relying on object detection, as shown in Figure 2. We evaluate RaP on four text-video retrieval datasets, MSRVTT, MSVD, DiDeMo, and LSMDC, achieving a significant improvement over the previous state-of-the-art results. Extensive ablation studies also confirm the effectiveness of RaP. Our contributions can be summarized as follows: 1. We summarize the inter-modal redundancy in video-language pre-training and propose a measurement for the redundancy.
2. We propose redundancy-aware contrastive learning to alleviate the two types of inter-modal redundancy and facilitate high-quality modeling of the shared semantics.
3. Experimental results show that our method significantly improves the state-of-the-art results on multiple text-video retrieval datasets.

Related Work
Video-language pre-training aims to learn joint representations of video and language. Videos consist of consecutive frames and often contain visually similar, redundant information. Redundant information brings two problems to video-language pre-training: one is the extra computational overhead, and the other is that the semantics of video and text cannot be well aligned.
To alleviate the problems above, prior approaches (Li et al., 2020; Luo et al., 2020; Miech et al., 2019, 2020; Sun et al., 2019; Zhu and Yang, 2020) use offline tools to extract video features, but cannot achieve end-to-end pre-training. ClipBERT (Lei et al., 2021) efficiently trains the video encoder end-to-end using only a few sparsely sampled frames. Later, several well-performing methods (Bain et al., 2021; Li et al., 2022a; Wang et al., 2022; Fu et al., 2021) adopt the sparse sampling strategy to alleviate temporal redundancy and reduce computational overhead. Since we focus on mitigating the mismatch caused by redundant information rather than on temporal redundancy itself, we follow (Lei et al., 2021) and use sparsely sampled frames as input to the video encoder.
To better align video and text, some recent works introduce fine-grained alignment into video-language pre-training (Zhu and Yang, 2020; Chen et al., 2020; Wang et al., 2022; Li et al., 2022a). Those works identify regions containing objects in videos via offline-trained object detectors or prompters. Then they align the regions containing objects with the text description, which alleviates inter-modal redundancy implicitly. Unlike them, we explicitly propose an efficient redundancy measurement that quantifies the impact of different redundant portions. The most related work is OA-Trans (Wang et al., 2022), which extracts objects and tags from an anchor frame with an offline detector and uses the object-related regions or tags as additional input to reduce redundancy. Unlike OA-Trans, our redundancy measurement is learnable and can be optimized end-to-end during pre-training. We do not rely on an offline detector, so we do not suffer from its drawbacks.

Background
This section introduces some background knowledge of video-language pre-training, including video-text input, encoders, and contrastive learning for training.

Text-Video Input
Video-language pre-training methods use text-video pairs as raw input, where the text is a description of the video. A video V is sparsely sampled into K frames, obtaining a frame sequence {F_k} for k = 1..K. Before being fed into the video encoder, each frame F_k is divided into N equally sized patches {P_n^k} for n = 1..N. The patches are then mapped into input embeddings {Pe_n^k} for n = 0..N via projection, where Pe_0^k is an additional [CLS] embedding that learns the global semantics of frame F_k. Similarly, before being fed into the text encoder, a text T is tokenized into L consecutive tokens and projected into token embeddings {Te_l} for l = 0..L, where Te_0 is an additional [CLS] embedding that learns the global semantics of the text.
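As a concrete illustration of the sparse-sampling step, a common implementation draws one frame index from each of K equal temporal segments. This is a sketch of that common practice, not necessarily the exact scheme used in the paper; the function name is ours:

```python
import numpy as np

def sparse_sample_indices(num_frames: int, k: int, seed: int = 0) -> np.ndarray:
    """Pick one random frame index from each of k equal temporal segments."""
    rng = np.random.default_rng(seed)
    # Segment boundaries: k contiguous, roughly equal spans covering the video.
    bounds = np.linspace(0, num_frames, k + 1).astype(int)
    # One uniformly random index inside each segment.
    return np.array([rng.integers(lo, hi) for lo, hi in zip(bounds[:-1], bounds[1:])])
```

Because each index is drawn from its own segment, the sampled indices are temporally ordered and spread across the whole video.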

Text-Video Encoders
Video Encoder We use the vision transformer (ViT) (Dosovitskiy et al., 2020) as the video encoder to process each frame F_k separately. ViT takes the frame patch embeddings {Pe_n^k} for n = 0..N as input and outputs frame patch features {Pf_n^k} for the corresponding N + 1 positions. Then, we perform mean pooling on the features at the same position across the K frames. We further transform the pooled feature of each position into a shared, normalized, low-dimensional (e.g., 256-dim) space, obtaining the video patch features {P_n} for n = 0..N. Unless otherwise specified, we refer to P_n as a patch feature hereafter. In particular, P_cls denotes the global video embedding at the [CLS] position.
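The pooling-and-projection step can be sketched as follows; the learned projection is represented here by a plain matrix `proj` (an assumption about its exact form):

```python
import numpy as np

def pool_and_project(frame_feats: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """frame_feats: (K, N+1, D) per-frame ViT outputs; proj: (D, d) shared projection.
    Returns (N+1, d) L2-normalized video patch features; row 0 is the [CLS] position."""
    pooled = frame_feats.mean(axis=0)      # mean over the K frames at each position
    low = pooled @ proj                    # map into the shared low-dimensional space
    return low / np.linalg.norm(low, axis=-1, keepdims=True)  # L2-normalize each row
```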
Text Encoder We use BERT (Devlin et al., 2018) as the text encoder, which takes the token embeddings {Te_l} for l = 0..L as input. The output embedding at each position of BERT is transformed into the same low-dimensional space as above, producing the token features {T_l} for l = 0..L. Unless otherwise specified, we refer to T_l as a token feature hereafter. In particular, T_cls denotes the global text embedding at the [CLS] position.

Video-Text Contrastive Learning
Following CLIP (Radford et al., 2021), we align video and text features into a comparable shared embedding space via contrastive learning. Given the normalized video embedding P_cls and normalized text embedding T_cls, the similarity between video V and text T is their dot product:

s(V, T) = P_cls · T_cls.    (1)

We aim to assign higher similarity scores to matched video-text pairs. Therefore, in contrastive learning, we take matched video-text pairs as positives and all other pairs formed within a batch as negatives. Given a batch with B matched pairs {(P_cls^i, T_cls^i)} for i = 1..B, the video-text contrastive loss of each pair consists of two symmetric terms, one for video-to-text contrastive learning:

L_v2t^i = -log [ exp(s(P_cls^i, T_cls^i) / τ) / Σ_{j=1..B} exp(s(P_cls^i, T_cls^j) / τ) ],

and the other for text-to-video contrastive learning:

L_t2v^i = -log [ exp(s(P_cls^i, T_cls^i) / τ) / Σ_{j=1..B} exp(s(P_cls^j, T_cls^i) / τ) ],

where τ is a temperature parameter.

Redundancy-aware Video-language Pre-training

In this section, we introduce our proposed Redundancy-aware video-language pre-training in detail. An overview of our approach is given in Figure 2. First, we introduce how to measure cross-modal redundancy. Then, we introduce how to reduce the impact of redundancy on video-language pre-training.
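Assuming the standard CLIP-style InfoNCE form with temperature τ (the usual formulation for this loss; τ is our notation), both symmetric directions of the contrastive loss can be sketched as:

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def vtc_loss(P_cls: np.ndarray, T_cls: np.ndarray, tau: float = 0.07) -> float:
    """P_cls, T_cls: (B, d) L2-normalized video/text [CLS] embeddings; row i of
    each matrix forms the matched pair, so the B x B diagonal holds positives."""
    logits = P_cls @ T_cls.T / tau
    i = np.arange(len(logits))
    l_v2t = -log_softmax(logits, axis=1)[i, i].mean()  # video-to-text direction
    l_t2v = -log_softmax(logits, axis=0)[i, i].mean()  # text-to-video direction
    return float(l_v2t + l_t2v)
```

Matched batches should yield a lower loss than mismatched ones, which is a quick sanity check for the implementation.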

Cross-modal Minimum Dis-similarity as Redundancy
We use the similarity function in equation (1) to calculate the cross-modal dis-similarity between a patch feature P_n and a token feature T_l:

M_{n,l} = 1 − s(P_n, T_l).

As shown in Figure 2, we calculate the dis-similarity between all non-[CLS] patch features {P_n} (n = 1..N) and all non-[CLS] token features {T_l} (l = 1..L), resulting in a dis-similarity matrix M of dimension N × L. Each row of M denotes the dis-similarities between a patch feature P_n and all non-[CLS] token features. We take the minimum value of M by row as the visual redundancy of each patch:

vr_n = min_{1≤l≤L} M_{n,l}.

Symmetrically, each column of M denotes the dis-similarities between a token feature T_l and all non-[CLS] patch features. We take the minimum value of M by column as the textual redundancy of each token:

tr_l = min_{1≤n≤N} M_{n,l}.
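Assuming the dis-similarity is computed as one minus the dot product of the already L2-normalized features (our reading of the similarity function above), the redundancy measurement reduces to row/column minima of the matrix M:

```python
import numpy as np

def redundancy(patch_feats: np.ndarray, token_feats: np.ndarray):
    """patch_feats: (N, d), token_feats: (L, d); both L2-normalized, [CLS] excluded.
    Returns per-patch visual redundancy vr (N,) and per-token textual redundancy tr (L,)."""
    M = 1.0 - patch_feats @ token_feats.T  # (N, L) dis-similarity matrix
    vr = M.min(axis=1)                     # visual redundancy: row-wise minimum
    tr = M.min(axis=0)                     # textual redundancy: column-wise minimum
    return vr, tr
```

A patch identical to some token gets redundancy 0 (fully grounded), while a patch orthogonal to every token gets redundancy 1.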

Redundancy-aware Video-Text Contrastive Learning
We use the redundancy of patches and tokens to improve the contrastive learning process.
Redundancy-aware video-to-text contrastive learning In the original video-to-text contrastive learning, there is only one positive text sample for a given video. To reduce the impact of textual redundancy on video embedding learning, we also treat all non-[CLS] tokens in the positive text as positives, but we assign higher weights to low-redundancy token features in the loss calculation, with weight w_l = 1 − tr_l for token l. The weighted loss constrains video embeddings to pay more attention to low-redundancy token features while ignoring the high-redundancy ones.
Redundancy-aware text-to-video contrastive learning Symmetrically, in the original text-to-video contrastive learning, there is only one positive video sample for a given text. To reduce the impact of visual redundancy on text embedding learning, we also treat all non-[CLS] video patches in the positive video as positives, but we assign higher weights to low-redundancy patch features in the loss calculation, with weight w_n = 1 − vr_n for patch n. The weighted loss constrains text embeddings to pay more attention to low-redundancy patch features while ignoring the high-redundancy ones.
Redundancy-aware contrastive learning Overall, redundancy-aware contrastive learning (RaCL) in both directions constrains the embeddings of one modality to focus more on the low-redundancy local features of the other modality. Therefore, RaCL allows the two modalities to mutually guide each other to learn the correct shared semantics. The RaCL loss L_RaCL is defined as the sum of the losses in both directions. Video-text pre-training usually trains some auxiliary tasks to help convergence, such as language modeling tasks; the losses of these tasks are collectively referred to as L_others. The total loss is then calculated as L = L_others + λ · L_RaCL, where λ is a balancing hyperparameter. We simply set λ to 1.0.
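One redundancy-weighted direction can be sketched as below: each non-[CLS] token of the matched text is an extra positive with weight w_l = 1 − tr_l, and the [CLS] embeddings of the other texts in the batch serve as negatives. The exact normalization of the weighted terms is our assumption, not a statement of the paper's precise loss:

```python
import numpy as np

def racl_v2t(v_cls: np.ndarray, pos_tokens: np.ndarray, neg_texts: np.ndarray,
             tr: np.ndarray, tau: float = 0.07) -> float:
    """v_cls: (d,) video [CLS] embedding; pos_tokens: (L, d) token features of the
    matched text; neg_texts: (B-1, d) [CLS] embeddings of other texts in the batch;
    tr: (L,) textual redundancy per token. All feature vectors L2-normalized."""
    w = 1.0 - tr                                   # low redundancy -> high weight
    pos = np.exp(v_cls @ pos_tokens.T / tau)       # (L,) per-token positive terms
    denom = pos.sum() + np.exp(v_cls @ neg_texts.T / tau).sum()
    # Weighted average of the per-positive contrastive terms.
    return float(-(w * np.log(pos / denom)).sum() / w.sum())
```

The symmetric text-to-video direction would swap the roles of tokens and patches and use w_n = 1 − vr_n.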

Backbone Network
We use the network of (Li et al., 2022b) as the backbone for our video-language pre-training, with a ViT video encoder and a BERT text encoder. Since the auxiliary tasks are not the core of this paper, we describe them in Appendix A. Considering that our method is an optimization of video-text contrastive learning, it can also be effective on other backbone networks that use contrastive learning.
LSMDC (Rohrbach et al., 2015) is a clip dataset of 118,081 videos, each with a caption description.
The length of the videos varies from 2 seconds to 30 seconds. In LSMDC, the training split consists of 101,079 videos and the validation split of 7,408 videos. We report results on the test split, which contains 1,000 videos.

Implementation Details
During pre-training, we conduct experiments on 64 NVIDIA V100 GPUs using the PyTorch framework (Paszke et al., 2017). We initialize our video encoder with ViT-B/16 (Dosovitskiy et al., 2020) with 12 layers. The text encoder is initialized with BERT-base (Devlin et al., 2018). We randomly sample 4 frames from each video and resize each frame to 256×256. Then we split each resized frame into 16×16-pixel patches, following the ViT-B/16 patch size.
During the fine-tuning stage, we perform our experiments on 8 NVIDIA V100 GPUs. We sparsely sample 4 frames and resize them to the same frame size (256×256) as in the pre-training stage. The learning rate is initialized to 1e-5. For each benchmark dataset, we select a checkpoint according to the results on the validation split and evaluate that checkpoint on the test split. For MSRVTT-9k, which has no validation split, we train the model for 10 epochs and choose the final checkpoint. For inference, following (Li et al., 2022a), we uniformly sample 8 frames from each video to ensure reproducibility.

Experimental Results
We show results on the four text-video retrieval datasets. Since not every previous work has evaluated zero-shot and fine-tuning performance on all datasets, the baseline methods may differ across datasets. In general, our method achieves significant improvements over the compared methods on all datasets.

MSVD Results
Table 3 compares RaP with existing methods on MSVD. The zero-shot performance of RaP surpasses SupportSet and is even slightly higher than FiT's fine-tuned result on R1. After fine-tuning, RaP achieves improvements of more than 6.3%, 6.4%, and 3.3% over other fine-tuned models in R1, R5, and R10 scores.

LSMDC Results
Due to the ambiguity of its text descriptions, LSMDC is a more challenging dataset, and the results of previous methods are relatively low. Table 4 shows that RaP outperforms all previous methods in the fine-tuning setup, demonstrating RaP's generalization ability in complex scenarios.
From the performance on these different datasets, RaP significantly outperforms the previous methods, which illustrates the importance of reducing the impact of inter-modal redundancy in video-language pre-training. Meanwhile, the improvement also justifies our redundancy measurement and validates the effectiveness of redundancy-aware contrastive learning.

Ablations and Analysis
Effect of redundancy-aware contrastive learning To further verify the effectiveness of redundancy-aware contrastive learning, we compare the experimental results of the model with and without RaCL on the MSRVTT dataset. As shown in Table 5, removing RaCL decreases performance. Therefore, RaCL plays an important role in our method.

Effect of the number of frames We further explore how the number of frames used in sparse sampling during pre-training affects the performance of RaP. We compare sampling 1, 2, 4, and 8 frames from each video, respectively, and report zero-shot results on the MSRVTT dataset. As shown in Table 6, when sampling no more than 4 frames, the model's performance increases with the number of frames. However, when the number of sampled frames increases to 8, the results begin to drop. We believe this is due to the excessive visual redundancy introduced when sampling 8 frames, making it more difficult for the model to learn. Perhaps larger training data can alleviate this problem, which we leave for future work.
Effect of the coefficient λ In Equation 10, a coefficient λ is used to balance the RaCL loss in the total loss L. As shown in Table 7, we list the zero-shot results of the model on the MSRVTT dataset as we vary the value of λ. We find that λ = 1.0 performs best on R1.
Qualitative Analysis We provide further visualizations for qualitative analysis. Specifically, we visualize the weight maps between the [CLS] embedding from one modality and the non-[CLS] embeddings from the other, which is a bidirectional process.

Conclusion
In this paper, we summarize the inter-modal redundancy in video-language pre-training and propose a redundancy measurement. Then, we propose redundancy-aware contrastive learning to alleviate redundancy. Significant improvements on several text-video retrieval datasets justify our redundancy measurement and validate the effectiveness of redundancy-aware contrastive learning.

Limitations
Taking the maximum similarity as the weight is a relatively weak constraint. When the redundant information exceeds a certain limit, it may decrease the model's performance. We conjecture that introducing an attention module to generate weights in future work may improve the model's performance under high redundancy.
Figure 4: The pre-training tasks. In addition to our designed redundancy-aware contrastive learning task, there are auxiliary tasks: a video-text matching task, a video-grounded language model task, and two intra-modal contrastive learning tasks.

Figure 1 :
Figure 1: Examples of inter-modal redundancy. (a) Visual redundancy: the pixels in the red box are redundant with respect to the text description. (b) Textual redundancy: the token "baseball" in red font does not correspond to any portion in the video frame.

Figure 3 :
Figure 3: Visualization of the effect of identifying redundancy. We train the model without RaCL as our baseline. (a) Visual redundancy: the text [CLS] embedding is the query, and the video non-[CLS] patch embeddings are the keys. The darker the blue, the higher the redundancy; the darker the red, the lower the redundancy. (b) Textual redundancy: the video [CLS] embedding is the query, and the text non-[CLS] token embeddings are the keys. The higher the value, the lower the redundancy.
Fig 3-(a) shows the visualization of the weights allocated to each patch. Compared with the scattered attention of the baseline model, RaP filters out redundant patches and pays more attention to patches that match the text. For example, RaP focuses on the two men fighting in the lower-left corner of the left frame, and it also attends to the player in the middle of the right frame. The weights of each token are visualized in Fig 3-(b). Compared with the baseline model, RaP increases the weights of text entities that appear in the frame and decreases the weights of entities missing from the frame. Through RaCL, the [CLS] embedding in RaP can filter out redundant information and focus on the relevant information between modalities, which accounts for RaP's significant improvement in the text-video retrieval task.

Table 3 :
Comparisons with state-of-the-art text-to-video retrieval methods under zero-shot and fine-tuning setups on MSVD. We treat each sentence as a textual query. ♣: Methods using the WebVid2M and CC3M datasets (5.5M).

Table 5 :
Ablation study of the newly proposed RaCL. We report the results on MSRVTT.

Table 6 :
Ablation study of the number of frames. We report zero-shot results on MSRVTT.

Table 7 :
Ablation study of the coefficient λ. We report zero-shot results on MSRVTT.