LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling

Recent large-scale video-language pre-trained models have shown appealing performance on various downstream tasks. However, the pre-training process is computationally expensive due to the requirement of millions of video-text pairs and the redundant data structure of each video. To mitigate these problems, we propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks, without heavy pre-training. To enhance the temporal modeling lacking in the image-language model, we propose to add temporal attention modules in the image encoder of BLIP with dynamic temporal scaling. Besides the model-wise adaptation, we also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text. Experimental results on text-video retrieval and video question answering show that the proposed LiteVL even outperforms previous video-language pre-trained models by a clear margin, though without any video-language pre-training.


Introduction
The increasing popularity of videos on various social media platforms has aroused interest in efficient modeling of videos and their connections with other modalities like text. Video-language modeling aims to learn a shared multimodal semantic space for videos and texts to facilitate downstream tasks like text-video retrieval and video question answering (VideoQA). Previous video-language modeling usually relies on pre-training on large-scale video-text pairs, via video-text contrastive learning (Luo et al., 2021; Li et al., 2022a; Gorti et al., 2022), video-text matching (Fan et al., 2019; Luo et al., 2020; Li et al., 2022a), masked language modeling (Devlin et al., 2019) and masked frame modeling (Sun et al., 2019); or on extra object detectors to extract fine-grained visual features (Zhu and Yang, 2020; Chen et al., 2020). However, both pre-training on video-text pairs and using an off-the-shelf detector are computationally expensive and inefficient. Inaccurate detection results on limited categories may also lead to inferior performance.
In the unimodal video domain, TimeSformer (Bertasius et al., 2021) skips pre-training a video encoder directly on large-scale video datasets, and instead obtains good performance by initializing from a pre-trained Transformer-based ViT (Dosovitskiy et al., 2021) image encoder and training additional temporal attention modules directly on downstream video tasks. Similarly, ViViT (Arnab et al., 2021) also takes advantage of the already well-learned spatial visual representation in a pre-trained ViT model, and effectively adapts it for video tasks by directly fine-tuning on comparatively small downstream video datasets.
Inspired by TimeSformer and ViViT, in this paper we also consider extending an image-language pre-trained model to video-text tasks without pre-training on video-text pairs. This requires us to not only leverage the already-learned alignment between spatial visual information and text in the image-language model, but also capture the additional temporal dependency efficiently. Thus we propose a simple yet efficient video-language model, LiteVL, initialized from the recent pre-trained image-language model BLIP, but with both model-wise and feature-wise enhancement of temporal information. For model-wise enhancement, we propose to explicitly insert temporal attention layers with learnable scalings into the original image backbone, which can be adjusted for each downstream task. For feature-wise enhancement, we design a non-parametric pooling method to learn fine-grained spatial-temporal video features conditioned on the text description.
Empirical results on various tasks demonstrate that the proposed model, LiteVL, outperforms previous state-of-the-art methods by a clear margin, even without any video-language pre-training or the use of object detectors. In particular, LiteVL achieves a 50.8% R1 score on the MSRVTT-9k dataset for text-video retrieval, and 42.9% accuracy on the MSRVTT-QA dataset for video question answering. Visualizations also demonstrate that our LiteVL captures important spatial-temporal information with fine-grained video-text alignment.
2 Related Work

Vision Transformers
Transformers (Vaswani et al., 2017), originally designed for natural language tasks, have recently been applied to the computer vision domain to model images and videos (Dosovitskiy et al., 2021; Liu et al., 2021; Touvron et al., 2021; Wang et al., 2021; Han et al., 2021). ViT (Dosovitskiy et al., 2021) is one of the most representative vision transformers: it processes each image as a sequence of image patches, and achieves remarkable performance on various image tasks.
Compared with image tasks, video understanding is more challenging because the additional temporal dimension brings more complicated spatial-temporal information. To model the intertwined dependency of the spatial and temporal dimensions efficiently, the video Transformers TimeSformer (Bertasius et al., 2021) and ViViT (Arnab et al., 2021) use the parameters of a well-trained image Transformer for initialization, and further design different variants of spatial-temporal attention mechanisms to capture the spatial-temporal dependencies.

Video-Language Modeling
The core of video-language models lies in modeling the interaction between the two modalities. Depending on whether video-text pairs are used for pre-training, existing video-language models can be divided into two categories.
The main branch of works explicitly designs the spatial-temporal structure in video encoders, and pre-trains with abundant video-text pairs to directly align videos and texts. Among these works, ALPRO (Li et al., 2022a), Frozen (Bain et al., 2021), and BridgeFormer (Ge et al., 2022) use WebVid2M (Bain et al., 2021), which contains 2.5M video-text pairs collected from the web, for pre-training. Image-text pairs like CC3M (Sharma et al., 2018) and VG (Krishna et al., 2016) are also often used to enhance the spatial information in this alignment. NoiseEST (Amrani et al., 2021), VideoCLIP (Xu et al., 2021), and CLIP4Clip (Luo et al., 2021) pre-train the model with the large-scale dataset HowTo100M (Fan et al., 2019), which contains 136M video-text pairs. The other branch of works does not rely on video-text pre-training. Instead, they extend a pre-trained image-text model to extract video features, and directly fine-tune on downstream tasks. In ClipBERT (Lei et al., 2021), BLIP (Li et al., 2022b), and X-Pool (Gorti et al., 2022), each video is viewed as a collection of images, whose representations obtained from an image encoder are then used to represent the video for later interaction with the text.

Method
In this section, we propose a method to efficiently extend an image-language pre-trained model to a video-language model, without pre-training on video-text pairs or the use of object detectors.
In Section 3.1, we first introduce the model architecture. Then we propose to enhance the temporal information in both model-wise and feature-wise manners. For model-wise enhancement, we propose to insert temporal attention layers with learnable scalings into the original image backbone (Section 3.2). For feature-wise enhancement, we design a non-parametric pooling method to learn fine-grained spatial-temporal video features conditioned on the text description (Section 3.3).

Model Architecture
As shown in Figure 1, our framework contains three parts: a video encoder, a text encoder, and a video-grounded text encoder. We initialize our framework from the recently proposed image-language model BLIP (Li et al., 2022b), trained over massive image-text pairs.

Video Encoder. To enhance the temporal dependency of the video encoder, following TimeSformer (Bertasius et al., 2021), we insert additional temporal attention modules into the original BLIP image encoder, whose weights are initialized from the original spatial attention modules (Figure 1a). We use the Divided Space-Time Attention proposed in TimeSformer: we first compute the temporal attention by comparing each patch with the patches at the same spatial location in different frames, and then compute the spatial attention within each frame separately.
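The divided attention pattern can be sketched in a few lines of numpy. This is a shape-level illustration only: the toy single-head attention without projection weights, the omission of the [CLS] token, and the example dimensions are all simplifying assumptions, not the actual model code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Toy single-head attention without learned projections, for shape illustration.
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def divided_space_time_attention(patches, T, S):
    # patches: (T*S, D) patch tokens of one video clip ([CLS] token omitted here).
    D = patches.shape[-1]
    x = patches.reshape(T, S, D)

    # Temporal attention: each spatial location attends across the T frames.
    xt = x.transpose(1, 0, 2)          # (S, T, D): sequences over time
    xt = self_attention(xt)
    x = xt.transpose(1, 0, 2)          # back to (T, S, D)

    # Spatial attention: each frame attends over its own S patches.
    x = self_attention(x)              # (T, S, D): sequences over space
    return x.reshape(T * S, D)

out = divided_space_time_attention(np.random.randn(4 * 9, 16), T=4, S=9)
print(out.shape)  # (36, 16)
```

The key point is the two transposes: temporal attention compares a patch only with same-position patches in other frames, while spatial attention stays within a frame.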

Video-grounded Text Encoder. This encoder shares its parameters with the unimodal text encoder. Moreover, to fuse the video features from the video encoder, an additional cross-attention layer is added between the self-attention layer and the feed-forward network in each transformer layer. Following BLIP (Li et al., 2022b), we also prepend a special [Encode] token to the text sequence, and its output embedding is used as the multimodal representation of the video-text pair.

Dynamic Temporal Scaling
We wish to preserve the spatial representation and its alignment with the text encoder learned by the image-language pre-trained model, while also learning temporal expressivity for video-language tasks. As will be shown in Table 6, directly using TimeSformer yields better results than the original ViT. To provide sufficient temporal expressiveness in the video encoder, we propose to learn a set of scalings that dynamically adjust the newly inserted temporal attention modules according to each specific task, as shown in Figure 1b.
Specifically, denote the output feature of the temporal attention for frame t at the l-th Transformer layer as V_{l,t}^{TAttn}. We add a learnable scaling factor α_{l,t} ∈ R with a tanh-gating mechanism:

α_{l,t} = 1 + tanh(γ_{l,t}),

where γ_{l,t} is a learnable scalar initialized at 0. The scaled output of the temporal attention is then

Ṽ_{l,t}^{TAttn} = α_{l,t} · V_{l,t}^{TAttn},

applied before the residual connection. Note the [CLS] token is kept but not involved in the computation of scaling. The choice of tanh-gating ensures that α_{l,t} ranges from 0 to 2. Initially, our model is equivalent to TimeSformer (i.e., α_{l,t} = 1, treating each frame equally), and then explicitly reweights the frames in each transformer block during the fine-tuning stage. When α_{l,t} reduces to 0, the video encoder degenerates to the ViT used in the original BLIP model, which does not consider any temporal dependency when extracting the video features.
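A minimal sketch of the tanh-gated scaling follows. The tensor shapes and the helper name are illustrative assumptions; in the model, γ_{l,t} would be a learned parameter per layer and frame rather than a plain array.

```python
import numpy as np

def dynamic_temporal_scaling(v_tattn, gamma):
    # v_tattn: (T, S, D) temporal-attention output, one slice per frame.
    # gamma:   (T,) learnable scalars, initialized to 0 so that
    #          alpha = 1 + tanh(0) = 1 recovers plain TimeSformer behaviour.
    alpha = 1.0 + np.tanh(gamma)           # each alpha_t lies in (0, 2)
    return alpha[:, None, None] * v_tattn  # scaled before the residual connection

v = np.random.randn(8, 196, 768)           # 8 frames, 196 patches, 768-dim features
gamma = np.zeros(8)
out = dynamic_temporal_scaling(v, gamma)
assert np.allclose(out, v)                 # gamma = 0: unscaled TimeSformer output
```

Driving γ strongly negative pushes α toward 0, i.e., toward the original BLIP ViT with no temporal attention, which matches the degenerate case described above.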

Text-dependent Pooling
Before interacting with the textual modality via cross-attention or self-attention, previous methods directly concatenate the features from all frames with equal importance (Li et al., 2022b; Ge et al., 2022), or aggregate the video features with heuristic mean/max pooling spatially or temporally (Luo et al., 2021; Li et al., 2022a). However, not all frames or spatial positions are equally representative of the whole video, and different frames or positions have different semantic similarities to the textual query (e.g., the textual description in text-video retrieval or the textual question in video question answering). For example, given the video description "a golf player is trying to hit the ball into the pit", the video encoder is expected to focus on the object of interest (i.e., the ball) and the motion of hitting across the frames. As illustrated in Figure 2, we design a non-parametric text-dependent pooling to reweight the video features spatially and temporally depending on the corresponding textual query, enabling fine-grained video-text alignment.
Specifically, given a video with T frames, each frame is patchified into S patches, and a [CLS] token is inserted before the ST patches. Denote the original output embedding of the video encoder as V_L ∈ R^{(1+ST)×D}. V_ft ∈ R^{T×D} and V_fs ∈ R^{S×D} are the video features pooled by averaging V_L along the spatial and temporal dimensions, respectively. Note that the feature of the [CLS] token is not involved in the averaging. Denote t_cls ∈ R^D as the output embedding of the [CLS] token obtained from the text encoder.
Intuitively, the more similar a visual feature is to the text description, the more representative it is for understanding the content of the whole video. Thus we compute the similarity between the ℓ2-normalized features of each frame in V_ft^norm and the text feature t_cls^norm, and reweight the features in V_ft as:

g_t = T · softmax(V_ft^norm t_cls^norm / τ),   Ṽ_ft = g_t ⊙ V_ft,

where ⊙ denotes element-wise multiplication, and τ is the temperature which controls the sharpness of the weight distribution. We multiply the weights from the softmax function by the number of frames T, so that the total weight is the same as with direct concatenation. Similarly, we compute the similarity between the ℓ2-normalized features of each spatial position in V_fs^norm and the text feature t_cls^norm, and reweight the features in V_fs as:

g_s = S · softmax(V_fs^norm t_cls^norm / τ),   Ṽ_fs = g_s ⊙ V_fs.

The final aggregated video feature fed to the video-grounded text encoder is a concatenation of Ṽ_ft, Ṽ_fs, and the original video feature V_L:

V_f = [Ṽ_ft; Ṽ_fs; V_L].

Remark 1. Besides using the text to reweight the aggregated features after spatial pooling (i.e., V_ft) and temporal pooling (i.e., V_fs), a simple baseline is to directly concatenate them with the original features V_L to compose V_f as:

V_f = [V_ft; V_fs; V_L].

We dub this vanilla pooling. Despite its simplicity, this pooling achieves competitive performance (Table 6).
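The pooling above can be sketched as follows. The toy dimensions, the assumption that the [CLS] token sits at index 0 of the encoder output, and the function names are illustrative; only the pooling arithmetic mirrors the description in the text.

```python
import numpy as np

def l2norm(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def text_dependent_pooling(v_l, t_cls, T, S, tau=1.0):
    # v_l:   (1 + T*S, D) video encoder output, leading row is the [CLS] token.
    # t_cls: (D,) [CLS] embedding from the text encoder.
    D = v_l.shape[-1]
    patches = v_l[1:].reshape(T, S, D)          # [CLS] excluded from averaging
    v_ft = patches.mean(axis=1)                 # (T, D): spatially pooled per frame
    v_fs = patches.mean(axis=0)                 # (S, D): temporally pooled per position

    g_t = T * softmax(l2norm(v_ft) @ l2norm(t_cls) / tau)  # frame weights, sum to T
    g_s = S * softmax(l2norm(v_fs) @ l2norm(t_cls) / tau)  # position weights, sum to S

    # Concatenate reweighted pooled features with the original features.
    return np.concatenate([g_t[:, None] * v_ft, g_s[:, None] * v_fs, v_l], axis=0)

v_f = text_dependent_pooling(np.random.randn(1 + 4 * 9, 16), np.random.randn(16),
                             T=4, S=9)
print(v_f.shape)  # (50, 16): T + S + (1 + T*S) rows
```

Dropping the two g terms (i.e., using the unweighted v_ft and v_fs) yields the vanilla pooling baseline of Remark 1.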

Training Objectives
After obtaining the aggregated video features V_f, we feed them to each cross-attention layer of the video-grounded text encoder. Consider a training batch with B video-text pairs. For the k-th video-text pair, denote the ℓ2-normalized output embeddings of the [CLS] tokens from the video encoder and the text encoder as v_cls^k and t_cls^k, respectively. The output embedding of the [Encode] token of the video-grounded text encoder is denoted as t_enc^k.

Text-Video Retrieval. Contrastive learning alone has recently been found to learn better representations than its predictive counterpart in multi-modal pre-training (Radford et al., 2021). When used together with the predictive counterpart (Li et al., 2022a), it also boosts the performance. To align the video encoder and text encoder, we utilize both the contrastive and predictive learning objectives. We apply contrastive learning over the output representations of the video encoder and the text encoder by optimizing a symmetric InfoNCE loss. The video-to-text contrastive loss L_v2t is:

L_v2t = -(1/B) Σ_{k=1}^B log [ exp(v_cls^k · t_cls^k / τ_c) / Σ_{j=1}^B exp(v_cls^k · t_cls^j / τ_c) ],

where τ_c is a learnable temperature parameter initialized as 0.07. Similarly, the text-to-video contrastive loss L_t2v is:

L_t2v = -(1/B) Σ_{k=1}^B log [ exp(t_cls^k · v_cls^k / τ_c) / Σ_{j=1}^B exp(t_cls^k · v_cls^j / τ_c) ].
The video-text contrastive loss is defined as:

L_vtc = (L_v2t + L_t2v) / 2.

Following Li et al. (2022a), besides the contrastive loss, we also use a video-text matching loss L_vtm, which predicts whether a pair of video and text is matched or not. For the k-th video-text pair, we map the joint video-text embedding t_enc^k to a two-class probability p_vtm^k, and calculate L_vtm as:

L_vtm = (1/B) Σ_{k=1}^B CE(y_vtm^k, p_vtm^k),

where y_vtm^k is a 2-dimensional one-hot vector representing the ground-truth label, and CE(·, ·) is the cross-entropy loss. The in-batch negatives used for L_vtm are mined based on the contrastive similarity following Li et al. (2021). The overall training objective is:

L = L_vtc + L_vtm.

Video Question Answering. VideoQA is treated as a classification task over K candidate answers: the joint embedding t_enc^k is mapped to a probability p_ans^k over the K answers, and we minimize the cross-entropy CE(y_ans^k, p_ans^k), where y_ans^k is a K-dimensional one-hot classification label.
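A sketch of the symmetric InfoNCE objective under the definitions above. The learnable temperature is fixed here for simplicity, and the toy batch of identical video/text embeddings is an assumption used only to exercise the function.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def video_text_contrastive_loss(v_cls, t_cls, tau_c=0.07):
    # v_cls, t_cls: (B, D) L2-normalized [CLS] embeddings; the k-th video
    # matches the k-th text, so the positives sit on the diagonal.
    sim = v_cls @ t_cls.T / tau_c                               # (B, B)
    idx = np.arange(len(sim))
    l_v2t = -log_softmax(sim, axis=1)[idx, idx].mean()          # video -> text
    l_t2v = -log_softmax(sim.T, axis=1)[idx, idx].mean()        # text -> video
    return (l_v2t + l_t2v) / 2                                  # L_vtc

B, D = 4, 16
x = np.random.randn(B, D)
x /= np.linalg.norm(x, axis=1, keepdims=True)
print(video_text_contrastive_loss(x, x))  # near zero: every pair matches itself
```

Shuffling one side of the batch breaks the diagonal correspondence, so the loss rises sharply, which is the behaviour the contrastive objective relies on.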

Experiments
In this section, we evaluate the efficacy of the proposed LiteVL on text-video retrieval and video question answering (VideoQA) tasks. We initialize the weights of LiteVL from BLIP (Li et al., 2022b), which uses a ViT-B/16 as the image encoder and a BERT-base as the text encoder, with additional cross-attention layers for the image-grounded text encoder. We use the two BLIP variants pre-trained on 14M and 129M image-text pairs, and the corresponding LiteVL models initialized from them are dubbed LiteVL S and LiteVL L, respectively.
During fine-tuning, we randomly sample 8 and 16 frames per video for the retrieval and VideoQA tasks, respectively, while at inference the frames are uniformly sampled. Following previous works (Ge et al., 2022; Gorti et al., 2022) and BLIP's pre-training setting, we resize each raw frame to 224×224 before feeding it into the model. For the text-dependent pooling, the temperature τ in Eq. (3) is set to 1.0 by default. More training details and hyperparameters are in Appendix A.

Text-Video Retrieval
Datasets and Metrics. We fine-tune on two text-video retrieval datasets: (i) MSRVTT (Xu et al., 2016) consists of 10k videos and 200k text captions. Each video is paired with about 20 manually-labeled captions, and lasts about 10 to 32 seconds. There are two widely used ways to split the dataset, i.e., MSRVTT-7k (Miech et al., 2019) and MSRVTT-9k (Gabeur et al., 2020), which have 7k and 9k videos for training, respectively. For a comprehensive comparison with previous works, we use both splits, which share the same 1k testing videos (Bain et al., 2021). (ii) DiDeMo (Hendricks et al., 2017) consists of 10k Flickr videos annotated with 40k text captions. We evaluate text-video retrieval following Lei et al. (2021), where all captions for the same video are concatenated into a single query.
We evaluate text-video retrieval with R@k and MdR following Bain et al. (2021). R@k denotes the recall (%) of the correct item within the top-k retrieved results, and MdR is the median rank of the correct video.
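These metrics can be computed as below, assuming (purely for illustration) that the ground-truth match for the i-th query is the i-th video:

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    # sim: (N_text, N_video) similarity scores; ground truth lies on the diagonal.
    ranks = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)                           # descending score
        ranks.append(int(np.where(order == i)[0][0]) + 1)  # 1-based rank of the match
    ranks = np.array(ranks)
    metrics = {f"R@{k}": 100.0 * (ranks <= k).mean() for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    return metrics

sim = np.eye(5) + 0.01 * np.random.rand(5, 5)  # diagonal is always the best match
print(retrieval_metrics(sim))  # R@1 = 100.0, MdR = 1.0 for this toy matrix
```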
Comparison with BLIP. The original BLIP concatenates the image features of all frames as the aggregated video feature and feeds it to the image-grounded text encoder. In Table 1, we compare our proposed LiteVL against the original BLIP as well as its variants with increased resolution (i.e., 384×384), extra training on the COCO (Lin et al., 2014) retrieval dataset, and fine-tuning. As can be seen, though inherited from the BLIP model, our proposed LiteVL clearly outperforms the original BLIP due to the explicit temporal modeling in both model-wise and feature-wise manners. In particular, LiteVL S improves the R1 of the best-performing BLIP (14M) variant by 2.0, 2.6, and 2.4 points on MSRVTT-7k, MSRVTT-9k, and DiDeMo, respectively.
Comparison with Other Methods. Table 2, Table 3, and Table 4 compare LiteVL with recent methods on text-video retrieval on MSRVTT-7k, MSRVTT-9k, and DiDeMo, respectively. On all three datasets, LiteVL surpasses previous works by a clear margin, including methods that require heavy video-text pre-training.

Dynamic Temporal Scaling
In Table 6, we compare the proposed dynamic temporal scaling against (i) constant scaling α_{l,t} = 1 in Eq. (2), which reduces to directly using TimeSformer as the video encoder; and (ii) constant scaling α_{l,t} = 0 in Eq. (2), which reduces to using the ViT image encoder as the video encoder. As can be seen, the proposed dynamic scaling learned for each task performs better than the two special cases. By adopting TimeSformer (α_{l,t} = 1) instead of ViT (α_{l,t} = 0), the performance is boosted since the temporal dependency is considered via the additional temporal attention module. With the proposed lightweight temporal scaling adjusting the frame-level importance according to each specific task, the performance is further improved.
Visualization. As shown in Figure 3, we visualize the average of the learned scalings γ_{l,t} at each layer of the video encoder for both retrieval (i.e., MSRVTT-7k, MSRVTT-9k) and VideoQA (i.e., MSRVTT-QA, MSVD-QA) tasks. For all datasets, the average scaling is lower than 0 at the first layer and then shows an upward trend as the depth increases. This indicates that the shallow layers focus more on understanding the content of each frame, and pay less attention to the temporal dependency among different frames. As the depth increases, the spatial feature of each frame becomes more global (Dosovitskiy et al., 2021), and the model gradually seeks to learn the temporal dependencies among the frames.

Text-dependent Pooling
In Table 6, we compare our proposed text-dependent pooling from Section 3.3 against several baseline pooling methods using different combinations of the original features V_L, the spatially pooled features V_ft, and the temporally pooled features V_fs.
As can be seen, compared with using only the original features, using either additional spatially or temporally pooled features improves the performance, and combining both of them further boosts performance.When coupled with the reweighting mechanism in Section 3.3, our proposed LiteVL obtains the best performance.
In addition, since the spatially or temporally pooled features are much smaller than the original features, using them only marginally increases the computation and memory cost of the cross-attention module in the video-grounded text encoder. The extra cost incurred here is therefore acceptable.
Effect of Temperature in the Text-dependent Pooling. We vary the temperature τ of the text-dependent pooling between 0.01 and 5.0 to study its effect. As shown in Table 7, both text-video retrieval and video question answering achieve the best performance when τ equals 1.0. Therefore, the temperature τ of this pooling method is set to 1.0 for all datasets by default.
Visualization. To better understand the effect of text-dependent pooling, we use LiteVL S to visualize video-text pairs from the MSRVTT-7k testing set and their corresponding temporal weights (g_t in Eq. (3)). As shown in Figure 4, when the changes among different frames are relatively large, the proposed text-dependent pooling encourages the model to assign higher weights to the frames better described by the caption. For instance, in the first example, the second and fourth frames are more related to the caption "Three kids sing together on the voice." and are assigned higher weights.
On the contrary, as can be seen from the last two examples in Figure 4, when the frames differ only in minor changes and each frame is similarly close to the caption, the learned weights for the frames are also similar. For these cases, we further study the more fine-grained spatial-temporal dependencies using Grad-CAM (Selvaraju et al., 2017) visualizations. We compute Grad-CAM using the cross-attention maps averaged over all attention heads in the 8-th layer (a layer specialized in grounding) of the video-grounded text encoder. The gradients are acquired by maximizing the video-text matching score in Eq. (7).
As can be seen in Figure 5, the proposed LiteVL effectively captures the minor changes among different frames. This also indicates that our proposed text-dependent pooling provides fruitful information for the video-grounded text encoder. More visualizations are in Appendix C.

Extension to Other Image-language Pre-trained Models

In this work, we choose BLIP to initialize our proposed model mainly because (i) it performs well on various downstream image-language tasks; and (ii) it can be regarded as a hybrid of single-stream and dual-stream structures. Its dual-stream part allows efficient inference for cross-modal retrieval tasks, while its cross-attention allows deep cross-modal interaction for tasks like VQA.
On the other hand, the proposed dynamic temporal scaling and text-dependent pooling can also be applied to dual-stream models like CLIP (Radford et al., 2021). For this setting, we conduct a simple experiment: we apply the proposed text-dependent pooling on top of CLIP's video features. As CLIP relies on global features for retrieval, instead of the concatenation in Eq. (5), we compute a weighted average of the reweighted features. Compared with CLIP4Clip, a recent work that also extends CLIP for video retrieval, CLIP with our proposed method improves the best CLIP4Clip-meanP variant by 1.9% and 1.7% in R1 and R10 on MSRVTT-7k, respectively.

Scaling to Larger-scale Retrieval Tasks
Since the test sets of all three retrieval datasets used in this work are relatively small, we compute a pairwise VTM score s_vtm for all video-text pairs during inference. However, inference in this manner becomes slow when the dataset is huge, as in real-world scenarios.
In this section, we provide a more efficient retrieval solution. Specifically, we first compute the video-text similarity score s_vtc for all video-text pairs, then take the top-k candidates and calculate their VTM scores for ranking. This method speeds up inference because k can be set very small compared with the test set size. Table 8 shows that this efficient two-stage retrieval solution causes negligible performance degradation.
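The two-stage procedure can be sketched as follows. Here `vtm_score` is a placeholder standing in for the expensive cross-encoder matching score; in the toy check it simply agrees with the dot product, which is an assumption for testing only.

```python
import numpy as np

def two_stage_retrieval(text_emb, video_emb, vtm_score, k=100):
    # Stage 1: cheap dot-product similarity s_vtc over all candidates.
    s_vtc = text_emb @ video_emb.T                     # (N_text, N_video)
    results = []
    for q in range(len(text_emb)):
        shortlist = np.argsort(-s_vtc[q])[:k]          # top-k candidates
        # Stage 2: expensive VTM score computed only on the shortlist.
        s_vtm = np.array([vtm_score(q, v) for v in shortlist])
        results.append(shortlist[np.argsort(-s_vtm)])  # rerank by s_vtm
    return results

# Toy check with normalized embeddings where text i matches video i exactly.
rng = np.random.default_rng(0)
t = rng.standard_normal((3, 8))
t /= np.linalg.norm(t, axis=1, keepdims=True)
v = t.copy()
ranked = two_stage_retrieval(t, v, lambda q, i: float(t[q] @ v[i]), k=2)
print([int(r[0]) for r in ranked])  # [0, 1, 2]
```

The cost of stage 2 scales with k rather than with the corpus size, which is why a small k makes inference fast with little accuracy loss.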

Conclusion
We propose LiteVL, a video-language model built without heavy video-language pre-training or object detectors. LiteVL inherits the already-learned alignment between spatial visual information and textual information from a pre-trained image-language model. Then, extra temporal attention with dynamic temporal scaling is proposed to learn the temporal dynamics in the video. We also introduce a non-parametric pooling method that aggregates video features conditioned on the text description, enabling fine-grained video-language alignment. Empirical results show that our LiteVL outperforms state-of-the-art methods trained with much more data.

A Detailed Fine-tuning Setups
We list the detailed fine-tuning setups for each dataset in Table 9 and Table 10. For all downstream datasets, we resize each frame to 224×224 unless otherwise stated. Following ALPRO (Li et al., 2022a), we randomly select N_v frames from each video, with N_v/2 frames sampled from the first and second half of the video, respectively. We use RandomAugment (Cubuk et al., 2020) on the frames sampled from each video. For all experiments, we use the same random seed (e.g., 42) to ensure reproducibility.
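The half-and-half frame sampling can be sketched as below; the helper name, the toy frame counts, and the fixed seed are illustrative assumptions:

```python
import random

def sample_frames(num_frames, n_v, seed=42):
    # Draw n_v // 2 random frame indices from each half of the video,
    # following the ALPRO-style sampling described above.
    rng = random.Random(seed)           # fixed seed for reproducibility
    half = num_frames // 2
    first = sorted(rng.sample(range(0, half), n_v // 2))
    second = sorted(rng.sample(range(half, num_frames), n_v // 2))
    return first + second

idx = sample_frames(num_frames=100, n_v=8)
print(len(idx))  # 8
assert all(i < 50 for i in idx[:4]) and all(i >= 50 for i in idx[4:])
```

Sampling from each half separately guarantees temporal coverage of the whole clip even when n_v is small.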

B Comparison with Previous Work
To align video and text features, previous approaches can generally be divided into two categories. On the one hand, dual-stream models such as CLIP4Clip (Luo et al., 2021) obtain a global video feature from global mean pooling or the [CLS] token, and then let the fused video feature interact with the text feature through a simple multilayer perceptron head on top. On the other hand, a cross-attention module is adopted where the key/value are obtained from the aggregated video feature and the query is obtained from the text feature. Previous methods mainly use two ways to aggregate the original output video features V_L into V_f for the video-grounded text encoder: (i) keep the original features without modification; or (ii) apply mean pooling over the spatial or temporal dimension. We provide a more detailed comparison with related works in Table 11, listing how previous works extract the video features used for alignment with text features. In addition, the video encoder, text encoder, and pre-training data used by different methods are also provided.

C More Qualitative Results
We provide more visualizations of the temporal weights g_t in Figure 6. To better understand how text-dependent pooling affects the decision, we take a closer look at the cases where the proposed text-dependent pooling changes the decision over vanilla pooling (Remark 1). We find that the temporal weights of the changed decisions have a clearly higher standard deviation than the unchanged ones, indicating that text-dependent pooling tends to change the decision when the frames are more dissimilar. For instance, for the first case in Figure 6, the caption "The girl shows the boys her medal in this cartoon" is mainly related to the middle two frames. By assigning higher importance to these two frames, the proposed text-dependent pooling makes a correct decision while the vanilla pooling fails.

Figure 1: (a) The architecture of LiteVL. The model is initialized from the pre-trained image-language model BLIP, but is equipped with additional temporal attention modules and text-dependent pooling, to quickly adapt to video-language downstream tasks without pre-training. (b) The proposed dynamic temporal scaling, which adjusts the scale of the newly-added temporal attention according to each downstream task.

Figure 2: Illustration of text-dependent pooling. We reweight the pooled spatial and temporal video features based on the similarities between the normalized text feature t_cls^norm and the visual features V_ft and V_fs.

Figure 3: Average temporal scalings γ_{l,t} for different frames at each layer of the video encoder of LiteVL S trained on different tasks.

Figure 4: Bar plots of temporal weights in text-dependent pooling. Video frames more related to the caption are assigned higher weights.

Figure 5: Grad-CAM visualizations on the cross-attention maps corresponding to the highlighted keywords.

Table 1: Comparison of LiteVL and BLIP on text-video retrieval tasks. The default resolution is 224×224 per video frame, and the superscript "384" means increasing it to 384×384. The subscript "coco" means training with an extra COCO retrieval dataset. † means zero-shot inference used in BLIP by default.

Table 2: Results of text-video retrieval on the test split of MSRVTT-7k. † means zero-shot results reported by the original BLIP paper.

Table 3: Results of text-video retrieval on the test split of the MSRVTT-9k dataset.

Table 4: Results of text-video retrieval on the test split of the DiDeMo dataset.

Table 6: Ablation studies on dynamic temporal scaling and text-dependent pooling of the proposed LiteVL.

Table 7: Effect of different temperatures (τ) in the text-dependent pooling on LiteVL S.

Table 8: Effect of using s_vtc to filter the top-k (k=100) candidates and calculating their s_vtm scores for ranking.