CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

This paper tackles the emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are equally in demand but far less explored, and they bring new challenges: higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding-window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism accelerates inference by 2x on Ego4D-NLQ and 15x on MAD while keeping state-of-the-art accuracy. Code has been released at https://github.com/houzhijian/CONE.


Introduction
Video temporal grounding (Anne Hendricks et al. 2017; Gao et al. 2017) aims to localize temporal moments relevant to a given natural language (NL) query from an untrimmed video, and has become an essential task in multi-modal video understanding with wide applications in video retrieval, video editing, and video question answering. Video grounding is challenging since it relies heavily on accurate understanding of the textual query and the video, as well as multi-modal alignment between them.

Temporal grounding for long videos is especially in high demand due to the flourishing growth of online videos in both quantity and length. However, existing methods (Zhang et al. 2020a,b) mainly focus on datasets (Gao et al. 2017; Krishna et al. 2017; Lei, Berg, and Bansal 2021) with relatively short videos ranging from 0.5 to 2.5 minutes on average. They typically downsample videos via uniform sampling to a fixed-length frame sequence. Thus, directly adopting existing methods for long video grounding raises several major challenges: (1) entire long videos of varying lengths are hard to model without decreasing the sampling rate, leading to high computational cost during inference; (2) the large number of moment candidates in long videos makes precise multi-modal alignment with the NL query more challenging. These issues cause information loss in both the temporal aspect (i.e., fewer visible frames caused by down-sampling) and the contextual aspect (i.e., weaker matching to the objects and scenes in each frame, as the frame content is perturbed by numerous other frames). As the motivating example in Fig. 1 shows, accurate grounding requires coarse-grained localization of relevant video segments within a long video (e.g., "in the room" vs. "outdoor") and fine-grained alignment to the content of frames (e.g., "women" and "holding").
To address the aforementioned challenges, we propose CONE, a window-centric coarse-to-fine alignment framework for long video temporal grounding. CONE flexibly handles long video inputs without decreasing the sampling rate, accelerates inference, and enhances coarse-to-fine multi-modal alignment for precise temporal grounding. To support long video inputs of varying lengths, we slice the entire long video into several fixed-length candidate windows via a sliding-window approach. Centering on the sliding window, we further propose to enhance training and inference by modelling coarse-grained (inter-window) to fine-grained (intra-window) alignment between videos and texts.
At coarse granularity, windows irrelevant to the text are hard for the model to identify, which results in unreliable ranking of inter-window proposals. Modelling the inter-window semantic variance is therefore important, and we enhance it during training via a contrastive learning mechanism: to discriminate the windows containing relevant information, we select a contrastive negative window and design an inter-window contrastive loss. Moreover, since the number of candidate windows grows with video length, we accelerate inference by pre-filtering the candidate windows according to their semantic matching score with the NL query, computed by a contrastive vision-text pre-trained model.
At fine granularity, the multi-modal alignment between the detailed content (e.g., scenes and objects) in intra-window moment proposals and the NL query becomes much more challenging as the length of the video feature sequence increases. To solve this problem, CONE incorporates a novel fine-grained ranking strategy, which leverages the powerful multi-modal alignment ability of a large-scale contrastive pre-trained model (e.g., CLIP) to compute a fine-grained matching score between each video frame and the NL query. These scores are then fused to rank the intra-window moment proposals.
With the above design, CONE has the following advantages: (1) it flexibly handles long video inputs with high efficiency, and (2) it enhances coarse-to-fine multi-modal alignment and alleviates information loss. CONE is evaluated on two benchmarks built on long videos. The state-of-the-art results and significant performance boosts (3.13%→6.87% on MAD and 10.46%→13.46% on Ego4d-NLQ in terms of R1@IoU=0.3) demonstrate the effectiveness of CONE. Further analysis shows that CONE increases inference speed by 2x on Ego4d-NLQ and 15x on MAD compared to inference on all candidate windows, without sacrificing performance.
Contributions. Our contributions are threefold. (1) We propose a window-centric learning framework that supports temporal grounding for long videos without decreasing the sampling rate. (2) We strengthen coarse-to-fine multi-modal alignment via coarse-grained contrastive learning and fine-grained ranking. (3) Experiments demonstrate significant performance improvements on two long video grounding benchmarks and exhibit higher efficiency in handling long video inputs.

Task Definition
Since never-ending video streams in real applications put higher demands on grounding in longer videos, we study the task of video temporal grounding (VTG) in a challenging setting with long video inputs.
Taking a natural language (NL) query $Q$ and a long video $V$ as inputs, the task of VTG requires the system to locate the moment $M$ in the video $V$ that matches the query $Q$. Formally, the video $V = [v_1, v_2, \ldots, v_{L_v}]$ is a sequence of uniformly sampled video frames, where $L_v$ denotes the number of sampled frames. The NL query $Q = [q_1, q_2, \ldots, q_{L_q}]$ is a sequence of tokens with sentence length $L_q$. The moment $M$ is a sub-sequence of $V$ that is relevant to $Q$. This task is especially challenging for long videos because the acceptable input length $L_v$ of current models is limited, leading to difficulties in processing the entire video. Moreover, accurate multi-modal alignment between each $v_i$ and $Q$ also becomes harder as $L_v$ increases.

Moment-DETR
To reliably generate candidate moment proposals for temporal grounding, we choose Moment-DETR (Lei, Berg, and Bansal 2021) as the backbone of our framework because of its strong performance and compact end-to-end design. Moment-DETR is a Transformer-based method with an encoder-decoder architecture.
Specifically, it takes the concatenated features of video frames and query tokens as the model input, which is encoded by a Transformer encoder with outputs $E_{enc} \in \mathbb{R}^{(L_v+L_q) \times d}$. The Transformer decoder generates moment proposal features $E_{dec} \in \mathbb{R}^{N \times d}$ given $E_{enc}$ and a set of $N$ trainable moment queries (a fixed number of trainable embeddings) as inputs. It then has two prediction heads: (1) a saliency prediction head, which computes frame saliency scores $S(V, Q) \in \mathbb{R}^{L_v}$ through a linear layer; and (2) a proposal prediction head, which uses a one-layer FFN to predict a two-class proposal score $(p_i, 1-p_i)$ for proposal $i$, indicating whether it matches the ground truth, and a three-layer FFN to predict the moment center coordinate and width of each proposal, which are further used to compute the start ($b_i$) and end ($e_i$) positions of proposal $i$.
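To make the head structure concrete, below is a minimal PyTorch sketch of the two prediction heads described above. The hidden size, layer names, and the (center, width) to (start, end) decoding are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MomentDETRHeads(nn.Module):
    """Sketch of the two Moment-DETR prediction heads described above."""
    def __init__(self, d: int = 256, num_classes: int = 2):
        super().__init__()
        # (1) saliency head: one linear layer over the L_v encoded video tokens
        self.saliency_head = nn.Linear(d, 1)
        # (2a) one FFN layer predicting a two-class proposal score (match / no match)
        self.class_head = nn.Linear(d, num_classes)
        # (2b) three FFN layers predicting the normalized (center, width) of each proposal
        self.span_head = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 2),
        )

    def forward(self, enc_video: torch.Tensor, dec_queries: torch.Tensor):
        # enc_video: (L_v, d) encoder outputs for the video tokens
        # dec_queries: (N, d) decoder outputs for the N moment queries
        saliency = self.saliency_head(enc_video).squeeze(-1)               # (L_v,)
        cls_logits = self.class_head(dec_queries)                          # (N, 2)
        center, width = self.span_head(dec_queries).sigmoid().unbind(-1)   # (N,), (N,)
        start, end = center - width / 2, center + width / 2                # proposal boundaries b_i, e_i
        return saliency, cls_logits, start, end
```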

Approach
We now present CONE for long video temporal grounding. As shown in the pipeline in Fig. 2, we first slice the long video into several fixed-length video windows using a sliding-window approach (§ 3.1). Centering on the windows, we perform accurate multi-modal alignment in a coarse-to-fine manner. At the coarse-grained level, we pre-filter (§ 3.2) the candidate windows to accelerate inference and conduct inter-window contrastive learning (§ 3.3) to capture inter-window semantic variance during training. At the fine-grained level, we generate candidate intra-window proposals with Moment-DETR (§ 2.2) and rank the proposals with a fine-grained matching mechanism (§ 3.4).

Window-based Video Slicing
To flexibly handle long videos without decreasing the sampling rate, and to alleviate temporal information loss, we first slice the entire video into several video windows via a sliding-window approach. A sliding window of length $L_w$ is slid over the entire video to derive a set of $N_w$ fixed-length video windows $\{[w^b_i, w^b_i + L_w]\}_{i=1}^{N_w}$, where $w^b_i$ is the start index of window $i$. Specifically, we slide the window by increasing $w^b$ with stride $L_w/2$ to guarantee that each moment is covered by two windows. Intuitively, not every window is correlated with the NL query, so we refer to a window overlapping with the ground-truth moment as a positive window, and to any other window as a negative window.
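The slicing and positive/negative labeling can be summarized by the short sketch below; the boundary handling for the last window and the exact overlap test are assumptions.

```python
def slice_windows(num_frames: int, window_len: int):
    """Slice the frame index range [0, num_frames) into fixed-length windows
    with stride window_len // 2, so every moment is covered by two windows."""
    stride = max(1, window_len // 2)
    windows, start = [], 0
    while start < num_frames:
        windows.append((start, min(start + window_len, num_frames)))
        start += stride
    return windows

def is_positive(window, moment):
    """A window is 'positive' if it overlaps the ground-truth moment (start, end)."""
    (w_s, w_e), (m_s, m_e) = window, moment
    return max(w_s, m_s) < min(w_e, m_e)
```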

Inter-window Pre-filtering
A lengthy video input is sliced into a large number of windows at inference time. If the number of windows is $N_w$, the model needs to run the whole encoding-prediction process $N_w$ times, which becomes computationally infeasible as video length increases, especially when the model has many parameters. Therefore, it is necessary to increase inference speed by reliably reducing the number of windows. We propose to pre-filter the candidate windows with a contrastive vision-language pre-trained model, such as CLIP (Radford et al. 2021) or EgoVLP (Lin et al. 2022a).
Pretrained Contrastive Model. CLIP and EgoVLP are pre-trained with multi-modal contrastive learning, which aligns visual representations with their related text representations. They therefore excel at multi-modal alignment and are suitable for efficient matching. We adopt such a model to pre-compute the video frame features and the text feature:
$$\{v_j\}_{j=1}^{L_v} = \mathrm{Enc}_v(V), \qquad q_{[\mathrm{CLS}]} = \mathrm{Enc}_t(Q), \quad (1)$$
where $[\mathrm{CLS}]$ is a special token at the beginning of the text tokens, and $\mathrm{Enc}_v$ and $\mathrm{Enc}_t$ denote the visual and text encoders of the pre-trained model. Since these features can also be shared as the inputs of Moment-DETR, they do not incur extra computation cost compared with the original Moment-DETR.
The multi-modal alignment score $a_j = v_j \cdot q_{[\mathrm{CLS}]}$ is computed via an efficient dot product between the $j$-th video feature and the text feature, and the window-level matching score $A_i$ is the maximum score over all frames in window $i$:
$$A_i = \max_{j \in \mathcal{W}_i} a_j,$$
where $\mathcal{W}_i$ is the set of frame indices in window $i$. We rank all windows by $A_i$ and select the top-$k$ windows for inference. Thus, we reduce the number of candidate windows from $N_w$ to a constant $k$, guaranteeing a controllable computation cost and accelerating inference.
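The pre-filtering step reduces to a dot product, a per-window max, and a top-$k$ selection, as in the following sketch (tensor shapes and the function name are assumptions):

```python
import torch

def prefilter_windows(frame_feats, text_cls_feat, windows, top_k):
    """Query-guided window pre-filtering (sketch).
    frame_feats: (L_v, d) frame features from the contrastive model.
    text_cls_feat: (d,) [CLS] text feature.
    windows: list of (start, end) frame-index pairs.
    Returns the top-k windows ranked by their max frame-query dot product."""
    frame_scores = frame_feats @ text_cls_feat                      # a_j = v_j . q_cls, shape (L_v,)
    window_scores = torch.stack(
        [frame_scores[s:e].max() for s, e in windows])              # A_i = max over frames in window i
    top = window_scores.topk(min(top_k, len(windows))).indices
    return [windows[i] for i in top.tolist()]
```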

Inter-window Contrastive Learning
We conduct contrastive learning to help the model discriminate the semantic difference between positive and negative windows. Directly adopting Moment-DETR for long videos in our framework is impractical because it lacks the ability to identify relevant windows, which hinders reliable ranking of inter-window proposals.
Contrastive learning is a feasible way to solve this issue. In the scenario of temporal grounding, we expect the model to recognize negative windows by giving lower confidence (saliency) scores to the proposals (frames) residing in them:
$$p^{+} > p^{-}, \qquad S(W^{+}, Q) > S(W^{-}, Q),$$
where $p^{+/-}$ denotes the positive/negative proposal score, $W^{+}$ and $W^{-}$ denote the positive and negative windows, and $S(\cdot)$ is the saliency scoring function (§ 2.2). We then design two contrastive losses: (1) a proposal-level loss and (2) a frame-level loss, both computed with a randomly sampled negative window.
The proposal-level contrastive loss $\mathcal{L}_p$ is formulated as a classification loss for predicting whether the window is relevant to the NL query. For the frame-level loss $\mathcal{L}_f$, we require the average saliency score of frames located in the positive window to be larger than the maximum saliency score of frames in the negative window by a margin $\delta$:
$$\mathcal{L}_f = \max\Big(0,\; \delta + \max_{j \in W^{-}} S_j(V, Q) - \mathrm{mean}_{j \in W^{+}} S_j(V, Q)\Big).$$
The overall contrastive loss is $\mathcal{L}_c = \mathcal{L}_p + \mathcal{L}_f$. For the positive window, we also add the moment localization loss with L1 and IoU losses as in Moment-DETR.
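The following sketch illustrates the two losses. The frame-level hinge follows the description above; the exact form of the proposal-level classification loss is not spelled out in the text, so the cross-entropy version below is only one plausible instantiation, with the class convention (0 = match, 1 = no match) assumed.

```python
import torch
import torch.nn.functional as F

def frame_level_loss(sal_pos, sal_neg, margin: float = 0.2):
    """L_f (sketch): the mean saliency of frames in the positive window should exceed
    the max saliency of frames in the sampled negative window by a margin.
    sal_pos / sal_neg: 1-D tensors of predicted frame saliency scores."""
    return F.relu(margin + sal_neg.max() - sal_pos.mean())

def proposal_level_loss(cls_logits_pos, cls_logits_neg, matched_idx):
    """L_p (one plausible form, an assumption): cross-entropy asking proposals in the
    negative window, and unmatched proposals in the positive window, to predict
    'no match' (class 1), while the matched proposal predicts 'match' (class 0)."""
    labels_pos = torch.ones(cls_logits_pos.size(0), dtype=torch.long)
    labels_pos[matched_idx] = 0
    labels_neg = torch.ones(cls_logits_neg.size(0), dtype=torch.long)
    return F.cross_entropy(cls_logits_pos, labels_pos) + \
           F.cross_entropy(cls_logits_neg, labels_neg)
```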

Intra-window Fine-grained Ranking
Moment-DETR exploits the self-attention mechanism to perform multi-modal fusion over the sequence of video and text features. However, as the length of the video input increases, the fine-grained attention between each video frame and the text query is weakened by the many other perturbing frames, resulting in contextual information loss. Yet the fine-grained content (e.g., objects and scenes) in each frame plays an essential role in VTG.
To remedy this issue, we propose a novel ranking strategy that enhances fine-grained multi-modal alignment with a matching score computed by a contrastive vision-text pre-trained model (described in § 3.2). Specifically, we first pre-compute the features of the video frames and the text query with the contrastive pre-trained model as in Eq. (1).
Visual Adapter. With a lightweight visual adapter on top of CLIP, we adopt adapter-based tuning to adapt the representations from the general contrastive model to the data distribution of the downstream task. Inspired by Gao et al. (2021), our main idea is to add an additional bottleneck layer that learns task-adaptive visual features and blends them with the original pre-trained features in a residual style. The lightweight adapter consists of a 2-layer FFN followed by ReLU. The $i$-th adapted visual feature is $\tilde{v}_i = \mathrm{Adapter}(v_i) + v_i$. The proposal feature of the $j$-th proposal is then computed by mean pooling over all adapted video features within it: $h_j = \mathrm{Mean}([\tilde{v}_{b_j}, \ldots, \tilde{v}_{e_j}])$. For adapter training, we treat the ground-truth proposal as the positive proposal (with feature $h_{pos}$), and the other proposals in the same batch as negatives. We follow standard contrastive learning and use the NCE loss.
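A minimal sketch of the adapter, the proposal pooling, and an InfoNCE-style adapter loss is given below; the hidden dimension, temperature, and in-batch negative handling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAdapter(nn.Module):
    """Lightweight adapter over pre-extracted CLIP frame features (sketch)."""
    def __init__(self, d: int = 512, hidden: int = 128):
        super().__init__()
        # 2-layer bottleneck FFN with ReLU, as described above
        self.ffn = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d), nn.ReLU())

    def forward(self, v):            # v: (L, d) frame features
        return self.ffn(v) + v       # residual-style blending with the original features

def proposal_feature(adapted, start: int, end: int):
    """h_j: mean-pool the adapted frame features inside a proposal's span."""
    return adapted[start:end].mean(dim=0)

def adapter_nce_loss(text_cls, h_pos, h_negs, tau: float = 0.07):
    """InfoNCE (sketch): the query feature should score higher with its ground-truth
    proposal feature than with in-batch negative proposal features."""
    logits = torch.cat([h_pos.unsqueeze(0), h_negs]) @ text_cls / tau   # (1 + n_neg,)
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```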
Ranking Score Computation. Finally, we conduct fine-grained ranking of the proposals within a window. For the $j$-th proposal, the final ranking score fuses two components: (1) the proposal score $p_j$ generated by Moment-DETR and (2) the fine-grained matching score $m_j = h_j \cdot q_{[\mathrm{CLS}]}$ computed with the CLIP-based proposal feature. The former models the correlation between proposals through the Transformer-based architecture, while the latter focuses on fine-grained content matching between the frames in the proposal and the text query. We perform min-max normalization on both types of scores over the $N_p$ candidate proposals for a more stable ranking. The final ranking score $r_j$ is the sum of the two normalized scores:
$$r_j = \mathrm{norm}(p_j) + \mathrm{norm}(m_j), \qquad \mathrm{norm}(x_j) = \frac{x_j - \min_{k \le N_p} x_k}{\max_{k \le N_p} x_k - \min_{k \le N_p} x_k},$$
where $N_p$ is the total number of candidate proposals.
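The fusion itself is a few lines, sketched below under the assumption that higher scores are better for both components:

```python
import torch

def minmax(x: torch.Tensor) -> torch.Tensor:
    """Min-max normalize a vector of scores over the N_p candidate proposals."""
    return (x - x.min()) / (x.max() - x.min() + 1e-6)

def fuse_ranking_scores(proposal_scores: torch.Tensor,
                        matching_scores: torch.Tensor) -> torch.Tensor:
    """r_j = normalized Moment-DETR proposal score + normalized CLIP-based
    fine-grained matching score (sketch of the fusion described above)."""
    return minmax(proposal_scores) + minmax(matching_scores)
```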

Experiments
We conduct experiments to explore the effectiveness of CONE from the following aspects: (1) comparison with state-of-the-art methods (§ 4.3); (2) an ablation study analyzing the impact of each component and different variants (§ 4.4); (3) an efficiency analysis of the acceleration brought by pre-filtering (§ 4.5); and (4) a qualitative analysis (§ 4.6). Implementation details are given in Appendix A.

Dataset
We conduct comprehensive experiments on two representative large-scale benchmarks for long video temporal grounding: Ego4d-NLQ (Grauman et al. 2022) and MAD (Soldan et al. 2022). The data statistics are summarized in Table 1.
Ego4d-NLQ is a subtask of the Ego4d dataset. Ego4d is a large-scale egocentric video understanding benchmark in which 931 camera wearers worldwide record their daily activities across hundreds of scenarios. The videos involved have varying lengths ranging from 3.5 min to 20 min. The NL queries are designed to retrieve relevant moments from the episodic memory of camera wearers, and involve 13 question types for locating different types of information.
MAD is a large-scale long video temporal grounding benchmark whose videos (ranging from 47 min to 202 min, with an average length of 110.8 min) come from full-length cinema movies. The text queries in the training set are derived from movie audio descriptions by professional narrators, while the NL queries in the evaluation set are derived from the LSMDC data (Rohrbach et al. 2017), with higher quality and more precise temporal boundaries.

Experimental Settings
Evaluation Metric. Following the standard VTG setup, we adopt the metric Recall@k at IoU = θ. Recall@k (R@k) at IoU = θ is the percentage of queries having at least one prediction, among the top-k predictions, whose temporal IoU with the ground truth is larger than the threshold θ (0.3 or 0.5). Note that there is only one ground-truth answer per query in both datasets.
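For clarity, a straightforward (assumed) implementation of R@k at IoU = θ over (start, end) spans could look like this:

```python
def temporal_iou(pred, gt):
    """IoU of two temporal spans (start, end), e.g., in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, ground_truths, k=1, iou_thresh=0.3):
    """R@k at IoU=theta: fraction of queries whose top-k predicted spans contain
    at least one span with IoU >= theta against the single ground-truth moment.
    predictions: per-query list of ranked (start, end) spans; ground_truths: (start, end)."""
    hits = sum(
        any(temporal_iou(p, gt) >= iou_thresh for p in preds[:k])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)
```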

Model Comparison
Baselines. We compare CONE to the following methods: (1) 2D-TAN (Zhang et al. 2020b) converts visual features into a 2D feature map, fuses the query feature with each proposal feature via a temporal adjacent network, and finally predicts for each proposal its real IoU value. (2) VSLNet (Zhang et al. 2020a) exploits a context-query attention module to fuse video and query token features, and adopts a conditioned span predictor with a query-guided highlighting module to directly compute the probabilities of the start/end boundaries without proposals. (3) CLIP (Radford et al. 2021) first generates pre-defined proposals with a sliding window, and computes the dot product of the query feature and the mean proposal visual feature as the proposal score. (4) VLG-Net (Soldan et al. 2021) uses graph convolution networks to model both the individual modalities and the aggregation of cross-modal context.

Results on Ego4d-NLQ. Table 2 reports the performance comparison on the validation set and the blind test set of Ego4d-NLQ. In terms of all metrics, CONE outperforms these baselines by a large margin. In terms of R1@IoU=0.3 and R5@IoU=0.3, the absolute performance gains are +3.31% and +11.49% on the validation set, and +3% and +6.92% on the blind test set, respectively. CONE also achieves consistent performance gains using different types of features (… → 16.11% improvement on R1@IoU=0.3 and R5@IoU=0.3), showing better generalization ability.

Results on MAD. The main results on MAD are shown in Table 3.
From the two tables, the state-of-the-art performance on both benchmarks demonstrates the effectiveness of CONE in long video temporal grounding and verifies the importance of coarse-to-fine multi-modal alignment. We speculate that the outstanding results are brought by the following reasons: (1) CONE processes the entire video without decreasing the sampling rate, which alleviates temporal information loss; (2) the fine-grained ranking brings better matching of relevant content in every frame and reduces contextual information loss.

Model Analysis
Ablation Study. Ablation studies are conducted to unveil the effectiveness of each component in CONE. We consider three components: (a) the contrastive loss (§ 3.3), which is eliminated by training only on positive windows without our contrastive loss; (b) the fine-grained ranking fusion (§ 3.4), which is removed by using only the proposal score from Moment-DETR for ranking; and (c) the visual adapter (§ 3.4), which is removed by using the general CLIP-based features of Eq. (1) for relevance score computation. The results are shown in Table 4. Note that the full CONE model refers to the first row, and the baseline model (i.e., Moment-DETR built with CLIP-based features) refers to the last row. From the table, we highlight the following findings:
• Ablating the visual adapter leads to a performance drop from 14.15% to 12.62%, which indicates that domain adaptation of the visual features is essential for modelling task-specific semantic variance.
• Eliminating the fine-grained ranking also harms performance (row 2 vs. row 3), showing that capturing fine-grained semantic alignment benefits accurate grounding.
• Further removing the contrastive loss (row 3 vs. row 4) leads to a significant performance drop of 4.47% on the Ego4d-NLQ dataset, which reveals that identifying the inter-window semantic variance is critical for reliable proposal ranking.

Table 4: Cumulative ablation study on the validation sets of Ego4d-NLQ and MAD, taking R1@IoU=0.3 as the metric.
Figure 3: Performance (R@1, R@10, R@50) w.r.t. window length on Ego4d-NLQ and MAD.

Influence of window length. Figure 3 shows how different window lengths affect the overall performance of CONE. We observe that changing the window length indeed brings performance variance. A longer window (more visible frames in a single window) can model the interaction between more frames, but can also weaken the multi-modal attention between the NL query and each single frame; indeed, performance drops significantly at the largest window size. Through this analysis, we find a good trade-off between window length and performance, i.e., roughly 48 seconds (90 video features) for Ego4d-NLQ and roughly 25 seconds (125 video features) for MAD.

Efficiency Analysis
Figure 4 shows the overall performance with respect to different numbers of windows kept after the pre-filtering stage on the validation sets of Ego4d-NLQ and MAD. Theoretically, the inference time of proposal generation and ranking decreases linearly with a smaller pre-filtered window number, so a trade-off between efficiency and performance is needed. We observe that the performance of CONE is relatively stable once the pre-filtering number reaches 10 for both datasets. This observation shows that our pre-filtering and fine-grained ranking strategies enable powerful multi-modal alignment for accurate ranking and largely improve the efficiency of long video temporal grounding. The average window numbers for full videos (before filtering) are 23.3 for Ego4d-NLQ and 588 for MAD. If we set the pre-filtered window number to 10 for Ego4d-NLQ and 30 for MAD (the better trade-off values obtained from Fig. 4), the total number of windows processed during inference is reduced by 2.3x and 19.6x, with a marginal performance variance (R@1) of -0.1% and +0.2% for Ego4d-NLQ and MAD respectively. In practice, we also account for the implementation and the pre-filtering time: when CONE runs inference on one P100 GPU, its running time (excluding feature extraction and post-processing) is largely reduced, from 80 s to 39 s (2x) on the validation set of Ego4d-NLQ and from 276 min to 18 min (15x) on the validation set of MAD.

Qualitative Analysis
We conduct a qualitative analysis using the examples shown in Fig. 5 to study the effect of contrastive learning and fine-grained ranking.
Example A compares the pure Moment-DETR (a) and the model trained with inter-window contrastive learning (b). It can be observed that the model trained with contrastive learning is capable of discriminating the semantically irrelevant window by giving it a lower score than the positive window, and it successfully ranks the correct moment 1st, while Moment-DETR gives an equally high score to the proposal in the negative window and wrongly ranks the correct prediction 8th.
Example B compares the predictions of the full CONE (c) and CONE without fine-grained ranking (b). The most essential clue in the text query is fine-grained content (e.g., "vegetable"). The example reveals that, with fine-grained ranking, CONE better aligns the fine-grained content in the video frames with the text query, while the ablated model fails to locate the key visual information and suffers from contextual information loss.

Related Work
Video Temporal Grounding. Early works on the VTG task adopt a two-stage propose-then-rank pipeline, where the proposals are generated by sliding windows (Gao et al. 2017; Anne Hendricks et al. 2017) or a proposal generation module (Xu et al. 2019; Chen and Jiang 2019). Emerging works explore end-to-end trainable methods, which can be further divided into two categories according to the absence or presence of proposals. Proposal-free methods directly predict start/end timestamps without generating proposals: some directly regress the boundary (Yuan, Mei, and Zhu 2019; Ghosh et al. 2019), while others predict the start/end positions of the moment boundary (Zhang et al. 2020a). Propose-and-rank methods integrate proposal generation and ranking into an end-to-end framework: some works adopt an anchor-based framework (Chen et al. 2018; Zhang et al. 2020b; Soldan et al. 2021), while others use moment queries with a decoder module to produce a fixed number of proposals (Lei, Berg, and Bansal 2021). Readers can refer to the excellent survey by Zhang et al. (2022). Nevertheless, these approaches typically tackle relatively short videos (0.5 to 2.5 minutes on average), while the long videos in our challenging setting can last from several minutes to three hours, so directly adopting them leads to severe information loss.
Long-form Video Modeling. Long-form video modeling is an emerging challenge due to the urgent need for it in real-world applications, and has recently been investigated in action classification (Wu et al. 2019; Wu and Krahenbuhl 2021; Islam and Bertasius 2022), temporal action localization (Feng Cheng 2022), and video retrieval (Lin et al. 2022b). The common challenges of long videos are modeling long-range temporal dependency, efficiency, and accurate multi-modal alignment (if language is involved).
Previous methods for these tasks explore a feature memory bank (Wu et al. 2019; Feng Cheng 2022) to learn temporal dependency by merging past and future features from the memory bank. Other methods extract object-level representations and capture long-range interactions between tracked objects (Wu and Krahenbuhl 2021), or use a structured state-space sequence layer to model temporal dependency (Islam and Bertasius 2022). Recently, the challenge of long video temporal grounding has been posed (Grauman et al. 2022; Soldan et al. 2022), but related methods remain less explored. Unlike the methods for other tasks that focus on long-range dependency, our work aims to flexibly handle long videos with higher efficiency and effectiveness, and proposes a novel coarse-to-fine alignment framework to alleviate information loss.

Contrastive Learning. Contrastive learning was originally designed to enhance representation learning by discriminating relevant object representations from irrelevant ones (Misra and Maaten 2020; He et al. 2020). CLIP (Radford et al. 2021) adopts contrastive pre-training to model the relevance of image-text pairs, and has been shown to excel at multi-modal alignment (Luo et al. 2021). Some works (Chen et al. 2021; Wang et al. 2021) use contrastive learning to help the model discriminate differences between samples via carefully designed contrastive losses. Our work concurrently adopts a contrastive pre-trained model for multi-modal alignment, and designs a contrastive learning mechanism to identify inter-window semantic variance.

Conclusions
We present CONE, a window-centric COarse-to-fiNE alignment framework for efficient and effective long video temporal grounding. CONE enables flexible processing of arbitrarily long video inputs with a window-centric slicing strategy. The introduced inter-window contrastive learning mechanism helps the model semantically discriminate relevant windows, and the inter-window pre-filtering strategy accelerates inference. We also introduce a fine-grained ranking strategy to strengthen fine-grained multi-modal alignment for accurate intra-window proposal ranking. CONE achieves state-of-the-art results and brings significant performance boosts on two large-scale temporal grounding benchmarks for long videos. Analysis shows that the pre-filtering strategy accelerates inference by 2x and 15x on the Ego4d-NLQ and MAD benchmarks respectively while keeping state-of-the-art performance.

Figure 1: An example of long video temporal grounding, which requires coarse-to-fine multi-modal alignment.

Figure 2: An overview of our coarse-to-fine framework CONE. CONE first slices the video with a sliding-window approach (§ 3.1). At the coarse-grained level, it accelerates inference with a pre-filtering mechanism (§ 3.2) and enhances training with inter-window contrastive learning (§ 3.3). At the fine-grained level, it enhances fine-grained multi-modal alignment for accurate proposal ranking (§ 3.4).

Figure 4: Performance (R@k at IoU=0.3) on both datasets w.r.t. different window numbers for the efficiency analysis. Inference time decreases with the window number.

Figure 5: Two examples from the Ego4d-NLQ dataset. We study three settings: (a) baseline Moment-DETR; (b) CONE w/o fine-grained ranking; and (c) the full CONE. The major difference between (a) and (b) is inter-window contrastive learning.

Table 1: Statistics of the two benchmarks. $N_{vocab}$ is the word vocabulary size, $L_{video}$ denotes the average video length, $L_{query}$ denotes the average number of words per query, and $L_{moment}$ and $\delta_{moment}$ are the average and median ground-truth moment lengths.

Table 3: Results on the test set of the MAD dataset. Results of CLIP and VLG-Net are reported by Soldan et al. (2022).