TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain massive visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.


Introduction
Video-language modeling aims to learn semantic alignment between video and language in a joint representation space (Xu et al., 2021; Lei et al., 2021) to facilitate downstream tasks including text-video retrieval, video question answering (VideoQA), and video captioning. Unlike text, which can be represented concisely as a sequence of words with dense semantics, video input consists of much longer sequences due to its 3D properties and the redundancy in space-time information (He et al., 2021; Tong et al., 2022). In fact, the number of visual tokens processed by Transformer-based models (Fu et al., 2021; Cheng et al., 2022; Ye et al., 2022; Li et al., 2021a; Wang et al., 2022b) can be over 150× larger than the number of text tokens. This poses an efficiency bottleneck for video-language understanding, especially for long-form videos lasting more than 30 seconds (Wu and Krähenbühl, 2021; Sun et al., 2022).
To encode long videos within limited computing budgets, previous approaches can be broadly categorized into two types: (1) Sparse Sampling (Lei et al., 2021; Sun et al., 2022; Lei et al., 2022). This method reduces the number of visual tokens by sampling very few frames from the raw video. However, sparse sampling sacrifices rich temporal dynamics and storyline information, which limits model performance. (2) Offline Encoding (Luo et al., 2021; Bain et al., 2022). This allows processing more frames within the same computation budget by constraining the interaction between visual tokens. It first uses an off-the-shelf image encoder (Dosovitskiy et al., 2020; Radford et al., 2021) to encode each frame independently, then uses a temporal module to aggregate all the frame features. However, the frame features encoded offline may not be well adapted to downstream tasks in various domains. Additionally, the post-aggregation mechanism also prohibits the full fusion of frame features (Cheng et al., 2022). Considering that both sufficient input frames and full temporal-spatial modeling in an end-to-end manner are pivotal for optimal performance, a natural question arises: are there better approaches to achieve efficient video encoding without compromising on either of these aspects?
In this paper, we propose an efficient method named TEmporal-Spatial Token Aggregation (TESTA), inspired by Token Merging (ToMe) (Bolya et al., 2022). Specifically, TESTA samples input frames densely, but progressively aggregates similar visual tokens during video encoding to reduce the token number and computational overhead. As shown in Fig. 1, our aggregation operates separately in the temporal and spatial dimensions, allowing for the merging of similar frames as well as similar patches within each frame. This reduces ToMe's complexity from O((T · H/16 · W/16)²) to O(T² + (H/16 · W/16)²), making it more efficient for encoding longer videos. After aggregation, around 75% of the visual tokens can be reduced, and video encoding is thus accelerated. To achieve this, we use a bipartite matching algorithm: we select a set of tokens, find their most similar counterparts from the remaining set, and aggregate the features of each pair through mean pooling. This aggregation-based mechanism has three advantages. First, it introduces no additional parameters and is amenable to parallelism, which significantly improves training and inference efficiency. Second, our method (1) adaptively condenses video semantics rather than directly discarding input information and (2) retains full end-to-end spatiotemporal fusion, both of which ensure performance. Third, compared to convolution-based feature down-sampling methods (Liu et al., 2021; Li et al., 2021c), our aggregation trajectory can be easily tracked and recovered. The aggregated tokens often correspond to higher-level semantics (e.g., objects, scenes, and events), making them more interpretable and even groundable in language.
Building upon TESTA, we design a pre-trained video-language model with a temporal and spatial token aggregation module in each video encoder block. We evaluate our model on paragraph-to-video retrieval and long-form VideoQA tasks. When using an equal number of input frames, our model improves computing efficiency by 1.7 times while maintaining comparable performance. When accessing more frames, our model exhibits strong scalability and achieves significant performance gains compared to previous state-of-the-art methods (e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie).

Related Work
Video-Language Pre-trained Models. Benefiting from large-scale video-text datasets (Bain et al., 2021; Xue et al., 2021) and advances in Transformer model design (Gorti et al., 2022; Ren et al., 2021; Fu et al., 2021; Zellers et al., 2021; Wang et al., 2022a), pre-trained Video-Language Models (VidLMs) (Chen et al., 2022; Sun et al., 2022; Cheng et al., 2022) have demonstrated impressive performance on video-language understanding tasks. VidLMs typically comprise a video encoder and a text encoder, which encode video-text pairs into a shared feature space to learn the semantic alignment between video and language. Additionally, a text decoder can be added after the video encoder for tasks such as video captioning and VideoQA (Yan et al., 2022; Zhang et al., 2020).
Efficient Video Transformer. A Transformer-based video encoder typically patchifies each video into massive numbers of visual tokens, which causes prohibitive computation costs for full self-attention with its quadratic complexity. Therefore, research on efficient video Transformers has always been active. Representative works like TimeSformer (Bertasius et al., 2021) and ViViT (Arnab et al., 2021) propose to factorize the spatial and temporal dimensions of the input, then separately apply spatial and temporal attention. Video Swin Transformer (Liu et al., 2021) keeps joint temporal-spatial attention but restricts it within a local 3D window. Orthogonal to these advances in efficient Transformer architectures, our TESTA aggregates token features along the spatial and temporal dimensions, which reduces the size of the input features for each Transformer block and can further boost the efficiency of video encoding.
Feature Aggregation in Video Transformers.
Existing feature aggregation methods can be broadly categorized into two branches. Temporally, frame features can be encoded by a pre-trained image encoder and aggregated using self-attention, joint attention, or mean pooling for post-hoc temporal modeling (Bain et al., 2022; Luo et al., 2021). Spatially, previous work has explored merging similar patches within an image or aggregating tokens into additional proxy tokens (Bolya et al., 2022; Shi et al., 2023; Cao et al., 2023; Xu et al., 2022; Ryoo et al., 2021; Marin et al., 2021). In contrast, we propose a unified mechanism to simultaneously aggregate frames and patches. Our method gradually aggregates features during video encoding, improving efficiency while ensuring sufficient interaction between features in both space and time.

Method
In this section, we first introduce our video-language pre-trained model and its architecture in § 3.1. To improve the efficiency of encoding long-form videos, we propose a novel temporal-spatial token aggregation mechanism (§ 3.2). Finally, we present the pre-training objectives in § 3.3.

Model Architecture
Inspired by prevalent VidLMs (Li et al., 2022, 2021b), our model consists of three encoders and one decoder for video-language representation learning. Figure 2 shows the model architecture.
Text Encoder. The text encoder is a uni-modal encoder similar to BERT (Devlin et al., 2019). A [CLS] token is prepended to the input text to represent its global feature.
Video-grounded Text Encoder. This is a cross-modal encoder. Compared to the uni-modal text encoder, we add a cross-modal module to each encoder layer to enable information flow from video to language. We insert an [ENC] token before the input text to condense the cross-modal information from both video and language.
Video-grounded Text Decoder. This is a cross-modal decoder with causal self-attention for autoregressive text generation.
Video Encoder. This is a uni-modal encoder. Given a raw video, the visual input V ∈ R^{T×H×W×3} is a sequence of T RGB frames of size H × W sampled from the video. Each frame is split into L non-overlapping patches following ViT (Dosovitskiy et al., 2020). To represent the global video feature, an additional [CLS] token is also used. Our video encoder is similar to TimeSformer (Bertasius et al., 2021) with Divided Space-Time Attention. Specifically, each video encoder block captures the temporal relations across frames using Temporal Attention and fuses the spatial information of objects, scenes, etc., within each frame using Spatial Attention. In contrast to TimeSformer, we improve the efficiency of video encoding by equipping each video encoder block with a Temporal Aggregation Module and a Spatial Aggregation Module, which we introduce in § 3.2.

Temporal-Spatial Token Aggregation
Videos have heavy spatiotemporal redundancy (He et al., 2021; Tong et al., 2022). On one hand, some activities (e.g., conversations) can persist across multiple frames with little visual variation. On the other hand, some scenes, such as backgrounds, often contain numerous indistinguishable patches in each frame. Aggregating these similar frames and patches can simplify video feature representation and accelerate video encoding. Accordingly, we introduce a Temporal Aggregation Module (TAM) and a Spatial Aggregation Module (SAM), i.e., the yellow modules in Figure 2. At each aggregation step, TAM reduces the number of frames by R_T while SAM reduces the number of patches per frame by R_S, where R_T and R_S are hyper-parameters controlling the tradeoff between performance and efficiency. TAM and SAM are incorporated into each block of the video encoder, aggregating tokens progressively to reduce their number. For the i-th Transformer block, let V ∈ R^{T_i × L_i × D} denote the input video feature, where T_i, L_i, and D are the number of frames, the number of patches per frame, and the dimension of the token feature, respectively. The output video feature after temporal and spatial token aggregation is V′ ∈ R^{(T_i − R_T) × (L_i − R_S) × D}, which has a smaller size and reduces the computing burden for subsequent blocks. After the forward pass through M encoder blocks, the final number of visual tokens is reduced to (T − M·R_T) × (L − M·R_S).
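As a concrete check of this reduction schedule, the sketch below (our illustration in plain Python, not the authors' code; the clamping to a minimum of one token is our assumption) tracks the token counts through the encoder for the paper's 96-frame configuration (R_T = 4, R_S = 8) with 12 blocks and 196 patches per frame:

```python
def token_counts(T, L, M, R_T, R_S):
    """Frames and patches-per-frame after each of M encoder blocks,
    with TAM removing R_T frames and SAM removing R_S patches per
    frame in every block (clamping at 1 is our assumption)."""
    trace = [(T, L)]
    for _ in range(M):
        T, L = max(T - R_T, 1), max(L - R_S, 1)
        trace.append((T, L))
    return trace

# 96-frame configuration: R_T = 4, R_S = 8, 12 blocks, 196 patches/frame
trace = token_counts(96, 196, 12, 4, 8)
start = trace[0][0] * trace[0][1]      # 96 * 196 = 18816 visual tokens
end = trace[-1][0] * trace[-1][1]      # 48 * 100 = 4800 visual tokens
print(f"{start} -> {end} tokens ({1 - end / start:.0%} reduced)")
```

With this configuration the token count drops from 18,816 to 4,800, which is roughly the 75% reduction reported in the abstract.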

Objects for Aggregation
Our video encoder based on TESTA involves two types of tokens for aggregation: patch tokens and frame tokens. Recall that each frame is divided into a sequence of patches, which are treated as patch tokens. To ensure a formally unified aggregation algorithm, we define frame tokens as pseudo tokens that represent each frame by averaging all the patch tokens within it. When merging two frame tokens, the corresponding L patches of the two frames are merged position-wise as well. As our aggregation strategy is agnostic to the token type, we refer to both patch tokens and frame tokens as "tokens" throughout the rest of the paper, without loss of generality.
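A minimal numerical sketch of the pseudo frame tokens (our NumPy illustration; the position-wise merging of the underlying patch sequences is our reading of the design, not released code):

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, D = 4, 6, 8                       # frames, patches per frame, dim
patches = rng.standard_normal((T, L, D))

# A frame token is a pseudo token: the mean of its L patch tokens.
frame_tokens = patches.mean(axis=1)     # shape (T, D)

# Merging frames 0 and 1: average the frame tokens, and merge the
# two patch sequences position-wise.
merged_frame = (frame_tokens[0] + frame_tokens[1]) / 2
merged_patches = (patches[0] + patches[1]) / 2       # shape (L, D)

# Consistency: the mean of the merged patches recovers the merged
# frame token, so frame-level and patch-level merging agree.
assert np.allclose(merged_patches.mean(axis=0), merged_frame)
```

This consistency is what makes a single aggregation algorithm applicable to both token types.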

Aggregation Strategy
Recall that given a sequence of N tokens, our target is to reduce R tokens after each aggregation operation. To achieve this, we could greedily merge the two most similar tokens and repeat R times, or merge the N tokens into N − R clusters using clustering algorithms such as k-means (Lloyd, 1982). However, these iteration-based methods are not suited for parallelism and can slow down encoding speed (Bolya et al., 2022). Therefore, we resort to a bipartite matching method. We first partition the N tokens into two disjoint sets A and B with R and N − R tokens, respectively. The R tokens in set A are carefully selected as the tokens to be reduced. For each token in set A, we find its most similar token in set B, then merge the two by averaging their features. As a result, the remaining N − R tokens in set B form a new sequence as the output.
For similarity calculation, we use the attention keys (K) of tokens as features and measure their similarity using cosine similarity. The attention keys contain summarized information intended for use in QKV self-attention, yielding accurate similarity measures (Bolya et al., 2022).
In practice, we introduce two aggregation algorithms, i.e., importance-based aggregation and geometry-based aggregation.
Importance-based Aggregation. In this algorithm, we place the R least important tokens into set A for aggregation, so as to minimize the negative effects of token reduction. The importance of token x_i is measured by the score function S_i, defined as the total attention it receives from the other tokens:

S_i = Σ_{j=1, j≠i}^{N} A_{ji},  where A = softmax(QK^T / √D),

A_{ji} is the attention score from token x_j to token x_i, and Q and K denote the Queries and Keys in self-attention, respectively.
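The score can be sketched as follows (our NumPy illustration; the row-wise softmax normalization and the exclusion of the self-attention term A_ii follow our reading of the formula, not the authors' code):

```python
import numpy as np

def importance_scores(Q, K):
    """S_i = sum over j != i of A_ji: the total attention token i
    receives from the other tokens, with A = softmax(Q K^T / sqrt(D))
    applied row-wise."""
    D = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(D)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    return A.sum(axis=0) - np.diag(A)              # drop self term A_ii

rng = np.random.default_rng(0)
N, D = 10, 16
Q = rng.standard_normal((N, D))
K = rng.standard_normal((N, D))
S = importance_scores(Q, K)
set_A = np.argsort(S)[:3]          # the 3 least important tokens to reduce
```

Tokens with low S_i are attended to the least, so merging them away should perturb the remaining representation the least.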
Geometry-based Aggregation. In practice, we observe that adjacent tokens tend to have higher similarity and should be merged. However, adjacent tokens also have similar importance scores and are thus prone to being grouped into the same set under the importance-based strategy, which prevents their aggregation. To address this issue, we partition the N tokens in an alternating fashion inspired by Bolya et al. (2022), assigning adjacent tokens to different sets A and B. As shown in the left panel of Figure 2, for each token x_i^(A) in set A, we find its most similar token x_j^(B) in set B to construct a pair (x_i^(A), x_j^(B)) and record their similarity. We then select the R pairs with the greatest similarity and merge the two tokens in each of these top-R pairs. Finally, we concatenate the tokens in the two sets back into one sequence as the output.
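A simplified one-step sketch of this geometry-based strategy (our NumPy illustration; the odd/even alternating partition and the running-mean pooling are assumptions in the spirit of ToMe, not the authors' implementation):

```python
import numpy as np

def geometry_merge(tokens, R):
    """One geometry-based aggregation step: alternate tokens into set A
    (odd positions) and set B (even positions), pair each A-token with
    its most cosine-similar B-token, merge the top-R pairs by mean
    pooling, and keep everything else."""
    N, _ = tokens.shape
    A_idx = np.arange(1, N, 2)                 # set A: candidates to reduce
    B_idx = np.arange(0, N, 2)                 # set B: merge targets
    unit = tokens / np.linalg.norm(tokens, axis=-1, keepdims=True)
    sim = unit[A_idx] @ unit[B_idx].T          # cosine similarity, |A| x |B|
    best = sim.argmax(axis=1)                  # most similar B per A-token
    top = np.argsort(-sim.max(axis=1))[:R]     # A-positions of top-R pairs
    merged = tokens[B_idx].copy()
    counts = np.ones(len(B_idx))
    survivors = []
    merge_set = {int(p) for p in top}
    for pos, a in enumerate(A_idx):
        if pos in merge_set:                   # merge into its B partner
            b = best[pos]
            merged[b] = (merged[b] * counts[b] + tokens[a]) / (counts[b] + 1)
            counts[b] += 1
        else:                                  # unmerged A-tokens survive
            survivors.append(tokens[a])
    return np.vstack([merged] + survivors) if survivors else merged

rng = np.random.default_rng(0)
out = geometry_merge(rng.standard_normal((10, 16)), R=3)
print(out.shape)                               # 10 tokens reduced to 7
```

Each call removes exactly R tokens, so stacking one call per encoder block yields the progressive reduction described above.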
The above aggregation algorithms are parameter-free and can be easily plugged into a Transformer-based video encoder. We conduct aggregation during both training and testing. Although the token similarity calculation brings additional computing overhead, it is negligible compared to the efficiency gained by reducing the token number.

Novelty over Token Merging
Our work is inspired by Token Merging (ToMe) (Bolya et al., 2022), which also proposes to reduce video tokens by merging similar ones. However, we differentiate ourselves from ToMe in two significant ways. Video Token Definition. ToMe uses joint space-time tokens (2 × 16 × 16 cubes), while our TESTA defines frame tokens (representing entire frames) and patch tokens (16 × 16 2D patches) for decoupled aggregation. This tailored token design is more efficient for modeling long-form videos.
Aggregation Method. ToMe performs global aggregation over all tokens, resulting in a complexity of O((T · H/16 · W/16)²). This becomes impractical for long-form video and causes out-of-memory issues beyond 16 frames. In contrast, TESTA uses divided aggregation in time and space, reducing the complexity to O(T² + (H/16 · W/16)²). This allows efficient encoding of much longer videos (more than 128 frames under the same computation quota). The divided scheme also better captures spatial and temporal semantics, resulting in improved performance on long-form video understanding tasks (to be shown in § 4.7).
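The gap between the two complexities is easy to quantify with raw token counts (this back-of-envelope arithmetic is ours, dropping constants and hidden dimensions, and assuming 224×224 frames with 16×16 patches):

```python
P = (224 // 16) ** 2                   # 196 patches per 224x224 frame
ratios = {}
for T in (16, 32, 96):
    joint = (T * P) ** 2               # ToMe: one global merge over T*P tokens
    divided = T ** 2 + P ** 2          # TESTA: separate temporal/spatial terms
    ratios[T] = joint / divided
    print(f"T={T:3d}: joint {joint:.1e} vs divided {divided:.1e} "
          f"({ratios[T]:.0f}x gap)")
```

The ratio grows rapidly with T, which is consistent with joint aggregation hitting out-of-memory first as the frame count increases.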

Pre-training Objectives
We use three classic pre-training objectives: video-text contrastive loss, video-text matching loss, and captioning loss. Please refer to Appendix A for more details.

Implementation Details
To pre-train our TESTA model, we start by initializing it with the BLIP (12-layer ViT-B/16) checkpoint (Li et al., 2022), with the exception of the temporal attention, whose weights are copied from the spatial attention. We use around 5M image-text and video-text pairs from two datasets for pre-training. See Appendix A for more details.
For downstream fine-tuning, we uniformly sample either 32 or 96 frames, each with a resolution of 224 × 224 pixels (196 patches per frame with a patch size of 16). To achieve approximately a 50% reduction in computation cost, we employ different hyper-parameters for aggregation. Specifically, for 96-frame inputs, we set R_T to 4 and R_S to 8, while for 32-frame inputs, R_T is 1 and R_S is 12. We use geometry-based aggregation by default since it achieves better performance. Please refer to Appendix B for more fine-tuning details.

Downstream Task Setups
We finetune and evaluate TESTA on two downstream tasks: paragraph-to-video retrieval and long-form VideoQA.

Paragraph-to-Video Retrieval
Table 1 demonstrates the performance of TESTA on two challenging and under-explored paragraph-to-video retrieval datasets, QuerYD and Condensed Movie, which involve videos with lengthy durations (over 200 seconds on average). For 32-frame video inputs, TESTA achieves a Recall@1 of 77.0 on QuerYD and 21.5 on Condensed Movie, surpassing previous SOTA methods by 7.3 and 3.1, respectively. In terms of computational complexity, TESTA exhibits significantly lower GFLOPs (420) compared to Frozen (Bain et al., 2021) and VINDLU (Cheng et al., 2022). While LF-VILA (Sun et al., 2022) operates with even fewer GFLOPs (298), it requires feature aggregation within a fixed local window, which can potentially undermine semantic integrity after condensation.
The results are reported on QuerYD with 96 frames. Avg. represents the average recall across R@1, R@5, and R@10.
Given the importance of incorporating more input frames for long-form video understanding, we finetune TESTA using 96-frame inputs and further improve R@1 to 83.4 on QuerYD and 24.9 on Condensed Movie. This exhibits the strong scalability of our model (see Appendix D for a detailed analysis). Additionally, we report the results of TESTA without token aggregation, which serves as an upper bound for TESTA's performance. Although preserving full visual tokens yields higher recall, it requires 1.8 times more GFLOPs than TESTA. As the number of input frames increases from 32 to 96, the GFLOPs of TESTA w/o agg. exceed 2,300, but the performance gain diminishes (only +0.8 R@1 on QuerYD). This indicates the superiority of our method in aggregating redundant tokens in long sequence inputs.
Table 2 demonstrates model performance on DiDeMo and ActivityNet Caption, which consist of shorter videos (∼100 seconds on average) and are considered less challenging. For 32-frame inputs, TESTA with 5M pre-training data achieves 57.7 R@1 on DiDeMo, even surpassing models pre-trained with over 100M data. By increasing the number of frames to 96, TESTA achieves an R@1 of 59.2 on DiDeMo and 53.7 on ActivityNet, outperforming previous SOTA methods by 2.7 and 2.6, respectively.

Long-Form Video Question-Answering
Table 3 showcases the performance of TESTA on ActivityNet-QA (using 96-frame inputs). TESTA reaches an accuracy of 45.0%, which is 3.2% higher than the previous SOTA, Singularity (Lei et al., 2022). This demonstrates that our method eliminates redundant information while retaining the crucial visual cues needed to answer the posed questions accurately.

Zero-shot Generalizability
In Table 4, we show the zero-shot performance of pre-trained CLIP4Clip, BLIP, and TESTA on three datasets (32 frames). Although our TESTA is initialized from the BLIP checkpoint, it consistently outperforms BLIP (as well as CLIP4Clip) after our pre-training, achieving average improvements of +14.1, +2.9, and +3.8 on QuerYD, DiDeMo, and ActivityNet, respectively. This indicates our substantial gains on long-form video datasets are not solely due to the strong BLIP checkpoint, but also owing to our temporal modeling and pre-training on video data.

Ablation Study
We perform an extensive ablation study and analysis on various crucial components in our aggregation algorithm to examine their impacts.
Token Aggregation vs. Token Pruning. We first compare the performance and efficiency of token aggregation and token pruning (Rao et al., 2021). For pruning, we calculate the importance score (Eq. (1)) for each token and prune the R least important tokens following previous methods (Goyal et al., 2020). We finetune our pre-trained model on QuerYD without token aggregation, then apply token aggregation and pruning in an off-the-shelf manner for test evaluation. The results are presented in the first block of Table 5. In comparison to the vanilla model (no aggregation), both pruning and aggregation decrease computation costs, requiring only 58% of the GFLOPs and 66% of the GPU memory. However, the performance degradation of our token aggregation is much smaller than that of pruning (−2.2 vs. −8.4 in terms of average recall), suggesting that aggregation better preserves the valuable visual semantics within videos.
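The operational difference between the two can be sketched as follows (our NumPy illustration with random stand-in importance scores; nearest-neighbor mean pooling stands in for the full aggregation procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, R = 8, 4, 3
tokens = rng.standard_normal((N, D))
scores = rng.random(N)                     # stand-in importance scores

order = np.argsort(scores)
drop, keep = order[:R], np.sort(order[R:])

# Pruning: the R least important tokens are simply discarded.
pruned = tokens[keep]                      # (N - R, D); their info is lost

# Aggregation: the same R tokens are merged into their most similar
# surviving token instead, so their features still contribute.
merged = tokens[keep].copy()
counts = np.ones(len(keep))
unit = tokens / np.linalg.norm(tokens, axis=-1, keepdims=True)
for i in drop:
    j = int(np.argmax(unit[i] @ unit[keep].T))   # nearest kept token
    merged[j] = (merged[j] * counts[j] + tokens[i]) / (counts[j] + 1)
    counts[j] += 1

assert pruned.shape == merged.shape == (N - R, D)
assert counts.sum() == N                   # every token's feature retained
```

Both paths reach the same token count, but only aggregation keeps a trace of every input token, which matches the smaller recall drop observed in Table 5.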
Ablation on the Aggregation Strategy. To investigate the effectiveness of different aggregation strategies, we report the performance of TESTA using importance-based and geometry-based aggregation. The results in the middle block of Table 5 show that the simpler geometry-based aggregation achieves the best Recall@1 of 83.4, outperforming the importance-based method by 3.2. This confirms our hypothesis that adjacent tokens exhibit greater similarity and should be assigned to separate sets for aggregation.
Ablation on the Aggregation Dimension. We compare three aggregation settings: (1) temporal only, (2) spatial only, and (3) both temporal and spatial. To ensure roughly equal computational overhead, we adjust R_S and R_T accordingly. The results in the bottom block of Table 5 show that performing token aggregation along a single dimension over-dilutes the information in that dimension while leaving the other dimension overly redundant. This imbalance hurts model performance. Therefore, our approach, which incorporates both temporal and spatial aggregation, achieves the best results.
Additionally, Appendix E discusses the impact of the number of reduced tokens R_T and R_S. Appendix F analyzes the properties of aggregated tokens by probing their similarity.

Figure 4: Text grounding visualization. F_i denotes the i-th frame in the video and S_i denotes the i-th sentence in the caption. We calculate the similarity between the phrase query (in orange) and each region formed by our aggregation, then record the value in the region. The phrase queries can be grounded to their corresponding aggregated regions, achieving the highest similarity.

Comparison to Token Merging
We directly compare the performance of ToMe (Bolya et al., 2022) and TESTA by initializing both models from the BLIP pre-trained checkpoint and fine-tuning them on QuerYD.
As we noted in § 3.2.3, due to the extremely high computational complexity of ToMe's global attention, increasing the number of input frames can lead to out-of-memory issues without token aggregation (w/o agg.). Therefore, we limit the number of input frames to 16. Besides, we set the hyper-parameter R (the number of reduced tokens) to ensure matched GFLOPs: for ToMe, R = 197, while for TESTA, R_T = 1 and R_S = 2. The results in Table 6 illustrate TESTA's efficiency and effectiveness for long-form video understanding, which can be attributed to our tailored design for divided spatial-temporal modeling.
In comparison to ToMe, our approach achieves higher recall with fewer GFLOPs, regardless of whether token aggregation is applied.

Visualization
Figure 3 provides a visualization of temporal and spatial aggregation on the DiDeMo dataset. TESTA effectively aggregates tokens with highly similar semantics, demonstrating its strong interpretability.
From a temporal perspective, TESTA aggregates a sequence of frames captured during continuous lens movement (first 3 frames). It also condenses similar frames of athletes waiting for the game (last 3 frames). From a spatial perspective, TESTA merges patches belonging to the same scenes (e.g., sky, baseball park) and the same objects (e.g., billboard, back of the audience's head). More examples can be found in Appendix G.
In Figure 4, we further show that TESTA enables grounding of language to the aggregated visual tokens (Ren et al., 2023b,a). Given a phrase query from the caption, the oracle region formed by our aggregation achieves the highest similarity with the query, facilitating fine-grained alignment between phrases and regions.

Conclusion
In this paper, we present TESTA, an efficient method for long-form video-language understanding. By aggregating similar frames and patches, TESTA effectively condenses video semantics and accelerates video encoding. Experimental results on paragraph-to-video retrieval and VideoQA tasks demonstrate that TESTA outperforms previous SOTA methods by a considerable margin.

Limitations
To facilitate future research, we analyze the limitations of our work and possible solutions. (1) Due to limited computing resources, we do not use long-form video pre-training datasets such as HD-VILA (Xue et al., 2021) or incorporate TESTA during pre-training. We believe long video pre-training with TESTA could greatly improve pre-training efficiency and yield a video-language model with better performance. (2) For aggregation efficiency, we only use video-side features to merge visual tokens. We believe that leveraging text signals for aggregation could make the final encoded features more suitable for downstream tasks. (3) Our model training only uses coarse objectives, i.e., VTC, VTM, and CAP (Eq. (2)-(4)), on video-text pairs. Considering that TESTA can aggregate tokens into objects, scenes, events, etc., training with fine-grained alignment objectives (Ren et al., 2021; Wang et al., 2022c) could help tasks like action localization and video object detection (Zhukov et al., 2019; Real et al., 2017), which we will explore in future work.

C Downstream Datasets
We finetune and evaluate TESTA on two downstream tasks of paragraph-to-video retrieval and long-form VideoQA.The details of these datasets are shown in Table 7.
For paragraph-to-video retrieval, we use four datasets: DiDeMo (Hendricks et al., 2017), QuerYD (Oncescu et al., 2020), ActivityNet Captions (Krishna et al., 2017), and Condensed Movie (Bain et al., 2020). We evaluate text-to-video retrieval, where the text acts as the query, in terms of R@K, the recall (%) of the target video among the top K retrieved results.

Figure 5: Comparison of GFLOPs and Recall@1 on the QuerYD dataset. nF denotes using n input frames for fine-tuning and evaluation. The curve of our TESTA is located in the upper left corner, indicating that our model achieves a better performance-cost tradeoff compared to other pre-trained models.

D Recall-GFLOPs Tradeoff of Various Pre-trained Models
In Figure 5, we analyze the tradeoff between recall and GFLOPs for various pre-trained models. The curve of our TESTA is located in the upper left corner, indicating that our model achieves a superior Recall-GFLOPs tradeoff compared to other pre-trained models.
Furthermore, Figure 5 presents model performance with different numbers of input frames. Surprisingly, increasing the number of input frames from 32 to 96 has minimal impact on the performance of Singularity (Lei et al., 2022) and Frozen (Bain et al., 2021), and even slightly reduces the recall of ALPRO (Li et al., 2021a) and VINDLU (Cheng et al., 2022). In contrast, our TESTA exhibits linear improvement in performance with the number of input frames, demonstrating superior scalability.

E Ablation on the Number of Reduced Tokens
In our TESTA (§ 3.2), R_T and R_S specify the number of tokens to be reduced by the temporal and spatial aggregation modules, respectively. To investigate the influence of these two hyper-parameters, we vary R_T and R_S, then report the average GFLOPs (blue bars) and recall (red stars) on the QuerYD dataset. Figure 6 illustrates the results. On one hand, GFLOPs decrease linearly as R increases, indicating that increasing the number of reduced tokens improves the efficiency of video encoding. On the other hand, merging too many tokens with a large R (e.g., R_T = 10) loses semantic information in the final encoded video representation, leading to a decline in average recall. We evaluate more cases with various R_T and R_S configurations and plot the GFLOPs-Recall tradeoff in Figure 7. Based on these results and analysis, we determined the default configuration for our TESTA: R_T = 4 and R_S = 8 for 96-frame inputs, and R_T = 1 and R_S = 12 for 32-frame inputs. This configuration helps our model achieve approximately a 50% reduction in computation cost without significant performance decline.

F Token Similarity Analysis
We probe the properties of the aggregated tokens by analyzing their similarity. In Figure 8, we report the average similarity between tokens from different blocks, different dimensions (frame tokens or patch tokens), and different aggregation outcomes (aggregated or disaggregated).
For patch tokens (in orange), the overall similarity between them is large (higher than 0.5), indicating considerable spatial redundancy. Meanwhile, the aggregated patch tokens (in dark orange) have a very high similarity of 0.96, which ensures the semantic purity of the aggregated patch tokens. For frame tokens (in blue), their similarity decreases as the number of blocks increases, which may yield aggregated frames with mixed and diverse semantics. Nevertheless, recall that our frame token is a pseudo token (§ 3.2.1) obtained by averaging patch features, which does not elaborately model frame semantics. Therefore, compared to patch tokens, the representation of frame tokens and their similarity measure need improvement, which we regard as future work.

G More Visualization of Aggregation
In this section, we provide more qualitative results of our TESTA for video-language understanding. Figure 9 shows four more cases on the DiDeMo dataset. TESTA effectively aggregates tokens with highly similar semantics, demonstrating its strong interpretability.

Figure 1 :
Figure 1: Two blocks on the left compare ToMe (Bolya et al., 2022) and our TESTA on three aspects: video token definition, aggregation method, and computation complexity. The block on the right illustrates TESTA's divided temporal aggregation (left) and spatial aggregation (right). Patches sharing the same inner and border colors are merged together. Our aggregation gradually reduces the number of frames and patches by averaging their features during the forward process of video encoding.

Figure 2 :
Figure 2: Architecture of our pre-trained model and the token aggregation algorithm of TESTA. We record the sizes of the input and output features in red. The circles in the left panel denote either patch tokens or frame tokens.

Figure 3 :
Figure 3: Visualization of our temporal and spatial aggregation. Frames enclosed within the same red rectangle, as well as patches sharing the same inner and border color, are merged together.

Figure 6 :
Figure 6: Ablation on the number of reduced tokens, R_T (temporal aggregation) and R_S (spatial aggregation). The average recall is represented by red stars, while GFLOPs are depicted by blue bars. The dotted lines denote the results without any aggregation (R_T = 0 and R_S = 0). All results are evaluated on QuerYD with 96 frames.

Figure 7 :
Figure 7: GFLOPs-Recall tradeoff on QuerYD. We record the performance (dots) of TESTA with various R_T-R_S configurations, and plot the trend (curve) by fitting the dots.

Figure 8 :
Figure 8: Cosine similarity between tokens from set A and set B in various video encoder blocks. Blue indicates frame tokens while orange indicates patch tokens. For tokens that are finally aggregated, we plot their similarity in a darker color.

Table 1 :
Paragraph-to-video retrieval performance (Recall@K) on QuerYD and Condensed Movie. #PT Data refers to the number of video-text pairs used for pre-training. † indicates the results of our re-implementation. TESTA w/o agg. denotes fine-tuning our pre-trained model without activating the token aggregation modules, resulting in no reduction in token number; this serves as an upper bound for TESTA's performance.

Table 2 :
Paragraph-to-video retrieval performance on DiDeMo and ActivityNet Caption. We gray out methods that use significantly more pre-training data for a fair comparison. The other notations are the same as those in Table 1.