On Pursuit of Designing Multi-modal Transformer for Video Grounding

Video grounding aims to localize the temporal segment corresponding to a sentence query in an untrimmed video. Almost all existing video grounding methods fall into two frameworks: 1) top-down models, which predefine a set of segment candidates and then conduct segment classification and regression; and 2) bottom-up models, which directly predict frame-wise probabilities of the referential segment boundaries. However, none of these methods is end-to-end: they all rely on time-consuming post-processing steps to refine their predictions. To this end, we reformulate video grounding as a set prediction task and propose a novel end-to-end multi-modal Transformer model, dubbed GTR. Specifically, GTR has two encoders for video and language encoding, and a cross-modal decoder for grounding prediction. To facilitate end-to-end training, we use a Cubic Embedding layer to transform the raw videos into a set of visual tokens. To better fuse the two modalities in the decoder, we design a new Multi-head Cross-Modal Attention. The whole GTR is optimized via a Many-to-One matching loss. Furthermore, we conduct comprehensive studies to investigate different model design choices. Extensive results on three benchmarks validate the superiority of GTR: all three typical GTR variants achieve record-breaking performance on all datasets and metrics, with several times faster inference speed.


Introduction
Video grounding is a fundamental yet challenging task for video understanding and has recently attracted unprecedented research attention (Chen et al., 2018, 2019b; Zhang et al., 2019a; Liu et al., 2018a; Yuan et al., 2021). Formally, it aims to identify the two temporal boundaries of the moment of interest based on an input untrimmed video and a natural language query. Compared to the conventional video action localization task (Shou et al., 2016; Zhao et al., 2017; Zhang et al., 2021), video grounding is more general and taxonomy-free, i.e., it is not limited to predefined action categories.

* Work started when Long Chen was at Tencent AI Lab. † Corresponding author.

Figure 1: Performance comparisons on TACoS in terms of R@1, IoU@0.5 and Query Per Second (the number of queries that are retrieved each second during inference). Marker sizes are proportional to the model size. Our GTR-H is 4.9% better than 2D-TAN (Zhang et al., 2020b) with 5 times faster speed.

The overwhelming majority of state-of-the-art video grounding methods fall into two frameworks: 1) Top-down models (Anne Hendricks et al., 2017; Gao et al., 2017; Ge et al., 2019; Chen et al., 2018; Zhang et al., 2019b; Liu et al., 2018b; Yuan et al., 2019a): they follow a propose-and-rank pipeline, first generating a set of moment proposals and then selecting the best-matching one. To avoid proposal bottlenecks and achieve high recall, a vast number of proposals are needed; correspondingly, time-consuming post-processing steps (e.g., non-maximum suppression, NMS) are introduced to eliminate redundancy, which makes the matching process inefficient (cf. Figure 1). 2) Bottom-up models (Mun et al., 2020; Rodriguez et al., 2020; Zeng et al., 2020; Chen et al., 2020; Lu et al., 2019): they directly regress the two temporal boundaries of the referential segment from each frame, or predict boundary probabilities frame-wisely. Similarly, they need post-processing steps to group or aggregate all frame-wise predictions.
Although these two types of methods have achieved impressive progress in video grounding, they still suffer from several notorious limitations: 1) For top-down methods, the heuristic proposal generation process introduces a series of hyper-parameters, and the whole inference stage is computation-intensive due to the densely placed candidates. 2) For bottom-up methods, the frame-wise prediction manner overlooks fruitful temporal context relationships, which strictly limits their performance. 3) None of these methods is end-to-end: they need complex post-processing steps to refine predictions and easily fall into local optima.
In this paper, we reformulate video grounding as a set prediction problem and propose a novel end-to-end multi-modal Transformer model GTR (video Grounding with TRansformer). GTR has two different encoders for video and language feature encoding, and a cross-modal decoder for final grounding prediction. Specifically, we use a Cubic Embedding layer to transform the raw video data into a set of visual tokens, and regard all word embeddings of the language query as textual tokens. Both the visual and textual tokens are then fed into two individual Transformer encoders for single-modal context modeling. Afterwards, these contextualized visual and textual tokens serve as one input to the cross-modal decoder. The other input to the decoder is a set of learnable segment queries, each of which tries to regress a video moment by interacting with the contextualized tokens. To better fuse the two modalities in the decoder, we design a new Multi-head Cross-Modal Attention module (MCMA). The whole GTR model is trained end-to-end by optimizing a Many-to-One Matching Loss, which produces an optimal bipartite matching between predictions and the ground truth. Thanks to this simple pipeline and the effective relationship modeling capability of the Transformer, GTR is both effective and computationally efficient, with extremely fast inference speed (cf. Figure 1).
Since our community has little empirical experience in determining the best design choices for multi-modal Transformer-family models, we conduct extensive exploratory studies on GTR to investigate the influence of different model designs and training strategies, including: (a) Visual token acquisition. We use a cubic embedding layer to transform raw videos into visual tokens, and discuss its design along three dimensions. (b) Multi-modal fusion mechanism. We propose six types of multi-modal fusion mechanisms and thoroughly compare their performance and computation cost. (c) Decoder design principles. We explore key design principles for a stronger multi-modal Transformer decoder, such as the trade-off between depth and width, and the number of attention heads in different layers. (d) Training recipes. We discuss the influence of several training tricks. We hope our exploration results and summarized take-away guidelines can help open the door to designing more effective and efficient Transformer models for multi-modal tasks.
In summary, we make three contributions in this paper: 1. We propose the first end-to-end model, GTR, for video grounding, which is inherently efficient with extremely fast inference speed. 2. Through careful design of each component, all variants of GTR achieve new state-of-the-art performance on three datasets and all metrics. 3. Most importantly, our comprehensive explorations and empirical results can help guide the design of multi-modal Transformer-family models in other multi-modal tasks.

Related Work
Video Grounding. The overwhelming majority of state-of-the-art video grounding methods are top-down models (Anne Hendricks et al., 2017; Gao et al., 2017; Ge et al., 2019; Liu et al., 2018a, 2020; Zhang et al., 2019a, 2020b; Chen et al., 2018; Yuan et al., 2019a; Wang et al., 2020; Xu et al., 2019; Xiao et al., 2021a,b). Although these top-down models have dominated the benchmarks, they suffer from two inherent limitations: 1) the densely placed proposal candidates lead to heavy computation costs; 2) their performance is sensitive to the heuristic rules (e.g., the number and size of anchors). The other type of method is bottom-up models (Yuan et al., 2019b; Lu et al., 2019; Zeng et al., 2020; Chen et al., 2018, 2020; Zhang et al., 2020a). Some works (Wang et al., 2019) resort to reinforcement learning to guide the boundary prediction adjustment. However, all existing methods (both top-down and bottom-up) are not end-to-end and require complex post-processing steps. In this paper, we propose an end-to-end model GTR, which directly generates predictions with ultra-fast inference speed. Vision Transformer. Transformer (Vaswani et al., 2017) is the de facto standard language modeling architecture in the NLP community. Recently, the pioneering object detection model DETR (Carion et al., 2020) formulated object detection as a set prediction problem and used an end-to-end Transformer architecture to achieve state-of-the-art performance. Due to its end-to-end nature, DETR has revived the CV community's interest in the Transformer, and a host of vision Transformer models have been proposed for different vision understanding tasks, such as image classification, object detection (Carion et al., 2020), tracking (Meinhardt et al., 2021; Sun et al., 2020), person re-identification, image generation, super resolution, and video relation detection (Gao et al., 2021).
Unlike previous methods that focus only on the vision modality, our GTR is a multi-modal Transformer model, which not only needs to consider multi-modal fusion, but also has little prior empirical experience to build on for its model design.

Video Grounding with TRansformer
As shown in Figure 2, GTR yields temporal segment predictions semantically corresponding to the given query through four consecutive steps: (1) Input Embedding. Given a raw video and a query, this step encodes them into the feature space (i.e., visual and textual tokens). (2) Encoder. The visual and textual token embeddings are enhanced with standard Transformer encoders by modeling intra-modality correlations. (3) Cross-Modal Decoder. The contextualized visual and textual token embeddings are fused by a Multi-head Cross-Modal Attention module (MCMA), and a set of learnable segment queries are fed into the decoder to interact with the two modal features. (4) Prediction Heads. A simple feed-forward network (FFN) is applied to predict the final temporal segments.

Input Embedding and Encoder
Video Cubic Embedding. To build a pure Transformer model without reliance on CNNs, ViT (Dosovitskiy et al., 2021) decomposes input images into a set of non-overlapping patches. To process video data, a straightforward solution is to apply this partition to each frame. However, this simple extension overlooks frame-wise temporal correlations. Thus, we propose a Cubic Embedding layer which directly extracts 3D visual tokens along the height, width, and temporal dimensions (cf. Figure 2). Formally, given a raw video, we first sample it at frame rate 1/γ_τ and obtain a video clip V ∈ R^{T×H×W×3}. Then, we use a sampling kernel κ of shape (k_h, k_w, k_t) to transform the video into visual tokens, where κ is propagated with stride size (s_h, s_w, s_t), and s_h, s_w, and s_t denote the stride in the height, width, and temporal dimension, respectively. Each sampled 3D visual patch is fed into a projection layer, and the output of the cubic embedding layer is a set of visual token embeddings I_v ∈ R^{F×d}, where F is the number of visual tokens and d is the output dimension of the projection layer. In our experiments, we set d to the hidden dimension of the Transformer. Apparently, F = O_h × O_w × O_t, where O_h = ⌊(H − k_h)/s_h⌋ + 1, O_w = ⌊(W − k_w)/s_w⌋ + 1, and O_t = ⌊(T − k_t)/s_t⌋ + 1. Compared to non-overlapping tokenization, our cubic embedding layer allows overlapping sampling, which implicitly fuses adjacent spatial-temporal context (more experiments and discussions about the cubic embedding layer are given in Sec. 4.2). In particular, when setting s_h = k_h, s_w = k_w, and s_t = k_t = 1, our cubic embedding degrades into the prevalent patch embedding in ViT.

Sentence Embedding. For the language input, we first encode each word with pretrained GloVe embeddings (Pennington et al., 2014), and then employ a bi-directional GRU to integrate the sentence-level embedding features I_s ∈ R^{S×d_s}, where S is the length of the sentence query and d_s is the dimension of the textual token embeddings.

Encoder. We use two plain Transformer encoders to model the visual and textual intra-modality context, respectively. Specifically, for the video token embeddings I_v, we apply an encoder with N_v layers to obtain the corresponding contextualized visual tokens H ∈ R^{F×d}. Similarly, for the textual token embeddings I_s, we use an encoder with N_s layers to obtain contextualized textual tokens S. Another feed-forward network with two fully-connected layers is applied to adjust S to the same channel dimension as H, i.e., S ∈ R^{S×d}.
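To make the tokenization concrete, below is a minimal NumPy sketch of the cubic embedding computation. The function name, the random projection (which stands in for the learned linear layer, trained end-to-end in the real model), and the default hyper-parameters are illustrative:

```python
import numpy as np

def cubic_embed(video, k=(8, 8, 3), s=(8, 8, 2), d=320, rng=None):
    """Sketch of the Cubic Embedding layer.

    video: array of shape (T, H, W, 3). Extracts (possibly overlapping)
    3D patches with kernel k = (k_h, k_w, k_t) and stride s = (s_h, s_w, s_t),
    then linearly projects each flattened patch to a d-dim token.
    """
    T, H, W, C = video.shape
    kh, kw, kt = k
    sh, sw, st = s
    # Number of tokens per dimension: O = (size - kernel) // stride + 1.
    Oh = (H - kh) // sh + 1
    Ow = (W - kw) // sw + 1
    Ot = (T - kt) // st + 1
    if rng is None:
        rng = np.random.default_rng(0)
    # Random projection stands in for the learned linear projection layer.
    Wp = rng.standard_normal((kh * kw * kt * C, d)) * 0.02
    tokens = np.empty((Ot * Oh * Ow, d))
    i = 0
    for t in range(Ot):
        for y in range(Oh):
            for x in range(Ow):
                patch = video[t*st:t*st+kt, y*sh:y*sh+kh, x*sw:x*sw+kw, :]
                tokens[i] = patch.reshape(-1) @ Wp
                i += 1
    return tokens  # shape (F, d) with F = Ot * Oh * Ow
```

With the paper's settings (112 × 112 frames, kernel (8, 8, 3), stride (8, 8, 2)), a 17-frame clip yields O_h = O_w = 14 and O_t = 8, i.e., F = 1568 tokens; the overlap comes from s_t < k_t along the temporal axis.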

Cross-modal Decoder
As shown in Figure 2, the inputs to the cross-modal decoder consist of the visual features H ∈ R^{F×d}, the language features S ∈ R^{S×d}, and a set of learnable segment queries Q ∈ R^{N×d}. Each segment query q_i ∈ Q tries to learn a possible moment by interacting with H and S, and the decoder decodes N moment predictions in parallel. For accurate segment localization, video grounding requires modeling fine-grained cross-modal relations.
To this end, we design a cross-modal decoder with a novel Multi-head Cross-Modal Attention module (MCMA). As shown in Figure 3, we propose several specific instantiations of MCMA (more details are left in the supplementary materials):

Joint Fusion. Given the visual and language features H and S, we first generate modal-specific key (H_k, S_k) and value (H_v, S_v) pairs by linear transformations: H_k = H W_h^k, H_v = H W_h^v, S_k = S W_s^k, and S_v = S W_s^v, where W_h^k, W_h^v, W_s^k, W_s^v ∈ R^{d×d} are all learnable parameters. Joint fusion then concatenates the two modalities before conducting the attention computation: Q̂ = MHA(Q̃, H_k ⊗ S_k, H_v ⊗ S_v), where Q̃ is the enhanced segment query embedding after self-attention, ⊗ denotes concatenation (along the token dimension here), and MHA stands for the standard Multi-head Attention (Vaswani et al., 2017).

Divided Fusion. Divided fusion provides a modality-specific attention computation: it decomposes the multi-modal fusion into two parallel branches whose results are summed with learnable weights, Q̂ = MHA(Q̃, H_k, H_v) ⊕ MHA(Q̃, S_k, S_v), where Q̃, H_k, H_v, S_k, and S_v are defined as in joint fusion, and ⊕ denotes the additive sum with learnable weights.

Hybrid Fusion. Hybrid fusion offers a compromise between joint fusion and divided fusion: the query-key multiplication is conducted separately per modality, while the query-value multiplication is still in a concatenated format. Suppose there are n_h attention heads; the query, key, and value embeddings are uniformly split into n_h groups, each of dimension d_h = d/n_h. For each head, we apply hybrid fusion in the form head_i = [σ(Q̃_i H_{k,i}^⊤/√d_h) ⊗ σ(Q̃_i S_{k,i}^⊤/√d_h)] (H_{v,i} ⊗ S_{v,i}), where σ is the softmax function. The outputs of all heads are then concatenated along the channel dimension, and a linear projection produces the final output: Q̂ = (head_1 ⊗ ... ⊗ head_{n_h}) W_o.

Stepwise Fusion. This fusion manner implements cross-modality reasoning in a cascaded way, i.e., attention is first computed between Q̃ and the video features and then propagated to the sentence modality: Q̂ = MHA(MHA(Q̃, H_k, H_v), S_k, S_v). We further discuss more multi-modal fusion mechanisms in Sec. 4.2 and the supplementary materials.
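To make the fusion variants concrete, the following NumPy sketch implements single-head versions of the joint, divided, and stepwise mechanisms. The real MCMA uses multi-head attention with separate learned key/value projections per modality; here a single shared head suffices to show the data flow, and `alpha` stands in for the learnable combination weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, k, v):
    # Single-head scaled dot-product attention (stand-in for MHA).
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def joint_fusion(qt, h, s):
    # Concatenate the two modalities along the token axis, then attend once.
    return attn(qt, np.concatenate([h, s]), np.concatenate([h, s]))

def divided_fusion(qt, h, s, alpha=0.5):
    # Parallel per-modality attention, combined with (learnable) weights.
    return alpha * attn(qt, h, h) + (1 - alpha) * attn(qt, s, s)

def stepwise_fusion(qt, h, s):
    # Cascade: queries attend to video features first, then to language.
    return attn(attn(qt, h, h), s, s)
```

All three take the (self-attention-enhanced) segment queries `qt` of shape (N, d), video features `h` of shape (F, d), and language features `s` of shape (S, d), and return updated queries of shape (N, d).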

Many-to-One Matching Loss
Training: Based on the cross-modal decoder output, a feed-forward network is applied to generate a fixed-size set of N predictions Ŷ = {(b̂_i, p̂_i)}_{i=1}^N, where b̂_i is a predicted temporal segment and p̂_i its confidence score. GTR applies a set prediction loss (Carion et al., 2020; Stewart et al., 2016) between the fixed-size output set and the ground truth. Notably, since each language query corresponds to only one temporal segment, we adapt the many-to-many matching in (Carion et al., 2020) to a many-to-one version. Specifically, the loss computation consists of two consecutive steps. First, we determine the optimal prediction slot via a matching cost based on segment similarity and confidence scores:

i* = argmin_i [ −p̂_i + λ_1 ||b − b̂_i||_1 + λ_2 L_giou(b, b̂_i) ],   (1)

where b is the ground-truth segment and L_giou is a loss based on the scale-invariant generalized intersection over union (Rezatofighi et al., 2019). In our many-to-one matching loss, finding the optimal match requires only one pass over the N predictions, rather than checking all possible permutations as in (Carion et al., 2020), which greatly simplifies the matching process. The second step computes the loss on the matched pair:

L = −log p̂_{i*} + λ_1 ||b − b̂_{i*}||_1 + λ_2 L_giou(b, b̂_{i*}),

where i* is the optimal match computed in Eq. (1).

Inference: During inference, the predicted segment set is generated in one forward pass, and the result with the highest confidence score is selected as the final prediction. The whole inference process requires no predefined threshold values or specific post-processing steps.
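As a sketch, the matching step and the 1D generalized IoU it relies on can be written as follows. The function names and the cost weights `l1`, `l2` are illustrative, and segments are (start, end) pairs:

```python
def giou_1d(a, b):
    # Generalized IoU for 1D segments a = (s1, e1), b = (s2, e2).
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    iou = inter / union if union > 0 else 0.0
    # "Hull" is the smallest segment enclosing both; GIoU penalizes the gap.
    hull = max(a[1], b[1]) - min(a[0], b[0])
    return iou - (hull - union) / hull if hull > 0 else iou

def many_to_one_match(segments, scores, gt, l1=1.0, l2=1.0):
    # One pass over the N prediction slots: pick the slot minimizing a
    # DETR-style cost combining confidence, L1 distance, and GIoU loss.
    def cost(i):
        b = segments[i]
        dist = abs(b[0] - gt[0]) + abs(b[1] - gt[1])
        return -scores[i] + l1 * dist + l2 * (1.0 - giou_1d(b, gt))
    return min(range(len(segments)), key=cost)
```

Because there is exactly one ground-truth segment per query, this linear scan over the slots replaces the Hungarian matching over permutations used in DETR.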

Experiments
We first introduce experimental settings in Sec. 4.1. Then, we present detailed exploratory studies on the design of GTR in Sec. 4.2. The comparisons with SOTA methods are discussed in Sec. 4.3, and we show visualization results in Sec. 4.4. More results are left in supplementary materials.

Settings
Datasets. We evaluated GTR on three challenging video grounding benchmarks: 1) ActivityNet Captions (ANet) (Krishna et al., 2017): the average video length is around 2 minutes, and the average length of ground-truth video moments is 40 seconds. By convention, we used 37,417 video-query pairs for training, 17,505 pairs for validation, and 17,031 pairs for testing. 2) Charades-STA (Gao et al., 2017): the average length of each video is around 30 seconds. Following the official splits, we used 12,408 video-query pairs for training and 3,720 pairs for testing. 3) TACoS (Regneri et al., 2013): a challenging dataset focusing on cooking scenarios. Following previous works (Gao et al., 2017), we used 10,146 video-query pairs for training, 4,589 pairs for validation, and 4,083 pairs for testing.

Evaluation Metrics. Following prior works, we adopt "R@n, IoU@m" (denoted as R_n^m) as the metric. Specifically, R_n^m is defined as the percentage of queries for which at least one of the top-n retrieved moments has an IoU with the ground-truth moment larger than m.

Model Variants. Following the practice of vision Transformers and BERTs (Dosovitskiy et al., 2021; Devlin et al., 2019), we evaluate three typical model sizes: GTR-Base (GTR-B, N_v = 4, N_s = 4, N_d = 6, d = 320), GTR-Large (GTR-L, N_v = 6, N_s = 6, N_d = 8, d = 320), and GTR-Huge (GTR-H, N_v = 8, N_s = 8, N_d = 8, d = 512).

Implementation Details. For the input video, the sampling rate 1/γ_τ was set to 1/8, all frames were resized to 112 × 112, and the kernel shape and stride size were set to (8, 8, 3) and (8, 8, 2), respectively. We used AdamW (Loshchilov and Hutter, 2017) with a momentum of 0.9 as the optimizer; the initial learning rate and weight decay were both set to 10^−4. All weights of the encoders and decoders were initialized with Xavier initialization, and the cubic embedding layer was initialized from the ImageNet-pretrained ViT (Dosovitskiy et al., 2021). We used random flip, random crop, and color jitter for video data augmentation.
Experiments were conducted on 16 V100 GPUs with batch size 64.
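The "R@n, IoU@m" metric described above can be sketched as follows (hypothetical helper names; moments are (start, end) pairs in seconds):

```python
def iou_1d(a, b):
    # Temporal IoU between two (start, end) moments.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(topn_per_query, gt_per_query, n, m):
    # "R@n, IoU@m": percentage of queries for which at least one of the
    # top-n retrieved moments has IoU > m with the ground-truth moment.
    hits = sum(
        any(iou_1d(p, gt) > m for p in preds[:n])
        for preds, gt in zip(topn_per_query, gt_per_query)
    )
    return 100.0 * hits / len(gt_per_query)
```

For example, with ranked predictions [(0, 10), (5, 15)] against ground truth (0, 9), the top-1 moment already has IoU 0.9 and counts as a hit at m = 0.5.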

Empirical Studies and Observations
In this subsection, we conducted extensive studies on different design choices of GTR, and tried to answer four general questions: Q1: How to transform a raw video into visual tokens? Q2: How to fuse the video and text features? Q3: Are there any design principles to make a stronger Transformer decoder? Q4: Are there any good training recipes?

Visual Tokens Acquisition (Q1)
Settings. The cubic embedding layer has two sets of hyper-parameters: the kernel shape (k_w, k_h, k_t) and the stride size (s_w, s_h, s_t). We started our studies from a basic GTR-B model (with the stepwise fusion strategy and s_w = k_w = s_h = k_h = 8, k_t = 3, s_t = 2), and discussed design choices from three aspects: 1) Spatial configuration: we compared four GTR-B variants with different kernel spatial sizes k_w = k_h ∈ {8, 12, 16, 24}, denoted as GTR-B/*. 2) Temporal configuration: we varied the kernel temporal depth k_t. 3) Overlapping configuration: we compared the three basic types of sampling (temporal overlapping, spatial overlapping, and non-overlapping).
Observations. 1) Models with smaller patch sizes (e.g., GTR-B/8) achieve better performance, yet at the cost of a dramatic increase in FLOPs. 2) Models with larger kernel temporal depth (k_t) do not always achieve better results. It is worth noting that our cubic embedding degrades into the prevalent frame-wise partition of vision Transformers when setting k_t = s_t = 1; we compared cubic embedding with this special case in Table 2, and the results show the superiority of our cubic embedding layer. 3) Compared to non-overlapping and spatial overlapping, temporal overlapping helps to improve model performance significantly. Besides, the performance is not sensitive to the overlapping degree in all overlapping cases.

Guides. The temporal overlapping sampling strategy can greatly boost the performance of the Cubic Embedding layer at an affordable overhead.

Multi-modal Fusion Mechanisms (Q2)
Settings. As mentioned in Sec. 3.2, we design four types of multi-modal fusion mechanisms in the decoder (i.e., joint/divided/hybrid/stepwise fusion), which we group together as late fusion. We also propose two additional fusion mechanisms: 1) Early fusion: the multi-modal features are fused before being fed into the decoder. 2) Conditional fusion: the language features act as conditional signals for the segment queries of the decoder. Results are reported in Table 3.

Table 4: Performance comparisons on three benchmarks (%). All reported results on the ANet and TACoS datasets are based on C3D (Tran et al., 2015) extracted features. "*" denotes finetuning on the corresponding backbones. For fair comparisons, Param (M) includes the parameters of the feature extractor (C3D). GTR-B is more efficient while GTR-H achieves the highest recall.

Observations. 1) All four late fusion models outperform the early fusion and conditional fusion ones. 2) Among the late fusion models, stepwise fusion achieves the best performance while using the largest number of parameters. 3) The performance is not sensitive to the crop size, but is positively correlated with the sample rate and converges gradually (cf. Figure 4). 4) The FLOPs do not increase significantly with crop size or sample rate (cf. Figure 4).

Guides. Stepwise fusion is the optimal fusion mechanism, even for long and high-resolution videos.

Decoder Design Principles (Q3)
Settings. To explore the key design principles of a stronger multi-modal Transformer decoder, we considered two aspects: 1) Deeper vs. Wider, i.e., should the decoder go deeper or wider? Based on the basic GTR-B model, we designed two GTR-B variants with nearly equivalent parameters: GTR-B-Wide (d = 352) and GTR-B-Deep (N_d = 8). The results are reported in Table 5(a). 2) Columnar vs. Shrink vs. Expand, i.e., how should the number of attention heads be set in different layers? We tried three different choices: i) the same number of heads in all layers (columnar); ii) gradually decreasing the number of heads (shrink); iii) gradually increasing the number of heads (expand). Results are reported in Table 5.

Guides. Going deeper is more effective than going wider, and progressively expanding the attention heads leads to better performance.


Training Recipes (Q4)
Settings. Due to the lack of inductive biases, vision Transformers tend to over-rely on large-scale datasets for training (Dosovitskiy et al., 2021; Touvron et al., 2020). To make multi-modal Transformers work on relatively small multi-modal datasets, we discussed several common training tricks: 1) Pretrained weights. Following the pretrain-then-finetune paradigm of CNN-based models, we initialized our video cubic embedding layer from the pretrained ViT. Specifically, we initialized the 3D linear projection filters by replicating the 2D filters along the temporal dimension. 2) Data augmentations. To study the impact of different data augmentation strategies, we applied three typical choices sequentially, i.e., random flip, random crop, and color jitter. Results are reported in Table 6.
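The 2D-to-3D initialization can be sketched as below. This follows the I3D-style inflation recipe; the 1/k_t rescaling, which keeps the response to a temporally constant input unchanged, is our assumption about the implementation:

```python
import numpy as np

def inflate_2d_to_3d(w2d, kt):
    """Inflate a 2D patch-projection filter of shape (kh, kw, C, d) into a
    3D filter of shape (kt, kh, kw, C, d) by replicating it along the
    temporal dimension and rescaling by 1/kt, so that a video whose frames
    all equal the original image yields the original 2D response."""
    return np.repeat(w2d[np.newaxis], kt, axis=0) / kt
```

Summing the inflated filter over its temporal axis recovers the original 2D filter, which is the invariant the rescaling preserves.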
Observations. 1) Using the pretrained cubic embedding weights helps to improve model performance significantly. 2) Color jitter brings the largest performance gains among all visual data augmentation strategies. We conjecture that this is because color jitter changes the visual appearance without changing the structured information, which is critical for multi-modal tasks.
Guides. Using pretrained embedding weights and color jitter video augmentation are two important training tricks for multi-modal Transformers.

Comparisons with State-of-the-Arts
Settings. We compared the three variants of GTR (i.e., GTR-B, GTR-L, and GTR-H) with state-of-the-art video grounding models: 2D-TAN (Zhang et al., 2020b), DRN (Zeng et al., 2020), SCDM (Yuan et al., 2019a), QSPN (Xu et al., 2019), and CBP (Wang et al., 2020). These video grounding methods are based on pre-extracted video features (e.g., C3D or I3D), while our GTRs were trained in an end-to-end manner. For fairer comparisons, we selected the two best performers (2D-TAN and DRN) and retrained them by finetuning the feature extraction network. Results are reported in Table 4.
Results. From Table 4, we have the following observations: 1) All three GTR variants outperform existing state-of-the-art methods on all benchmarks and evaluation metrics. In particular, GTR-H achieves significant gains on the stricter metrics (e.g., 6.25% and 4.23% absolute improvements on Charades-STA with metric R_1^{0.7} and on TACoS with metric R_1^{0.5}, respectively). 2) The inference speed of all GTR variants is much faster than that of existing methods (e.g., 13.4 QPS for GTR-B vs. 2.36 QPS for 2D-TAN). Meanwhile, GTR has fewer parameters than existing methods (e.g., 25.98M for GTR-B vs. 93.37M for 2D-TAN).

Visualizations
Self-Attention in the Video Encoder. We show an example from the TACoS dataset in Figure 5. For a given video snippet (e.g., the 30-th frame), we selected two reference patches and visualized the attention weight heatmaps of the last self-attention layer on four other frames (the 10-th to the 50-th frame). From Figure 5, we can observe that the self-attention in the video encoder effectively focuses on the semantically corresponding areas (e.g., the person and the chopping board) even across long temporal ranges, which is beneficial for encoding global context.

Cross-Attention in the Decoder. An example from the ANet dataset is presented in Figure 6. We adopted the stepwise fusion strategy for the MCMA in the decoder. For the output, we selected the segment query slot with the highest confidence score and visualized its attention weights over the two modal features. For the background video frames (#1), the decoder mainly focuses on the figure outline. For the ground-truth video frames (#2, #3), it captures the most informative parts (e.g., the moving hands and the eye-touching action), which play an important role in localization reasoning. As for the language attention weights, the decoder focuses on essential words, e.g., the salient objects (lady, eye) and actions (put).

Conclusions and Future Works
In this paper, we propose the first end-to-end multi-modal Transformer-family model, GTR, for video grounding. By carefully designing several novel components (e.g., the cubic embedding layer, the multi-head cross-modal attention module, and the many-to-one matching loss), GTR achieves new state-of-the-art performance on three challenging benchmarks. As a pioneering multi-modal Transformer model, we also conducted comprehensive explorations to summarize several empirical guidelines for model design, which can help open the door for future research on multi-modal Transformers. Moving forward, we aim to extend GTR to more general settings, such as weakly-supervised video grounding and spatio-temporal video grounding.

Acknowledgement
This paper was partially supported by National Engineering Laboratory for Video Technology-Shenzhen Division, and Shenzhen Municipal Development and Reform Commission (Disciplinary Development Program for Data Science and Intelligent Computing). It is also supported by National Natural Science Foundation of China (NSFC 6217021843). Special acknowledgements are given to AOTO-PKUSZ Joint Research Center of Artificial Intelligence on Scene Cognition technology Innovation for its support.

GTR Variant Settings
We list typical GTR variant parameter settings in Table 7.

We also compare our Cubic Embedding layer with the frame-wise partition used in vision Transformers. Part of the results are listed in the main paper, and the full results are shown in Table 9, which demonstrates the superiority of our cubic embedding layer.

Decoder Design Principles
Columnar vs. Shrink vs. Expand. We provide more experiments to determine how to set the attention head number in different layers. We set up three different distributions of attention heads: i) the same number of heads in all layers (columnar); ii) gradually decreasing the number of heads (shrink); iii) gradually increasing the number of heads (expand). The results are listed in Table 10 and are consistent with our conclusion that progressively expanding the attention heads leads to better performance.

Deep vs. Wide. We developed two variants (GTR-B-Wide, GTR-B-Deep) and reported results on the ActivityNet Captions and TACoS datasets, which demonstrate that going deeper is more effective than going wider. Here we provide two additional variants (GTR-L-Wide, GTR-L-Deep) and report results on all three datasets in Table 12. Similarly, the deep model also outperforms the wide model, which further confirms our conclusion.

Multi-modal Fusion Mechanisms
In general, multi-modal fusion methods can be divided into three categories (cf. Figure 7): 1) Early fusion: the multi-modal features are fused before being fed into the decoder. 2) Conditional fusion: the language feature acts as a conditional signal for the segment queries of the decoder; we concatenate it with the learnable segment query features. 3) Late fusion: multi-modal fusion is conducted within the decoder via the proposed Multi-head Cross-Modal Attention (MCMA) module. For the specific instantiations of MCMA, besides the four fusion strategies in the main paper, we present two additional variants in Figure 8. 1) Hybrid Fusion (value split) concatenates the key features and computes the query-value multiplication separately. Specifically, Q̃ ∈ R^{N×d}, H_k ∈ R^{F×d}, and S_k ∈ R^{S×d} generate an attention matrix of shape N × (F + S), which is divided into two parts of shape N × F and N × S, respectively. These two divided attention weights are then applied to H_v and S_v separately to generate the final results.

2) Stepwise Fusion (language-vision) fuses the language features first and then the video features.

The experimental results are presented in Table 13, and we have the following findings: 1) Stepwise fusion achieves the best performance on all three datasets while using the largest number of parameters.
2) The performance of hybrid fusion (value split) is almost the same as that of hybrid fusion. 3) Stepwise fusion (L-V) also performs similarly to stepwise fusion, which demonstrates that the performance is not sensitive to the order in which the visual and language information is fused.
To investigate the influence of the frame crop size and sample rate, we conducted more experiments on stepwise fusion models. The results in Table 11 show that 1) the performance is not sensitive to the frame crop size; and 2) the performance is positively correlated with the sample rate but gradually reaches convergence.

Performance of GTR
The performance of GTR under more IoU thresholds is available in Table 14.

Table 13: Multi-modal fusion comparisons.
