Parallel Attention Network with Sequence Matching for Video Grounding

Given a video, video grounding aims to retrieve a temporal moment that semantically corresponds to a language query. In this work, we propose a Parallel Attention Network with Sequence matching (SeqPAN) to address the challenges in this task: multi-modal representation learning, and target moment boundary prediction. We design a self-guided parallel attention module to effectively capture self-modal contexts and cross-modal attentive information between video and text. Inspired by sequence labeling tasks in natural language processing, we split the ground truth moment into begin, inside, and end regions. We then propose a sequence matching strategy to guide start/end boundary predictions using region labels. Experimental results on three datasets show that SeqPAN is superior to state-of-the-art methods. Furthermore, the effectiveness of the self-guided parallel attention module and the sequence matching module is verified.


Introduction
Video grounding is a fundamental and challenging problem in the vision-language understanding research area (Hu et al., 2019; Yu et al., 2019; Zhu and Yang, 2020). It aims to retrieve a temporal video moment that semantically corresponds to a given language query, as shown in Figure 1. This task requires techniques from both computer vision (Tran et al., 2015; Shou et al., 2016; Feichtenhofer et al., 2019) and natural language processing (Yang et al., 2019), and, more importantly, cross-modal interactions between the two. Many existing solutions (Liu et al., 2018a) tackle the video grounding problem with a proposal-based approach. This approach generates proposals with pre-set sliding windows or anchors, computes the similarity between the query and each proposal, and selects the proposal with the highest score as the answer. These methods are sensitive to the quality of proposals and are inefficient because all proposal-query pairs must be compared. Recently, several one-stage proposal-free solutions (Chen et al., 2019; Lu et al., 2019a; Mun et al., 2020) have been proposed to directly predict the start/end boundaries of target moments by modeling video-text interactions. Our solution, SeqPAN, is a proposal-free method; hence our key focuses are video-text interaction modeling and moment boundary prediction.
Video-text interaction modeling. To model video-text interaction, various attention-based methods have been proposed (Gao et al., 2017; Mun et al., 2020). In particular, the transformer block (Vaswani et al., 2017) is widely used in vision-language tasks and has proved effective for multimodal learning (Tan and Bansal, 2019; Lu et al., 2019b; Su et al., 2020; Li et al., 2020). In the video grounding task, fine-grained unimodal representations are important for achieving good localization performance. However, existing solutions do not refine the unimodal representations of video and text when performing cross-modal reasoning, which limits their performance.
To better capture informative features for both modalities, we encode both self-attentive contexts and cross-modal interactions from video and query. That is, instead of solely relying on sophisticated cross-modal learning as in most existing studies, we learn both intra- and inter-modal representations simultaneously, with improved attention modules.

(Figure 2 caption: "ORG" is for "Organization"; "B", "I", and "E" denote the begin, inside, and end of the organization entity, respectively.)

Moment boundary prediction. In terms of length, the target moment is usually a very small portion of the video, making positive (frames in the target moment) and negative (frames not in the target moment) samples imbalanced. Further, we aim to predict the exact start/end boundaries (i.e., two video frames) of the target moment. Viewed from the space of video frames, sparsity is a major concern, e.g., catching two frames among thousands.
Recent studies attempt to address this issue with auxiliary objectives, e.g., discriminating whether each frame is foreground (positive) or background (negative) (Yuan et al., 2019b; Mun et al., 2020), or regressing the distances from each frame within the target moment to the ground-truth boundaries (Lu et al., 2019a; Zeng et al., 2020). However, the "sequence" nature of frames or videos is not considered. We emphasize this "sequence" nature of video frames and adapt the concept of sequence labeling in NLP to video grounding. We use named entity recognition (NER) (Lample et al., 2016; Ma and Hovy, 2016) as an example sequence labeling task for illustration in Figure 2. Video grounding retrieves a sequence of frames, with the start/end boundaries of the target moment, from a video. This is analogous to extracting a multi-word named entity from a sentence. The main difference is that words are discrete, so word annotations (i.e., B, I, E, and O tags) in a sentence are discrete. In contrast, video is continuous and the changes between consecutive frames are smooth. Hence, it is difficult (and also unnecessary) to precisely annotate each frame. (The "frame" here is a general description, which can refer to a frame in a video sequence or a unit in the corresponding video feature representation.) We relax the annotations on the video sequence by specifying video regions instead of frames. With respect to the target moment, we label B, I, E, and O (BIEO) regions on the video (see Figure 3) and introduce label embeddings to model these regions.

Our contributions. In this research, we propose a Parallel Attention Network with Sequence matching (SeqPAN) for the video grounding task. We first design a self-guided parallel attention (SGPA) module to capture both self- and cross-modal attentive information for each modality simultaneously. In the SGPA module, a cross-gating strategy with a self-guided head is further used to fuse self- and cross-modal representations.
We then propose a sequence matching (sq-match) strategy to identify BIEO regions in a video. Label embeddings are incorporated to represent the label of frames in each region for region recognition. The sq-match strategy guides SeqPAN to search for the boundaries of the target moment within constrained regions, leading to more precise localization results. Experimental results on three benchmarks demonstrate that both SGPA and sq-match consistently improve the performance, and that SeqPAN surpasses the state-of-the-art methods.

Related Work
Existing solutions to video grounding are roughly categorized into proposal-based and proposal-free frameworks. In the proposal-based framework, common structures include ranking- and anchor-based methods. Ranking-based methods (Liu et al., 2018b; Hendricks et al., 2017, 2018; Chen and Jiang, 2019; Ge et al., 2019; Zhang et al., 2019b) solve this task with a two-stage propose-and-rank pipeline, which first generates proposals and then uses multimodal matching to retrieve the most similar proposal for a query. Anchor-based methods (Zhang et al., 2019c; Wang et al., 2020b) sequentially assign each frame multi-scale temporal anchors and select the anchor with the highest confidence as the result. However, these methods are sensitive to proposal quality, and comparing all proposal-query pairs is computationally expensive and inefficient.
The proposal-free framework includes regression- and span-based methods. Regression-based methods (Yuan et al., 2019b; Lu et al., 2019a; Chen et al., 2020a,b) tackle video grounding by learning cross-modal interactions between video and query, and directly regressing the temporal coordinates of target moments. Span-based methods (Ghosh et al., 2019; Rodriguez et al., 2020; Zhang et al., 2020a; Lei et al., 2020; Zhang et al., 2021) address video grounding by borrowing the concept of extractive question answering (Seo et al., 2017) and predicting the start and end boundaries of the target moment directly.
In addition, there are several works (He et al., 2019; Wang et al., 2019; Cao et al., 2020; Hahn et al., 2020; Wu et al., 2020a,b) that formulate this task as a sequential decision-making problem and adopt reinforcement learning to observe candidate moments conditioned on queries. Other methods, e.g., weakly supervised learning (Mithun et al., 2019; Lin et al., 2020; Wu et al., 2020a), a 2D map model of temporal relations between video moments (Zhang et al., 2020b), an ensemble of top-down and bottom-up methods (Wang et al., 2020a), and joint learning of video-level matching and moment-level localization (Shao et al., 2018), have also been explored. Some works (Shao et al., 2018; Cao et al., 2020; Wang et al., 2020a) use either additional resources/features or different evaluation metrics, so their results are not directly comparable with many others, including ours.

Proposed Method
Let $V=[f_t]_{t=0}^{T-1}$ be an untrimmed video with $T$ frames, and $Q=[q_j]_{j=0}^{M-1}$ be a language query with $M$ words; $t_s$ and $t_e$ denote the start and end time points of the ground-truth temporal moment. We define and tackle the video grounding task in feature spaces. Specifically, we split the given video $V$ into $N$ clip units and use a pre-trained feature extractor to encode them into visual features $V=[v_i]_{i=0}^{N-1}\in\mathbb{R}^{d_v\times N}$, where $d_v$ is the visual feature dimension. Then $t_{s(e)}$ are mapped to the corresponding indices $i_{s(e)}$ in the feature sequence, where $0\le i_s\le i_e\le N-1$. For the query $Q$, we encode the words with pre-trained word embeddings as $Q=[w_j]_{j=0}^{M-1}\in\mathbb{R}^{d_w\times M}$, where $d_w$ is the word dimension. Given the pair $(V, Q)$ as input, video grounding aims to localize the temporal moment starting at $i_s$ and ending at $i_e$.
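The time-to-index mapping above (and its inverse, used at inference as $\hat{t}_{s(e)}=\hat{i}_{s(e)}/(N-1)\times T$) can be sketched as follows; the function names are ours, not from the paper.

```python
def time_to_index(t, duration, num_units):
    """Map a timestamp in seconds to a clip-unit index in [0, num_units - 1]."""
    idx = round(t / duration * (num_units - 1))
    return max(0, min(num_units - 1, idx))

def index_to_time(i, duration, num_units):
    """Inverse mapping used at inference: clip-unit index back to seconds."""
    return i / (num_units - 1) * duration
```

Clamping keeps annotations that slightly exceed the video duration inside the valid index range.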

The SeqPAN Model
The overall architecture of the proposed SeqPAN model is shown in Figure 3. Next, we present each module of SeqPAN in detail.

Encoder Module
Given visual features $V\in\mathbb{R}^{d_v\times N}$ of the video and word embeddings $Q\in\mathbb{R}^{d_w\times M}$ of the language query, we map them into the same dimension $d$ with two FFNs, respectively. The encoder module encodes each modality separately. As position encoding offers a flexible way to embed a sequence when the sequence order matters, we first add a positional embedding to every input of both the video and query sequences. Then, we adopt a stacked 1D convolutional block to learn representations by carrying knowledge from neighboring tokens. The encoded representations are written as:

$\bar{V} = \mathrm{ConvBlock}\big(\mathrm{FFN}(V) + E_p\big), \qquad \bar{Q} = \mathrm{ConvBlock}\big(\mathrm{FFN}(Q) + E_p\big)$

where $\bar{V}\in\mathbb{R}^{d\times N}$ and $\bar{Q}\in\mathbb{R}^{d\times M}$; $E_p$ denotes the positional embeddings. Both the positional embeddings and the convolutional block are shared by the video and text features.
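The encoder flow above (FFN projection, positional embedding, 1D convolution over the sequence axis) can be sketched in NumPy. The depthwise convolution, kernel size, and all variable names here are our simplifying assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 8, 16  # hidden dimension, number of video clip units

def ffn(x, W, b):
    # single-layer feed-forward network: projects features to dimension d
    return W @ x + b

def conv1d(x, kernel):
    # depthwise 1D convolution over the sequence axis (kernel size 3, zero pad)
    dim, n = x.shape
    k = kernel.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros_like(x)
    for t in range(n):
        out[:, t] = np.sum(kernel * xp[:, t:t + k], axis=1)
    return out

V = rng.normal(size=(10, N))                      # raw visual features (d_v = 10)
W, b = rng.normal(size=(d, 10)), np.zeros((d, 1))  # projection to dimension d
E_p = rng.normal(size=(d, N))                      # learned positional embeddings
kernel = rng.normal(size=(d, 3))

V_enc = conv1d(ffn(V, W, b) + E_p, kernel)         # encoded video, shape (d, N)
```

The query branch would reuse the same `E_p` (truncated to length M) and the same convolutional kernel, matching the parameter sharing described above.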

Self-Guided Parallel Attention Module
A self-guided parallel attention (SGPA) module (see Figure 4) is proposed to improve multimodal representation learning. Compared with the standard transformer (TRM) encoder, SGPA uses two parallel multi-head attention blocks to learn both unimodal and cross-modal representations simultaneously, and merges them with a cross-gating strategy (a detailed comparison of SGPA and standard TRMs is summarized in the Appendix). Throughout this work, we denote a single-layer feed-forward network as FFN, i.e., $\mathrm{FFN}(X) = W\cdot X + b$.

(Figure 4: the SGPA block, consisting of two parallel multi-head attention layers, a cross-gating layer, and add & norm operations.)

Taking the video modality as an example, the attention process is computed as:

$V^{S} = \sigma_s\Big(\frac{Q_V K_V^{\top}}{\sqrt{d}}\Big)\, V_V, \qquad V^{C} = \sigma_s\Big(\frac{Q_V K_Q^{\top}}{\sqrt{d}}\Big)\, V_Q$

where $\sigma_s$ denotes the Softmax function; $Q_V$, $K_V$, and $V_V$ are the linear projections of $\bar{V}$; $Q_Q$, $K_Q$, and $V_Q$ are the linear projections of $\bar{Q}$; $V^{S}$ encodes the self-attentive contexts within the video modality; and $V^{C}$ integrates information from the query modality according to cross-modal attentive relations. The self- and cross-modal representations are then merged by a cross-gating strategy:

$\bar{V}^{G} = \sigma\big(\mathrm{FFN}(V^{C})\big)\odot V^{S} + \sigma\big(\mathrm{FFN}(V^{S})\big)\odot V^{C}$

where $\sigma$ denotes the Sigmoid function and $\odot$ represents the Hadamard product. The cross-gating explicitly interacts the features obtained from the self- and cross-attention encoders to ensure both are fully utilized, instead of relying on only one of them. Finally, we employ a self-guided head to implicitly emphasize the informative representations by measuring the confidence of each element in $\bar{V}^{G}$:

$\tilde{V} = \mathrm{LayerNorm}\big(\sigma(\mathrm{FFN}(\bar{V}^{G}))\odot \bar{V}^{G} + \bar{V}\big)$

The refined representations $\tilde{Q}$ for the query modality are obtained in a similar manner (i.e., by swapping the visual and query features).
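A minimal single-head NumPy sketch of the parallel attention and cross-gating idea follows. The projection matrices are collapsed to identities and the gating weights `Wg1`/`Wg2` stand in for the FFNs, so treat the exact forms and names as our illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, M = 8, 12, 6  # hidden dim, video length, query length

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(q, k, v):
    # scaled dot-product attention over column-wise (d x len) sequences
    scores = softmax(q.T @ k / np.sqrt(d), axis=-1)  # (len_q, len_k)
    return v @ scores.T                              # (d, len_q)

V = rng.normal(size=(d, N))   # encoded video features
Q = rng.normal(size=(d, M))   # encoded query features

# single-head stand-ins for the projections Q_V, K_V, V_V, K_Q, V_Q
V_self = attention(V, V, V)    # self-attentive video context  (V^S)
V_cross = attention(V, Q, Q)   # cross-modal video context     (V^C)

# cross-gating: each branch gates the other before they are merged
Wg1, Wg2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
V_merged = sigmoid(Wg1 @ V_cross) * V_self + sigmoid(Wg2 @ V_self) * V_cross
```

Swapping the roles of `V` and `Q` yields the query-side branch, mirroring how the module refines both modalities in parallel.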

Video-Query Integration Module
This module further enhances the cross-modal interactions between visual and textual features. It utilizes the context-query attention (CQA) strategy and aggregates text information for each visual element (see Figure 3). Given $\tilde{V}$ and $\tilde{Q}$, CQA first computes the similarities, $\mathcal{S}\in\mathbb{R}^{N\times M}$, between each pair of $\tilde{V}$ and $\tilde{Q}$ features. Then two attention weights are derived as $\mathcal{A}_{VQ}=\mathcal{S}_r\cdot\tilde{Q}$ and $\mathcal{A}_{QV}=\mathcal{S}_r\cdot\mathcal{S}_c\cdot\tilde{V}$, where $\mathcal{S}_r$/$\mathcal{S}_c$ are the row-/column-wise normalizations of $\mathcal{S}$ by Softmax. The query-aware video representation $V^{Q}\in\mathbb{R}^{d\times N}$ is computed by:

$V^{Q} = \mathrm{FFN}\big([\tilde{V};\ \mathcal{A}_{VQ};\ \tilde{V}\odot\mathcal{A}_{VQ};\ \tilde{V}\odot\mathcal{A}_{QV}]\big)$

Similarly, the video-aware query representation $Q^{V}\in\mathbb{R}^{d\times M}$ can be derived by swapping the visual and textual inputs in the CQA module. We then encode $Q^{V}$ into a sentence representation $q$ with additive attention (Bahdanau et al., 2015), and fuse $q$ with each element of $V^{Q}$ to obtain the query-attended visual representation $\tilde{H}$ (detailed computations are given in Appendix A.2).

Sequence Matching Module
As illustrated in Figure 3, we consider the frames within the ground-truth moment plus several neighboring frames as foreground, and the rest as background. We then split the foreground into Begin, Inside, and End regions. For simplicity, we assign each region a label, i.e., "B-M" for the begin, "I-M" for the inside, and "E-M" for the end region, and "O" for the background. B-M/E-M explicitly indicate the potential positions of the start/end boundaries. We also specify orthogonal label embeddings $E_{lab}\in\mathbb{R}^{d\times 4}$ to represent these labels, and to infuse label information into the visual features after region label prediction.
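The region labeling above can be sketched as a small label-generation routine. The paper does not specify the exact region widths here, so the `extend` parameter (how far the foreground and the begin/end regions stretch) is our assumption.

```python
def bieo_labels(num_units, i_s, i_e, extend=2):
    """Assign region labels per clip unit: 0=O, 1=B-M, 2=I-M, 3=E-M.

    The foreground covers the ground-truth span [i_s, i_e] plus `extend`
    neighboring units on each side (an assumed width); units near the start
    become the begin region, units near the end become the end region.
    """
    labels = [0] * num_units
    lo = max(0, i_s - extend)
    hi = min(num_units - 1, i_e + extend)
    for i in range(lo, hi + 1):
        labels[i] = 2                          # inside (foreground) by default
    for i in range(lo, min(i_s + extend, hi) + 1):
        labels[i] = 1                          # begin region around i_s
    for i in range(max(i_e - extend, lo), hi + 1):
        labels[i] = 3                          # end region around i_e
    return labels
```

For very short moments the begin and end regions can overlap; this sketch lets the end region take precedence, which is one of several reasonable conventions.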
Note that our approach is different from Lin et al. (2018) on the temporal action proposal generation task, where the target proposal is split into start, centre, and end regions. There, the probability of a frame belonging to each of the three regions is predicted separately in a regression manner, leading to three separate probability sequences, one per region; the maximum probabilities in these sequences are used to guide proposal generation. In contrast, we formulate the matching process as a multi-class classification problem and predict a concrete region label for each frame, i.e., the same as a sequence labeling task in NLP. Label embeddings are then assigned to the frames based on the predicted region labels.
A straightforward solution to predict the confidence of an element belonging to each region is a multi-class classifier:

$\mathcal{S}_{seq} = \sigma_s\big(\mathrm{FFN}(\tilde{H})\big)\in\mathbb{R}^{4\times N} \qquad (6)$

where $\mathcal{S}_{seq}$ encodes the probabilities of each visual element belonging to the different regions. Then the label index with the highest probability in $\mathcal{S}_{seq}$ is selected as the predicted label for each visual element:

$L_{lab} = \arg\max(\mathcal{S}_{seq}) \qquad (7)$

However, a major issue here is that Eq. 7 needs to sample from a discrete probability distribution, which makes the back-propagation of gradients through $\mathcal{S}_{seq}$ in Eq. 6 infeasible for the optimizer. To make back-propagation possible, we adopt the Gumbel-Max trick (Gumbel, 1954; Maddison et al., 2014) to re-formulate Eq. 7 as:

$\hat{L}_{lab} = \mathrm{Onehot}\big(\arg\max(\log(\mathcal{S}_{seq}) + g)\big) \qquad (8)$

where $\hat{L}_{lab}\in\mathbb{R}^{4\times N}$ and $g$ is i.i.d. Gumbel noise. Then, we utilize the Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) to relax the $\arg\max$ so as to make Eq. 8 differentiable (more details about the Gumbel tricks are in the Appendix). Formally, we use Eq. 9 to approximate Eq. 8:

$L_{lab} \approx \sigma_s\big((\log(\mathcal{S}_{seq}) + g)/\tau\big) \qquad (9)$

where $\tau$ is the annealing temperature. As $\tau\to 0^{+}$, $L_{lab}\approx\hat{L}_{lab}$; as $\tau\to\infty$, each element in $L_{lab}$ becomes the same and the approximated distribution becomes smooth. Note that we use Eq. 8 during the forward pass and Eq. 9 for the backward pass to allow gradient back-propagation. As a result, the embedding lookup process is differentiable and the label-attended visual representation is derived as:

$\tilde{H}^{lab} = \mathrm{FFN}\big([\tilde{H};\ E_{lab}\cdot\hat{L}_{lab}]\big) \qquad (10)$

The training objective is defined as:

$\mathcal{L}_{seq} = f_{XE}(\mathcal{S}_{seq}, Y_{lab}) + \big\lVert E_{lab}^{\top}E_{lab}\odot(\mathbf{1}-I)\big\rVert \qquad (11)$

where $f_{XE}$ denotes the cross-entropy function, $Y_{lab}$ denotes the ground-truth sequence labels, $\mathbf{1}$ is the matrix with all elements being 1, and $I$ is the identity matrix. The second term in Eq. 11 is the orthogonal regularization (Brock et al., 2019), which ensures that $E_{lab}$ keeps its orthogonality.
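The straight-through pattern described above (hard one-hot labels in the forward pass, soft relaxed samples for the backward pass) can be sketched in NumPy. Gradients are not modeled here; the sketch only illustrates the forward computation and the one-hot embedding lookup, with all names being our assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gumbel_softmax(logits, tau=0.3):
    """Forward: hard one-hot via Gumbel-Max; training surrogate: soft sample.

    logits: (num_labels, N) unnormalized region scores per visual unit.
    When inputs are unnormalized, the log() in Eqs. 8/9 is omitted.
    """
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    g = -np.log(-np.log(u))                                 # Gumbel(0, 1) noise
    soft = softmax((logits + g) / tau, axis=0)              # Eq. 9 relaxation
    hard = np.eye(logits.shape[0])[soft.argmax(axis=0)].T   # Eq. 8 one-hot
    return hard, soft

S_seq = rng.normal(size=(4, 10))     # scores for B/I/E/O over 10 units
hard, soft = gumbel_softmax(S_seq)

E_lab = rng.normal(size=(8, 4))      # label embeddings, hidden dim d = 8
label_features = E_lab @ hard        # embedding lookup via one-hot, (d, N)
```

In an autograd framework the hard sample would be wired as `hard = soft + stop_gradient(onehot - soft)` so gradients flow through the soft relaxation.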

Localization Module
Finally, we present a conditioned localizer to predict the start and end boundaries of the target moment. The localizer consists of two stacked transformer blocks and two FFNs. The scores of the start and end boundaries are calculated as:

$H^{s} = \mathrm{TRM}(\tilde{H}^{lab}), \qquad H^{e} = \mathrm{TRM}(H^{s})$

$S_{s} = W_{s}\cdot H^{s} + b_{s}, \qquad S_{e} = W_{e}\cdot H^{e} + b_{e}$

where $S_{s/e}\in\mathbb{R}^{N}$; $W_{s/e}$ and $b_{s/e}$ are the weights and biases of the start/end FFNs, respectively. Note that the representations of the end boundary ($H^{e}$) are conditioned on those of the start boundary ($H^{s}$) to ensure the predicted end boundary is always after the start boundary. Then the probability distributions of the start/end boundaries are computed by $P_{s/e} = \mathrm{Softmax}(S_{s/e})\in\mathbb{R}^{N}$. The training objective is:

$\mathcal{L}_{loc} = f_{XE}(P_{s}, Y_{s}) + f_{XE}(P_{e}, Y_{e})$

where $f_{XE}$ is the cross-entropy function and $Y_{s/e}$ are the one-hot labels for the start/end ($i_s$/$i_e$) boundaries.
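The boundary scoring and its cross-entropy objective can be sketched as follows; the transformer blocks are replaced by random stand-in features, so `H_s`, `H_e`, and the index `i_s = 3` are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(3)
d, N = 8, 10  # hidden dim, number of clip units

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

H_s = rng.normal(size=(d, N))   # start representations (stand-in for TRM output)
H_e = rng.normal(size=(d, N))   # end representations, conditioned on H_s in the model

W_s, b_s = rng.normal(size=(1, d)), 0.0
W_e, b_e = rng.normal(size=(1, d)), 0.0

S_s = (W_s @ H_s + b_s).ravel()     # start boundary scores, length N
S_e = (W_e @ H_e + b_e).ravel()     # end boundary scores, length N
P_s, P_e = softmax(S_s), softmax(S_e)

def cross_entropy(p, onehot):
    # f_XE between predicted distribution and a one-hot boundary label
    return -float(np.sum(onehot * np.log(p + 1e-12)))

Y_s = np.eye(N)[3]                  # assumed ground-truth start index i_s = 3
loss_s = cross_entropy(P_s, Y_s)
```

The full localization loss sums this term with the analogous end-boundary term.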

Training and Inference
The overall training loss of SeqPAN is $\mathcal{L} = \mathcal{L}_{loc} + \mathcal{L}_{seq}$, which is minimized during training. During inference, the predicted start and end boundaries of a given video-query pair are generated by maximizing the joint probability:

$(\hat{i}_s, \hat{i}_e) = \arg\max_{\hat{i}_s\le\hat{i}_e} P_s(\hat{i}_s)\times P_e(\hat{i}_e)$

where $\hat{i}_s$ and $\hat{i}_e$ are the best start and end boundaries of the predicted moment for the given video-query pair. Let $T$ be the duration of the given video; the predicted start/end times are computed by $\hat{t}_{s(e)} = \hat{i}_{s(e)}/(N-1)\times T$. With the predicted $(\hat{t}_s, \hat{t}_e)$ and ground-truth $(t_s, t_e)$ time intervals, the evaluation measure, temporal intersection over union (IoU), is computed as:

$\mathrm{IoU} = \dfrac{\min(\hat{t}_e, t_e) - \max(\hat{t}_s, t_s)}{\max(\hat{t}_e, t_e) - \min(\hat{t}_s, t_s)}$
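The constrained joint argmax and the IoU measure above can be sketched as two small functions (names are ours):

```python
import numpy as np

def locate(P_s, P_e):
    """Pick (i_s, i_e) maximizing P_s[i_s] * P_e[i_e] subject to i_s <= i_e."""
    joint = np.outer(P_s, P_e)            # joint[i_s, i_e]
    joint = np.triu(joint)                # zero out pairs with i_e < i_s
    i_s, i_e = np.unravel_index(joint.argmax(), joint.shape)
    return int(i_s), int(i_e)

def temporal_iou(pred, gt):
    """Temporal IoU between predicted and ground-truth (start, end) seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = max(pe, ge) - min(ps, gs)
    return inter / union if union > 0 else 0.0
```

The upper-triangular mask enforces the ordering constraint without enumerating pairs explicitly.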

Comparison with State-of-the-Arts
We compare SeqPAN with the following state-of-the-art methods, including TMLGA (Rodriguez et al., 2020), DRN (Zeng et al., 2020), TSP-PRL (Wu et al., 2020b), and 2D-TAN (Zhang et al., 2020b). The best results are in bold and the second best are in italic. In all result tables, the scores of compared methods are as reported in the corresponding works.
The results on Charades-STA are summarized in Table 1. SeqPAN surpasses all baselines and achieves the highest scores over all metrics. Observe that the performance improvements of SeqPAN are more significant under stricter metrics, which shows that SeqPAN produces more precise localization results. For instance, compared to LGI, SeqPAN achieves a 5.86% absolute improvement on "R@1, IoU=0.7", and 1.40% on "R@1, IoU=0.5". Table 2 reports the results on ANetCaps. SeqPAN is superior to the baselines and achieves the best performance on "R@1, IoU=0.7" and mean IoU. As reported in Table 3, similar observations hold on TACoS. Note that 2D-TAN (Zhang et al., 2020b) pre-processes the TACoS dataset, making it slightly different from the original one; we also conduct experiments on their version for a fair comparison. SeqPAN outperforms the baselines over all evaluation metrics on both versions.

Discussion and Analysis
We perform in-depth ablation studies to analyze the effectiveness of SeqPAN. We run all experiments 5 times and report the 5-run averages.

Analysis on Self-Guided Parallel Attention. The structure of the SGPA module is shown in Figure 4.

Impact of SGPA block numbers N_SGPA. We now study the impact of the number of SGPA blocks on Charades-STA and ANetCaps. We evaluate five different values of N_SGPA, from 1 to 5. The performance across the number of SGPA blocks in SeqPAN is plotted in Figures 5(a) and 5(b). The best performance is achieved at N_SGPA = 2 on both datasets. In general, as N_SGPA increases, the performance of SeqPAN first increases and then gradually decreases, on both datasets. We also note that performance on Charades-STA is not very sensitive to the setting of N_SGPA.

Analysis on Sequence Matching. The conventional matching strategy (Yuan et al., 2019b; Lu et al., 2019a; Mun et al., 2020) (denoted by fb-match) is to predict whether a frame is inside or outside of the target moment, i.e., foreground or background. In SeqPAN, we predict begin-, inside-, and end-regions, and introduce label embeddings (E_lab) to represent each region. The prediction process also uses the Gumbel-Max trick. In this experiment, we analyze the effects of the label embeddings and the Gumbel-Max trick in sequence matching. As summarized in Table 5, both the Gumbel-Max trick (denoted by G) and the label embeddings contribute to the grounding performance improvement. In addition, consistent improvements are observed by incorporating G and E_lab into the model. SeqPAN is superior to SeqPAN w/ fb-match over all evaluation metrics, and the performance improvements are more significant under stricter metrics. The results show that sq-match is more effective than the fb-match strategy.
Regional indication of the potential positions of start/end boundaries does help the model to produce accurate predictions.

Impact of Annealing Temperature τ. We then analyze the impact of the annealing temperature τ of the Gumbel-Softmax in the sequence matching module. Gumbel-Softmax distributions are identical to a categorical distribution when τ → 0+; as τ → ∞, the distribution becomes smooth. We evaluate 11 different τ values from 0.01 to 1.0, where 0.01 is used to approximate 0.0, since τ appears as a divisor and cannot be 0. The results are compared against vanilla Softmax as a baseline; for vanilla Softmax, we multiply the probability distribution of labels with E_lab to aggregate label information into the visual representations. Figure 6 plots the results for different τ on Charades-STA and ANetCaps, respectively. We observe similar patterns across the four sets of results. The best performance is achieved when τ = 0.3, over both metrics on both datasets. From Figure 6(a), when τ is too small or too large (i.e., the probability distribution from Gumbel-Softmax becomes too sharp or too smooth), Gumbel-Softmax performs worse than vanilla Softmax. This result suggests that a proper annealing temperature τ is crucial to achieve good performance. Similar observations hold on ANetCaps (see Figure 6(b)).

Under larger IoU ranges, e.g., IoU ≥ 0.5 on both datasets, SeqPAN and the variant with fb-match outperform the variant without sequence matching. The results show that having auxiliary objectives (e.g., foreground/background or sequential regions) is helpful in the video grounding task. The results in Figure 7 also show that our sequence matching is more effective than fb-match for highlighting the correct regions when predicting start/end boundaries. Figure 8 depicts two video grounding examples from the ANetCaps dataset. In both examples, the moments retrieved by SeqPAN are closer to the ground truth than those retrieved by SeqPAN without the sq-match strategy.
Besides, the start and end boundaries predicted by SeqPAN are roughly constrained within the pre-set potential start and end regions. In addition, the predicted sequence labels (PSL) in Figure 8 also reveal a weakness of the sq-match strategy: the predicted labels are not always continuous, so multiple begin, inside, and end regions may be generated. In consequence, the localizer may be affected by wrongly predicted regions, leading to inaccurate results. Further constraining the generated regions is part of our future work.

Conclusion
In this work, we propose a Parallel Attention Network with Sequence matching (SeqPAN) to address the language query-based video grounding problem. We design a parallel attention module to improve multimodal representation learning by capturing both self- and cross-modal attentive information simultaneously. In addition, we propose a sequence matching strategy, which explicitly indicates the potential start and end regions of the target moment, allowing the localizer to precisely predict the boundaries. Through extensive experimental studies, we show that SeqPAN outperforms the state-of-the-art methods on three benchmark datasets, and that both the proposed parallel attention and sequence matching modules contribute to the grounding performance improvement.

This appendix contains two sections. Section A provides (A.1) a detailed comparison between the proposed SGPA and standard transformer blocks, (A.2) technical details of the video-query integration module, and (A.3) the categorical reparameterization used in the sequence matching module. Section B describes statistics of the benchmark datasets and the parameter settings in our experiments.

A Additional Comparison and Technical Details
A.1 SGPA versus Standard Transformers

Two main ways are used to adopt the transformer block for multi-modal representation learning:
• Transformer block with self-attention (Se-TRM), which encodes visual and textual inputs in separate streams, shown in Figure 9(a).
• Transformer block with the cross-attention (Co-TRM), which encodes both visual and textual inputs with interactions through co-attention, shown in Figure 9(b).
Several works (Lu et al., 2019a; Chen et al., 2020a; Zhang et al., 2020a) adopt Se-TRM to learn visual and textual representations in the video grounding task. Se-TRM encodes each modality separately; it focuses on learning refined unimodal representations for video and text, respectively. Without any connection between the two modalities, Se-TRM cannot use information from the other modality to improve the representations.
Co-TRM is commonly used as a basic component in various vision-language methods (Tan and Bansal, 2019; Lu et al., 2019b; Lei et al., 2020). Co-TRM relies on co-attention to learn cross-modal representations for both visual and textual inputs. However, Co-TRM lacks the ability to encode self-attentive context within each modality.
The cascade of Se-TRM and Co-TRM is also used in recent vision-language models (Tan and Bansal, 2019; Lu et al., 2019b; Zhu and Yang, 2020; Lei et al., 2020) to learn both unimodal and cross-modal representations. In general, there are two cascade forms: 1) stacking Co-TRM upon Se-TRM (SeCo-TRM), shown in Figure 10(a); and 2) stacking Se-TRM upon Co-TRM (CoSe-TRM), shown in Figure 10(b). These stacked TRMs learn the unimodal and cross-modal information in a sequential manner. Hence, their final outputs focus more on either the self-attentive contexts or the cross-modal interactions. (Co-TRM is also known as the co-attentional, multi-modal, or cross-modal transformer block in different works.)

Our SGPA combines the advantages of both Se-TRM and Co-TRM, but not through cascading. As shown in Figure 9(c), SGPA contains two parallel multi-head attention blocks. One block takes a single modality as input and the other takes both modalities as inputs. Thus, SGPA is able to learn both unimodal and cross-modal representations simultaneously. Then, a cross-gating strategy is designed to fuse the self- and cross-attentive representations. We also employ a self-guided head to replace the feed-forward layer in the transformer block. This design implicitly emphasizes informative representations by measuring the confidence of each element. Table 6 reports the performance of SGPA and standard TRMs on the Charades-STA and ANetCaps datasets. Here, we regard both SeCo-TRM and CoSe-TRM as single blocks. The results show that both PA (a SGPA variant without the self-guided head) and SGPA are superior to standard TRMs.

A.2 Video-Query Integration Computation
This section presents the detailed computation process of video-query integration (see Section 3.1.3).
Given two inputs $X\in\mathbb{R}^{d\times N_x}$ and $Y\in\mathbb{R}^{d\times N_y}$, the context-query attention first computes the similarities between each pair of $X$ and $Y$ elements as:

$\mathcal{S} = X^{\top}\cdot W\cdot Y$

where $W\in\mathbb{R}^{d\times d}$ and $\mathcal{S}\in\mathbb{R}^{N_x\times N_y}$. Then the $X$-to-$Y$ and $Y$-to-$X$ attention weights are computed by:

$\mathcal{A}_{XY} = Y\cdot\mathcal{S}_r^{\top}, \qquad \mathcal{A}_{YX} = X\cdot(\mathcal{S}_r\cdot\mathcal{S}_c^{\top})^{\top}$

where $\mathcal{S}_r$ and $\mathcal{S}_c$ are the row- and column-wise normalizations of $\mathcal{S}$ by the Softmax function. The final output of context-query attention is calculated as:

$X^{Y} = \mathrm{FFN}\big([X;\ \mathcal{A}_{XY};\ X\odot\mathcal{A}_{XY};\ X\odot\mathcal{A}_{YX}]\big)$

where $\odot$ denotes element-wise multiplication, ";" represents the concatenation operation, and $X^{Y}\in\mathbb{R}^{d\times N_x}$. In this way, the information of $Y$ is properly fused into $X$.
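The CQA computation can be sketched in NumPy as follows. The transpose placements follow our reading of the formulas, so treat the exact index conventions, and all variable names, as assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d, Nx, Ny = 8, 10, 5  # hidden dim, lengths of the two sequences

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

X = rng.normal(size=(d, Nx))   # e.g. video features
Y = rng.normal(size=(d, Ny))   # e.g. query features
W = rng.normal(size=(d, d))

S = X.T @ W @ Y                 # pairwise similarities, shape (Nx, Ny)
S_r = softmax(S, axis=1)        # row-wise normalization (over Y positions)
S_c = softmax(S, axis=0)        # column-wise normalization (over X positions)

A_xy = Y @ S_r.T                # X-to-Y attention weights, shape (d, Nx)
A_yx = X @ (S_r @ S_c.T).T      # Y-to-X attention weights, shape (d, Nx)

W_o = rng.normal(size=(d, 4 * d))  # stand-in for the output FFN
fused = W_o @ np.concatenate([X, A_xy, X * A_xy, X * A_yx], axis=0)
```

Swapping `X` and `Y` produces the symmetric video-aware query representation.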
By setting $X=\tilde{V}\in\mathbb{R}^{d\times N}$ and $Y=\tilde{Q}\in\mathbb{R}^{d\times M}$, we can derive the query-aware video representation $V^{Q}\in\mathbb{R}^{d\times N}$. Similarly, the video-aware query representation $Q^{V}\in\mathbb{R}^{d\times M}$ is obtained by setting $X=\tilde{Q}$ and $Y=\tilde{V}$.
We then encode $Q^{V}$ into a sentence representation $q$ with additive attention:

$\alpha = \sigma_s\big(W_{\alpha}\cdot Q^{V}\big), \qquad q = \sum\nolimits_{j=0}^{M-1}\alpha_j\cdot Q^{V}_{j}$

where $W_{\alpha}\in\mathbb{R}^{1\times d}$. The $q$ is then concatenated with each element of $V^{Q}$. Finally, the query-attended visual representation is computed as:

$\tilde{H} = W_h\cdot[V^{Q};\ q] + b_h$

where $q$ is tiled along the sequence axis, $W_h\in\mathbb{R}^{d\times 2d}$ and $b_h\in\mathbb{R}^{d}$ denote the learnable weight and bias, and $\tilde{H}\in\mathbb{R}^{d\times N}$.

A.3 Categorical Reparameterization
This section provides a brief introduction of the categorical reparameterization strategy used in the sequence matching module (see Section 3.1.4). Categorical reparameterization, e.g., reinforce-based approaches (Sutton et al., 2000; Schulman et al., 2015), straight-through estimators (Bengio et al., 2013), and Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017), is a strategy that enables discrete categorical variables to back-propagate in neural networks. It aims to estimate a smooth gradient with a continuous relaxation for the categorical variable. In this work, we use Gumbel-Softmax to approximate the sequence labels from a probability distribution. Those labels are then applied to look up the corresponding embeddings for region representation in the sequence matching module of SeqPAN.

Table 7: Statistics of the evaluated video grounding benchmark datasets, where NV is the number of videos, NA is the number of annotations, NA/V denotes the average number of annotations per video, NVocab is the vocabulary size of lowercase words, LV denotes the average length of videos in seconds, LQ denotes the average number of words in the sentence queries, and LM represents the average length of temporal moments in seconds.
Let $x = (x_1, \ldots, x_l)$ be a categorical distribution, where $l$ is the number of categories, $x_c$ is the probability score of category $c$, and $\sum_{c=1}^{l}x_c = 1$. Given i.i.d. Gumbel noise $g = (g_1, \ldots, g_l)$ drawn from the Gumbel(0, 1) distribution (which can be sampled using inverse transform sampling by drawing $u\sim\mathrm{Uniform}(0,1)$ and computing $g = -\log(-\log(u))$ (Jang et al., 2017)), the soft categorical sample can be computed as:

$y = \sigma_s\big((\log(x) + g)/\tau\big) \qquad (21)$

where $\tau > 0$ is the annealing temperature; Eq. 21 is referred to as the Gumbel-Softmax operation on $x$.
As $\tau\to 0^{+}$, $y$ is equivalent to the Gumbel-Max form (Gumbel, 1954; Maddison et al., 2014):

$\hat{y} = \mathrm{Onehot}\big(\arg\max\nolimits_c(\log(x_c) + g_c)\big) \qquad (22)$

where $\hat{y}$ is an unbiased sample from $x$, and thus we can draw differentiable samples from the distribution during training. Note that when the input $x$ is unnormalized, the $\log(\cdot)$ operator in Eqs. 21 and 22 shall be omitted (Jang et al., 2017; Dong and Yang, 2019). During inference, discrete samples can be drawn with the Gumbel-Max trick directly.
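The inverse transform sampling and the Gumbel-Max draw can be sketched directly (function names are ours); repeated draws recover the underlying categorical distribution:

```python
import numpy as np

rng = np.random.default_rng(5)

def gumbel_noise(shape):
    # inverse transform sampling: u ~ Uniform(0, 1), g = -log(-log(u))
    u = rng.uniform(low=1e-12, high=1.0, size=shape)
    return -np.log(-np.log(u))

def gumbel_max_sample(probs):
    # draw one discrete sample from a categorical distribution (Eq. 22)
    return int(np.argmax(np.log(probs) + gumbel_noise(probs.shape)))

# empirical check: sample frequencies should roughly track the probabilities
p = np.array([0.7, 0.2, 0.1])
counts = np.zeros(3)
for _ in range(2000):
    counts[gumbel_max_sample(p)] += 1
```

With 2000 draws, category 0 should dominate the counts, mirroring its 0.7 probability.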

B.1 Dataset Statistics
The statistics of the evaluated benchmark datasets are summarized in Table 7. The Charades-STA dataset consists of 6,672 videos and 16,128 annotations (i.e., moment-query pairs) in total. The ActivityNet Captions (ANetCaps) dataset is built on ActivityNet (Heilbron et al., 2015); the average video duration is about 120 seconds and each video contains 3.68 annotations on average. The TACoS dataset contains 127 cooking-activity videos with an average duration of 4.79 minutes, and 18,818 annotations in total. We follow the same train/val/test split as Gao et al. (2017). Besides, Zhang et al. (2020b) pre-process the TACoS dataset, hence their version is slightly different from the original one.

B.2 Hyper-Parameter Settings
We follow (Ghosh et al., 2019; Mun et al., 2020; Rodriguez et al., 2020; Zhang et al., 2020a) and use a 3D ConvNet pre-trained on the Kinetics dataset (i.e., I3D) (Carreira and Zisserman, 2017) to extract visual features from videos. The maximal visual feature sequence lengths are set to 64, 100, and 256 for Charades-STA, ActivityNet Captions, and TACoS, respectively. This setting is based on the average video lengths in the three datasets. The feature sequence of a video is uniformly downsampled if it is longer than the pre-set threshold, and zero-padded otherwise. For the language queries, we lowercase all words and initialize them with GloVe (Pennington et al., 2014) embeddings. The word embeddings and extracted visual features are fixed during training.
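The downsample-or-pad preprocessing described above can be sketched as a single helper (the name and the rounding choice are our assumptions):

```python
import numpy as np

def fit_length(feats, max_len):
    """Uniformly downsample (if too long) or zero-pad (if too short) a
    (T, d) visual feature sequence to exactly max_len units."""
    T, d = feats.shape
    if T > max_len:
        # pick max_len indices evenly spread over [0, T - 1]
        idx = np.linspace(0, T - 1, num=max_len).round().astype(int)
        return feats[idx]
    out = np.zeros((max_len, d), dtype=feats.dtype)
    out[:T] = feats            # keep original features, pad the tail with zeros
    return out
```

A mask over the padded positions would normally accompany the padded sequence so that attention ignores the zero tail.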
For other hyper-parameters, we use the same settings for all datasets. The dimension of the hidden layers is 128; the number of heads in multi-head attention is 8; the number of SGPA blocks (N_SGPA) is 2; the annealing temperature τ of the Gumbel-Softmax is 0.3; and the Dropout (Srivastava et al., 2014) rate is 0.2. The maximal number of training epochs is E = 100, with a batch size of 16 and an early stopping tolerance of 10 epochs. We adopt the Adam (Kingma and Ba, 2015) optimizer, with an initial learning rate of β_0 = 0.0001, weight decay of 0.01, and gradient clipping of 1.0, to train the model. The learning rate decay strategy is defined as $\beta_e = \beta_0\times(1 - e/E)$, where $e$ denotes the $e$-th training epoch.
All the experiments are conducted on a workstation with dual NVIDIA GeForce RTX 2080Ti GPUs.