Relation-aware Video Reading Comprehension for Temporal Language Grounding

Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. This framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced that leverages graph convolution to capture the dependencies among video moment choices for selecting the best choice. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code is available at https://github.com/Huntersxsx/RaNet.


Introduction
Recently, temporal language grounding in videos has become a heated topic in the computer vision and natural language processing communities (Gao et al., 2017; Krishna et al., 2017). This task requires a machine to localize a temporal moment semantically relevant to a given language query, as shown in Fig.1. It has also drawn great attention from industry due to its various applications such as video question answering (Lei et al., 2018), video content retrieval (Shao et al., 2018), and human-computer interaction.
A straightforward paradigm for this task is the proposing-and-ranking pipeline (Zhang et al., 2020b, 2019a): first generate a number of video moment candidates and then rank them according to moment-query similarities. This requires a solution to achieve two key targets simultaneously: (1) semantic visual-language interaction and (2) reliable candidate ranking. The former ensures satisfying cross-modal matching between video moments and the query, while the latter guarantees the distinction among candidates. For the first target, some previous works (Zhang et al., 2020b; Chen and Jiang, 2019) resort to visual clues by modeling moment-sentence or snippet-sentence relations. However, they overlook the linguistic clues at the token level, i.e., token-moment relations, which contain fine-grained linguistic information. For the second target, previous solutions (Ge et al., 2019; Liu et al., 2018; Zhang et al., 2020a) generate ranking scores by considering different moment candidates separately or by constructing moment-level relations in a simple way (Zhang et al., 2020b). Hence, they neglect the temporal and semantic dependencies among candidates. Without this information, it is difficult for previous approaches to correctly distinguish visually similar moment candidates.

* Jialin Gao and Xin Sun are co-first authors with equal contributions, supervised by Prof. Xi Zhou in SJTU.

Figure 1: An illustration of temporal language grounding in videos based on the relation-aware network. Given a video and a query sentence, our approach aims to semantically align the query representation with a predefined answer set of video moment candidates (a_1, a_2, a_3 and a_4) and then mine the relationships between them to select the best-matched one.
To this end, we propose a Relation-aware Network (RaNet) to address temporal language grounding. In our solution, we formulate this task as video reading comprehension by regarding the video, the query, and the moment candidates as the text passage, the question description, and the multi-choice options, respectively. Unlike previous methods, we exploit a coarse-and-fine interaction, which captures not only sentence-moment relations but also token-moment relations. This interaction allows our model to construct both sentence-aware and token-aware visual representations for each choice, which helps distinguish similar candidates in the visual modality. Moreover, we leverage Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016) to mine the moment-moment relations between candidate choices based on their coarse-and-fine representations. With the information exchange in GCNs, our RaNet can learn discriminative features for correctly ranking candidates despite their high relevance in visual content. Similar to a multi-choice reading comprehension system, our RaNet consists of five components: a modality-wise encoder for visual and textual encoding, a multi-choice generator for answer set generation, a choice-query interactor for cross-modal interaction, a multi-choice relation constructor for relationship mining, and an answer ranker for selecting the best-matched choice. Our contributions are summarized as three-fold: (1) We address temporal language grounding with a Relation-aware Network, which formulates this task as a video reading comprehension problem.
(2) We exhaustively exploit the visual and linguistic clues, i.e., coarse-and-fine moment-query relations and moment-moment relations, to learn discriminative representations for distinguishing candidates. (3) The proposed RaNet outperforms other state-of-the-art methods on three widely-used challenging benchmarks: TACoS, Charades-STA and ActivityNet-Captions, improving the grounding performance by a large margin (i.e., 33.54% vs. 25.32% of 2D-TAN on the TACoS dataset).

Related Work
Temporal Language Grounding. This task was introduced by (Anne Hendricks et al., 2017; Gao et al., 2017) to locate relevant moments given a language query. He et al. and others used reinforcement learning to solve this problem. Chen et al. and Ghosh et al. (2019) proposed to select the boundary frames based on visual-language interaction. Most recent works (Zhang et al., 2020b; Chen and Jiang, 2019; Zhang et al., 2019a) adopted the two-step pipeline to solve this problem. Visual-Language Interaction. It is vital for this task to semantically match query sentences and videos. This cross-modal alignment is usually achieved by attention mechanisms (Vaswani et al., 2017) and sequential modeling (Hochreiter and Schmidhuber, 1997; Medsker and Jain, 2001). Liu et al. (2018) designed soft-attention modules, while Hendricks et al. (2017) and Zhang et al. (2019b) chose the hard counterpart. Some works (Ghosh et al., 2019) exploited the properties of RNN cells, and others went beyond them with dynamic filters (Zhang et al., 2019a), Hadamard product (Zhang et al., 2020b), QANet (Lu et al., 2019) and circular matrices (Wu and Han, 2018). However, these alignments neglect the importance of token-aware visual features for cross-modal correlation and for distinguishing similar candidates. Machine Reading Comprehension. Given a reference document or passage, Machine Reading Comprehension (MRC) requires the machine to answer questions about it (Zhang et al., 2020c). Two types of existing MRC variants relate to temporal language grounding in videos, i.e., span extraction and multi-choice. The former (Rajpurkar et al., 2016) extracts spans from the given passage and has been explored in the temporal language grounding task by previous works (Zhang et al., 2020a; Lu et al., 2019; Ghosh et al., 2019).
The latter (Lai et al., 2017; Sun et al., 2019) aims to find the only correct option among the given candidate choices based on the given passage. We propose to formulate this task from the perspective of multi-choice reading comprehension. Based on this formulation, we focus on visual-language alignment at the token-moment level. Compared with the query-aware context representation in previous solutions, we aim to construct a token-aware visual feature for each choice. Inspired by recent advanced attention modules (Gao et al., 2020), we mine the relations between choices in an effective and efficient way.

Figure 2: The overview of our proposed RaNet. It consists of a modality-wise encoder, a multi-choice generator, a choice-query interactor, a multi-choice relation constructor, and an answer ranker. Video and language passages are first embedded in separate branches. Then we initialize the visual representation for each choice (t^s_i, t^e_i) from the video stream. Through the choice-query interactor, each choice captures sentence-aware and token-aware representations from the query. Afterwards, the relation constructor takes advantage of GCNs to model relationships between choices. Finally, the answer ranker evaluates the probability of being selected for each choice based on the exchanged information from the former module.

Methodology
In this section, we first describe how to recast the temporal language grounding from the perspective of a multi-choice reading comprehension task, which is solved by the proposed Relation-aware Network (RaNet). Then, we introduce the detailed architecture of the RaNet, consisting of five components as shown in Fig.2. Finally, we illustrate the training and inference of our solution.

Problem Definition
The goal of this task is to localize the video moment semantically corresponding to a given language query in an untrimmed video. Referring to the forms of MRC, we treat the video V as a text passage, the query sentence Q as a question description, and provide a set of video moment candidates as a list of answer options A. Based on the given triplet (V, Q, A), temporal language grounding in videos is equivalent to cross-modal MRC, termed video reading comprehension.
For each query-video pair, we have one natural language sentence and an associated ground-truth video moment with start g^s and end g^e boundaries. Each language sentence is represented as Q = {q_1, q_2, ..., q_L}, where L is the number of tokens. The untrimmed video is represented as a sequence of snippets V = {v_1, v_2, ..., v_{n_v}} ∈ R^{n_v×C} by a pretrained video understanding network, such as C3D (Tran et al., 2015) or I3D (Carreira and Zisserman, 2017).
In temporal language grounding, the answer should be a consecutive subsequence (namely a time span) of the video passage. Any video moment candidate (i, j) can be treated as a possible answer if it meets the condition 0 < i < j < n_v. Hence, we follow the fixed-interval sampling strategy in previous work (Zhang et al., 2020b) and construct a set of video moment candidates as the answer list A = {a_1, ..., a_N} with N valid candidates. With these notations, we can recast the temporal language grounding task from the perspective of multi-choice reading comprehension as selecting

a* = arg max_{a_i ∈ A} P(a_i | V, Q).

However, different from traditional multi-choice reading comprehension, previous solutions in temporal language grounding also compare their performance in terms of the top-K most matching candidates for each query sentence. For a fair comparison, our approach is thus required to score K candidates {(p_i, t^s_i, t^e_i)}_{i=1}^{K}, where p_i, t^s_i and t^e_i represent the probability of selection and the start and end time of answer a_i, respectively. Without additional mention, video moment and answer/choice are interchangeable in this paper.
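The enumeration of valid candidates can be sketched as below. This is a simplified dense enumeration (the fixed-interval sampling of Zhang et al., 2020b additionally subsamples longer moments), and the function name is ours:

```python
def generate_choices(num_snippets):
    """Enumerate valid moment candidates (i, j) with i < j.

    Rows of the 2D choice map index start snippets, columns index
    end snippets; only the upper-triangular region is valid.
    """
    return [(i, j)
            for i in range(num_snippets)
            for j in range(i + 1, num_snippets)]
```

For num_snippets = 4 this yields the 6 candidates (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).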

Architecture
As shown in Figure 2, we describe the details of each component in our framework as follows: Modality-wise Encoder. This module aims to separately encode the content of the language query sentence and the video. Each branch aggregates the intra-modality context for each snippet and token.

Figure 3: (a) Illustration of the choice generator and feature initialization. Each block (i, j) is a valid choice when i < j, denoted in blue. Ψ combines the boundary features for the feature initialization of (1, 3), the dark blue square. (b) An example of all graph edges connected to one choice in our Multi-Choice Relation Constructor. Moments with the same start or end index (dark green) are connected with the illustrative choice (red). (c) The information propagation between two unconnected moment choices. For other moments (dark green) that are not connected with the target moment (red) but have overlaps, relations can be implicitly captured with two loops, namely 2 graph attention layers.
· Video Encoding. We first apply a simple temporal 1D convolution to map the input feature sequence to the desired dimension, followed by an average pooling layer to reshape the sequence to a desired length T. To enrich the multi-hop interaction, we use a graph convolution block called the GC-NeXt block, which aggregates the context from both temporal and semantic neighbors of each snippet v_i and has proved effective in the Temporal Action Localization task. Finally, we get the encoded visual feature V̂ ∈ R^{C×T}. · Language Encoding. Each word q_i of Q is represented with the embedding vector from GloVe 6B 300d (Pennington et al., 2014) to get Q ∈ R^{300×L}. Then we sequentially feed the initialized embeddings into a three-layer Bi-LSTM network to capture semantic information and temporal context. We take the last layer's hidden states as the language representation Q̂ ∈ R^{C×L} for cross-modality fusion with the video representation V̂. In addition, the effect of different word embeddings is compared in 4.4.4.
The encoded visual and textual features can be formulated as

V̂ = Enc_v(V) ∈ R^{C×T}, Q̂ = Enc_q(Q) ∈ R^{C×L},

where Enc_v and Enc_q denote the video and language branches described above. Multi-Choice Generator. As shown in Fig.3 (a), the vertical and horizontal axes represent the start and end indices of the visual sequence. Blocks in the same row share the same start index, and those in the same column share the same end index. The white blocks in the left-bottom indicate all the invalid choices, where the start boundaries exceed the end boundaries. Therefore, we have the multi-choice answer set A. To capture the visual-language interaction, we initialize the visual feature for the answer set A so that it can be integrated with the textual features from the language encoder. To ensure boundary awareness, the initialization method Ψ combines boundary information, i.e., v̂_{t^s_i} and v̂_{t^e_i} in V̂, to construct the moment-level feature representation for each choice a_i. The initialized feature representation can be written as

F_{a_i} = Ψ(v̂_{t^s_i}, v̂_{t^e_i}), ∀ a_i ∈ A,

where Ψ is the concatenation of v̂_{t^s_i} and v̂_{t^e_i}, A is the answer set, and F_A ∈ R^{C×N}. We also explore the effect of different choices of Ψ on grounding performance in 4.4.3. Choice-Query Interactor. As shown in Figure 2, this module explores the inter-modality context for visual-language interaction. Unlike previous methods (Zhang et al., 2020b,a), we propose a coarse-and-fine cross-modal interaction. We integrate the initialized features F_A with the query at both the sentence level and the token level. The former can be obtained by a simple Hadamard product and a normalization as

F_1 = (F_A ⊙ φ(Q̂)) / ||F_A ⊙ φ(Q̂)||_F,

where φ is the aggregation function for a global representation of Q̂ (we set it as max-pooling), ⊙ is element-wise multiplication, and ||·||_F indicates the Frobenius normalization.
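As a concrete illustration, the coarse sentence-level fusion can be sketched as follows. The use of PyTorch, the tensor shapes, and the function name are our assumptions; batch dimensions are omitted:

```python
import torch

def sentence_level_fusion(f_a, q):
    """Coarse interaction: fuse choice features with a pooled query vector.

    f_a: (C, N) initialized choice features; q: (C, L) token features.
    phi is max-pooling over tokens; the result is Frobenius-normalized.
    """
    q_global = q.max(dim=1).values          # phi(Q): (C,) global query vector
    fused = f_a * q_global.unsqueeze(1)     # Hadamard product, broadcast over N
    return fused / fused.norm(p='fro')      # Frobenius normalization
```

After this step every choice carries a sentence-aware signal, at the cost of collapsing the query into a single vector; the token-level branch below recovers the fine-grained information.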
To ensure a token-aware visual feature for each choice a_i, we adopt an attention mechanism to learn the token-moment relation between each choice and the query. First, we apply a 1D convolution layer to project the visual and textual features into a common space and then calculate their semantic similarities, which depict the relationships R ∈ R^{N×L} between the N candidates and L tokens. Second, we generate a query-related feature for each candidate based on the relationships R. Finally, we integrate these two features of the candidates into the token-aware visual representation.
These steps can be formulated as

R = (W_v F_A)^T ⊗ (W_q Q̂), F_2 = F_A ⊙ (Q̂ ⊗ softmax(R)^T),

where ^T denotes the matrix transpose, ⊙ and ⊗ are element-wise and matrix multiplications, respectively, and W_v, W_q are the 1D convolution projections. We add the sentence-aware feature F_1 and the token-aware feature F_2 to obtain the output of this module, F̂_A = F_1 + F_2.
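The token-aware interaction (projection, token-moment similarities, softmax over tokens, and fusion) can be sketched as a small PyTorch module. The exact projections and the multiplicative fusion are our assumptions; batch dimension B is included for clarity:

```python
import torch
import torch.nn as nn

class TokenAwareInteractor(nn.Module):
    """Sketch of the fine-grained choice-query interaction."""

    def __init__(self, c):
        super().__init__()
        self.proj_v = nn.Conv1d(c, c, kernel_size=1)  # W_v: project choices
        self.proj_q = nn.Conv1d(c, c, kernel_size=1)  # W_q: project tokens

    def forward(self, f_a, q):
        # f_a: (B, C, N) choice features; q: (B, C, L) token features
        r = torch.bmm(self.proj_v(f_a).transpose(1, 2),
                      self.proj_q(q))                  # R: (B, N, L)
        attn = r.softmax(dim=-1)                       # weights over tokens
        q_ctx = torch.bmm(q, attn.transpose(1, 2))     # (B, C, N) query-related
        return f_a * q_ctx                             # token-aware F_2
```

Each candidate thus attends to the query tokens most relevant to it, rather than to a single pooled sentence vector.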
Multi-choice Relation Constructor. To explore the relations among the choices, this module aggregates information from overlapping moment candidates via GCNs. Previous methods MAN (Zhang et al., 2019a) and 2D-TAN (Zhang et al., 2020b) also considered moment-wise temporal relations, but both have two drawbacks: expensive computation and noise from unnecessary relations. Inspired by CCNet, which proposed a sparsely-connected graph attention module to collect contextual information in horizontal and vertical directions, we propose a Graph ATtention (GAT) layer that constrains the relations to moment candidates with high temporal overlap. Concretely, we take each answer candidate a_i = (t^s_i, t^e_i) as a graph node, and a graph edge connects two candidate choices a_i, a_j if they share the same start or end time spot, i.e., t^s_i = t^s_j or t^e_i = t^e_j. An example is shown in Figure 3 (b), where the neighbors of the target moment choice (the red one) are denoted in dark green in a criss-cross shape. As shown in Figure 3 (c), our model can also achieve information propagation between two unconnected moment choices. For other moments (dark green) that are not connected with the target moment (red) but have overlaps, their relations can be implicitly captured with two loops, namely two graph attention layers. We guarantee message passing between the dark green moment and the cyan moments in the first loop. Then, in the second loop, we construct relations between the cyan moments and the target moment, so the information from the dark green moment is finally propagated to the red moment.
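The criss-cross connectivity can be sketched as an adjacency construction over the answer set (a simplified illustration; the function name is ours):

```python
def crisscross_adjacency(choices):
    """Build a binary adjacency matrix over moment candidates.

    Two choices are connected iff they share a start or an end
    index; the diagonal is left at zero.
    """
    n = len(choices)
    adj = [[0.0] * n for _ in range(n)]
    for i, (si, ei) in enumerate(choices):
        for j, (sj, ej) in enumerate(choices):
            if i != j and (si == sj or ei == ej):
                adj[i][j] = 1.0
    return adj
```

Because each node only connects to candidates in its row and column of the 2D choice map, the graph has roughly 2TN edges instead of the N^2 edges of a fully-connected graph.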
Given the choice-query representation F̂_A ∈ R^{C×N}, there are N nodes and approximately 2TN edges in the graph. A GAT layer is applied on the graph: for each moment, we compute attention weights of its neighbours in a criss-cross path and average the features with the weights. The output of the GAT layer can be formulated as

F*_A = F̂_A ⊗ softmax(Â ⊙ (F̂_A^T ⊗ F̂_A)),

where Â is the adjacency matrix of the graph that determines the connections between two moment choices, defined by the predefined answer set A. Answer Ranker. Having captured the relationships among the choices with GCNs, we adopt an answer ranker to predict the ranking score of each answer candidate a_i for selecting the best-matched one. The ranker takes the query-aware feature F̂_A and the relation-aware feature F*_A as input and concatenates them (denoted as ∥) to aggregate more contextual information. After that, we employ a convolution layer to generate the probability P_A of each a_i in the predefined answer set A being selected. The output can be computed as

P_A = σ(Conv(F̂_A ∥ F*_A)),

where σ represents the sigmoid activation function.
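A minimal sketch of one such attention step, assuming dot-product similarities masked by the adjacency matrix (the paper's exact attention parameterization may differ):

```python
import torch

def gat_layer(f, adj):
    """One criss-cross graph-attention step (sketch).

    f: (C, N) choice features; adj: (N, N) binary adjacency with a
    zero diagonal. Attention is restricted to connected neighbours.
    """
    scores = f.t() @ f                                   # pairwise similarities
    scores = scores.masked_fill(adj == 0, float('-inf'))  # keep graph edges only
    w = torch.nan_to_num(scores.softmax(dim=-1))         # edge-less rows -> 0
    return f @ w.t()                                     # weighted neighbour avg
```

Stacking two such layers realizes the two-loop propagation of Figure 3 (c); the answer ranker then concatenates the input and output features and applies a convolution with a sigmoid.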

Training and Inference
Following (Zhang et al., 2020b), we first calculate the Intersection-over-Union (IoU) between the answer set A and the ground-truth annotation (g^s, g^e) and then rescale the IoUs by two thresholds θ_min and θ_max:

g_i = 0 if θ_i ≤ θ_min; (θ_i − θ_min) / (θ_max − θ_min) if θ_min < θ_i < θ_max; 1 if θ_i ≥ θ_max,

where g_i and θ_i are the supervision label and the corresponding IoU between a_i and the ground truth, respectively. Hence, the total training loss function of our RaNet is a binary cross-entropy over all choices:

L = − Σ_{i=1}^{N} ( g_i log p_i + (1 − g_i) log(1 − p_i) ),

where p_i is the output score in P_A for the answer choice a_i and N is the number of choices. In the inference stage, we rank all the answer options in A according to the probabilities in P_A.
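The label rescaling and loss can be sketched as follows (PyTorch; function names are ours):

```python
import torch

def scaled_iou_labels(ious, t_min, t_max):
    """Rescale raw IoUs theta_i into supervision labels g_i in [0, 1].

    IoUs at or below t_min map to 0, at or above t_max map to 1,
    and values in between are linearly interpolated.
    """
    return ((ious - t_min) / (t_max - t_min)).clamp(0.0, 1.0)

def ranking_loss(p, g):
    """Binary cross-entropy between predicted scores and scaled labels."""
    return -(g * p.log() + (1 - g) * (1 - p).log()).sum()
```

With TACoS thresholds (0.3, 0.7), for example, an IoU of 0.5 yields a soft label of 0.5 rather than a hard positive/negative assignment.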

Experiments
To evaluate the effectiveness of the proposed approach, we conduct extensive experiments on three public challenging datasets: TACoS (Regneri et al., 2013), ActivityNet Captions (Krishna et al., 2017), and Charades-STA (Gao et al., 2017).

Dataset
TACoS. It consists of 127 videos containing different activities that happen in the kitchen. We follow the convention in (Gao et al., 2017), where the training, validation, and testing sets contain 10,146, 4,589, and 4,083 query-video pairs, respectively. Charades-STA. It is extended by (Gao et al., 2017) with language descriptions, leading to 12,408 and 3,720 query-video pairs for training and testing. ActivityNet Captions. It was recently introduced into the temporal language grounding task. Following the setting in CMIN, we use val_1 as the validation set and val_2 as the testing set, with 37,417, 17,505, and 17,031 query-video pairs for training, validation, and testing, respectively.

Implementation Details
Evaluation metric. Following Gao et al. (2017), we compute Rank k@µ for a fair comparison. It denotes the percentage of testing samples that have at least one correct answer among the top-k choices. A selected choice a_i is correct when its IoU θ_i with the ground truth is larger than the threshold µ; otherwise, the choice is wrong. Specifically, we set k ∈ {1, 5} and µ ∈ {0.3, 0.5} for TACoS and µ ∈ {0.5, 0.7} for the other two datasets. Feature Extractor. We follow (Zhang et al., 2019a) and adopt the same extractors, i.e., VGG (Simonyan and Zisserman, 2014) features for Charades-STA and C3D (Tran et al., 2015) for the other two. We also use I3D (Carreira and Zisserman, 2017) features to compare with (Ghosh et al., 2019; Zhang et al., 2020a) on Charades-STA. For word embedding, we use the pre-trained GloVe 6B 300d (Pennington et al., 2014) as in previous solutions (Ge et al., 2019). Architecture settings. In all experiments, we set the hidden units of the Bi-LSTM to 256, and the number of reshaped snippets T is 128 for TACoS, 64 for ActivityNet Captions and 16 for Charades-STA. The channel dimension C is 512. We adopt 2 GAT layers for all benchmarks, and position embedding is used on ActivityNet Captions. Training settings. We adopt the Adam optimizer with a learning rate of 1 × 10^-3, a batch size of 32, and 15 training epochs. Following (Zhang et al., 2020b), the thresholds θ_min and θ_max are set to 0.5 and 1.0 for Charades-STA and ActivityNet Captions, and 0.3 and 0.7 for TACoS.
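The Rank k@µ metric can be sketched as follows, assuming each query's candidate IoUs are already sorted by the model's ranking score, best first (the function name is ours):

```python
def rank_k_at_mu(ious_per_query, k, mu):
    """Percentage of queries whose top-k choices contain an IoU > mu.

    ious_per_query: list of per-query IoU lists, sorted by predicted
    score in descending order.
    """
    hits = sum(1 for ious in ious_per_query
               if any(iou > mu for iou in ious[:k]))
    return 100.0 * hits / len(ious_per_query)
```

For instance, with three queries whose sorted IoUs are [0.8, 0.1], [0.2, 0.6] and [0.1, 0.1], Rank 1@0.5 is 33.3% and Rank 2@0.5 is 66.7%.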

Comparison with State-of-the-arts
Our RaNet is compared with recently published state-of-the-art methods, including VSLNet (Zhang et al., 2020a) and 2D-TAN (Zhang et al., 2020b). TACoS. Table 1 summarizes the performance comparison of different methods on the test split. Our RaNet outperforms all competing methods by clear margins and reports the highest scores for all IoU thresholds. Compared with the previous best method, 2D-TAN, our model achieves at least 6% absolute improvement across all evaluation settings in terms of Rank 1@µ, e.g., 8.22% for µ = 0.5. For the Rank 5@µ metric, we reach around 10% absolute improvement. It is worth noting that we exceed VSLNet, which also formulates this task from the perspective of MRC, by 9.27% and 13.73% in terms of Rank 1@µ = 0.5 and µ = 0.3, respectively. Charades-STA. We evaluate our method on both the VGG and I3D features used in previous works for fair comparison. As illustrated in Table 2, our approach reaches the highest Rank 1 scores no matter which feature is adopted. For the VGG feature, we improve the performance from 23.68% of DRN to 26.83% in terms of Rank 1@µ = 0.7. With the stronger I3D feature, our method also exceeds VSLNet in terms of Rank 1@µ = {0.5, 0.7} (i.e., 60.40% vs. 54.19% and 39.65% vs. 35.22%). ActivityNet-Captions. In Table 3, we compare our model with other competitive methods. Our model achieves the highest scores over all IoU thresholds in the evaluation except for Rank 5@µ = 0.5. In particular, our model outperforms the previous best method (i.e., 2D-TAN) by around 1.29% absolute improvement in terms of Rank 1@µ = 0.7. Given the same sampling strategy for moment candidates, this improvement is mostly attributed to the token-aware visual representation and the relationship mining among the choices.

Effectiveness of Network Components
We perform complete and in-depth studies on the effectiveness of our choice-query interactor and multi-choice relation constructor on the TACoS and Charades-STA datasets. On each dataset, we conduct five comparison experiments. First, we remove F_2 and R to explore the RaNet-base model, compared with only using F_2. Then, we integrate the interaction and relation modules in our third and fourth experiments, respectively. Finally, we show the best performance achieved by our full approach. Table 4 summarizes the grounding results in terms of Rank 1@µ ∈ {0.3, 0.5, 0.7}. Without the interaction and relation modules, our framework achieves 40.99% and 28.54% for µ = 0.3 and 0.5, respectively. This already outperforms the previous best method 2D-TAN, indicating the power of our modality-wise encoder. Adding the token-aware visual representation brings significant improvements on both datasets. Improvements are also observed when adding the relation module. These results demonstrate the effectiveness of our RaNet for temporal language grounding.

Improvement on different IoUs
To better understand our approach, we illustrate the performance gain over the previous best method, 2D-TAN, on the three datasets in terms of different µ ∈ (0, 1), as shown in Figure 4. The figure visualizes the detailed comparison between our model and 2D-TAN and shows that our approach consistently improves the performance, especially at higher IoUs (i.e., µ > 0.7). We observe that the relative improvement grows with the IoU threshold on the TACoS and ActivityNet Captions datasets.

Feature Initialization Functions
We conduct experiments to reveal the effect of different feature initialization functions. For a moment candidate (t^s_i, t^e_i), the corresponding feature sequence from V̂ is Y = {v̂_k}_{k=t^s_i}^{t^e_i}. We explore four types of operators (i.e., pooling, sampling, concatenation and addition) in the multi-choice generator. The first two consider all the information within the temporal span of the candidate: the pooling operator focuses on statistical characteristics, and the sampling operator serves as a weighted average. In contrast, the last two only consider the boundary information (v̂_{t^s_i} and v̂_{t^e_i}) of a moment candidate, which makes the cross-modal interaction boundary sensitive. The concatenation operator achieves the highest score across all evaluation criteria, which indicates that boundary-sensitive operators perform better than the statistical ones.
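The four operators can be sketched as below. This is a simplified illustration; the number of sampled snippets and the function name are our assumptions:

```python
import torch

def psi(v, s, e, mode="concat"):
    """Candidate feature initializers for a moment (s, e); v is (C, T)."""
    if mode == "pool":      # span statistics: average over the whole span
        return v[:, s:e + 1].mean(dim=1)
    if mode == "sample":    # average of uniformly sampled snippets
        idx = torch.linspace(s, e, steps=4).round().long()
        return v[:, idx].mean(dim=1)
    if mode == "concat":    # boundary-sensitive: concatenate endpoints
        return torch.cat([v[:, s], v[:, e]])
    if mode == "add":       # boundary-sensitive: sum endpoints
        return v[:, s] + v[:, e]
    raise ValueError(mode)
```

Note that "concat" doubles the channel dimension (to 2C) while the other three keep it at C, so a projection back to C is assumed before fusion.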

Word Embeddings Comparison
To further explore the effect of different textual features, we conduct experiments with four pretrained word embeddings (i.e., GloVe 6B, GloVe 42B, GloVe 840B and BERT-Base). GloVe (Pennington et al., 2014) is an unsupervised learning algorithm for obtaining vector representations of words, with variants trained on corpora of varying sizes. BERT (Devlin et al., 2019) is a language representation model considering bidirectional context, which has achieved state-of-the-art performance on many NLP tasks. All the GloVe vectors have 300 dimensions, whereas BERT-Base produces 768-dimensional vectors. Table 6 compares the performance of these four pretrained word embeddings on the TACoS dataset. The results show that better word embeddings (i.e., BERT) tend to yield better performance, suggesting that more attention should be paid to textual feature encoding. All models in our paper use the concatenation feature initialization function and GloVe 6B word vectors unless otherwise specified.

Table 7: Parameters and FLOPs of our RaNet and the previous best method 2D-TAN, which also considers moment-level relations. M and G represent 10^6 and 10^9, respectively.

Efficiency of Our RaNet
Both fully-connected graph neural networks and stacked convolution layers result in high computational complexity and consume a large amount of GPU memory. With the sparsely-connected graph attention module used in our Multi-choice Relation Constructor, we capture moment-wise relations from global dependencies more efficiently and effectively. Table 7 shows the parameters and FLOPs of our model and 2D-TAN, which uses several convolution layers to capture the context of adjacent moment candidates. RaNet is more lightweight, with only 11M parameters compared with 92M for 2D-TAN on ActivityNet. Compared with RaNet, RaNet-base replaces the relation constructor with the same 2D convolutional layers as 2D-TAN; their comparison on FLOPs further indicates the efficiency of our relation constructor against plain convolution layers.

Qualitative Analysis
We further show some examples from the ActivityNet Captions dataset in Figure 5. The predictions of our approach are closer to the ground truth than those of our baseline model, i.e., the one with F_2 and R removed in Table 4. Since the moment candidate setting is the same, this also demonstrates the effect of our proposed modules. With the interaction and relation construction modules, our approach can select the video moment choice that best matches the query sentence. In turn, this reflects that capturing token-aware visual representations for moment candidates and the relations among candidates helps the network score candidates better.

Conclusion
In this paper, we propose a novel Relation-aware Network to address the problem of temporal language grounding in videos. We first formulate this task from the perspective of multi-choice reading comprehension. Then we propose to interact the visual and textual modalities in a coarse-and-fine fashion to obtain token-aware and sentence-aware representations of each choice. Further, a GAT layer is introduced to mine the relations among the choices for better ranking. Our model is efficient and outperforms state-of-the-art methods on three benchmarks, i.e., ActivityNet-Captions, TACoS, and Charades-STA.