Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos

We address the problem of temporal sentence localization in videos (TSLV). Traditional methods follow a top-down framework that localizes the target segment with pre-defined segment proposals. Although they achieve decent performance, the proposals are handcrafted and redundant. Recently, the bottom-up framework has attracted increasing attention due to its superior efficiency: it directly predicts, for each frame, the probability of being a boundary. However, the performance of bottom-up models is inferior to that of their top-down counterparts, as they fail to exploit segment-level interactions. In this paper, we propose an Adaptive Proposal Generation Network (APGN) that maintains the segment-level interaction while improving efficiency. Specifically, we first perform foreground-background classification on the video and regress on the foreground frames to adaptively generate proposals. In this way, the handcrafted proposal design is discarded and proposal redundancy is reduced. A proposal consolidation module is then developed to enhance the semantics of the generated proposals. Finally, we locate the target moments with these generated proposals following the top-down framework. Extensive experiments show that our proposed APGN significantly outperforms previous state-of-the-art methods on three challenging benchmarks.


Introduction
Temporal sentence localization in videos is an important yet challenging task in natural language processing, which has drawn increasing attention over the last few years due to its vast potential applications in information retrieval (Dong et al., 2019; Yang et al., 2020) and human-computer interaction (Singha et al., 2018). It aims to ground the video segment most relevant to a given sentence query. As shown in Figure 1 (a), most of the video content is irrelevant to the query (background) while only a short segment matches it (foreground). Therefore, video and query information need to be deeply incorporated to distinguish the fine-grained details of different video segments. Most previous works (Gao et al., 2017; Chen et al., 2018; Zhang et al., 2019; Yuan et al., 2019a; Zhang et al., 2020b; Liu et al., 2020a) follow the top-down framework, which pre-defines a large set of segment candidates (a.k.a. proposals) in the video with sliding windows and measures the similarity between the query and each candidate. The best segment is then selected according to this similarity. Although these methods achieve strong performance, they are sensitive to proposal quality and localize slowly due to the redundant proposals. Recently, several works (Rodriguez et al., 2020; Zhang et al., 2020a; Yuan et al., 2019b) exploit the bottom-up framework, which directly predicts the probability of each frame being the start or end boundary of the target segment. These methods are proposal-free and much more efficient. However, they neglect the rich information between start and end boundaries by failing to capture the segment-level interaction. Thus, the performance of bottom-up models has so far lagged behind that of their top-down counterparts.
To avoid the inherent drawbacks of proposal design in the top-down framework while maintaining the localization performance, in this paper we propose an adaptive proposal generation network (APGN) for efficient and effective localization. Firstly, we perform boundary regression on the foreground frames to generate proposals, where the foreground frames are obtained by a foreground-background classification over the entire video. In this way, the noisy responses on the background frames are attenuated, and the generated proposals are more adaptive and discriminative than the pre-defined ones. Secondly, we perform proposal ranking to select the target segment in a top-down manner upon these generated proposals. As the number of proposals is much smaller than in pre-defined methods, the ranking stage is more efficient. Furthermore, we additionally model proposal-wise relations to distinguish their fine-grained semantic details before the proposal ranking stage.
To achieve the above framework, APGN first generates query-guided video representations after encoding the video and query features, and then predicts the foreground frames using a binary classification module. Subsequently, a regression module generates a proposal on each foreground frame by regressing the distances from that frame to the start and end segment boundaries. At this point, each generated proposal carries only independent, coarse semantics. To capture higher-level interactions among proposals, we encode proposal-wise features that incorporate both positional and semantic information, and represent the proposals as nodes of a proposal graph to reason about the correlations among them. Consequently, each updated proposal obtains more fine-grained details for the subsequent boundary refinement process.
Our contributions are summarized as follows:
• We propose an adaptive proposal generation network (APGN) for the TSLV task, which adaptively generates discriminative proposals without handcrafted design, making localization both effective and efficient.
• To further refine the semantics of the generated proposals, we introduce a proposal graph that consolidates proposal-wise features by reasoning about their higher-order relations.
• We conduct experiments on three challenging datasets (ActivityNet Captions, TACoS, and Charades-STA), and the results show that our proposed APGN significantly outperforms existing state-of-the-art methods.

Related Work
Temporal sentence localization in videos is a task introduced recently (Gao et al., 2017; Anne Hendricks et al., 2017), which aims to localize the video segment most relevant to a given sentence description. Various algorithms (Anne Hendricks et al., 2017; Gao et al., 2017; Chen et al., 2018; Zhang et al., 2019; Yuan et al., 2019a; Zhang et al., 2020b; Qu et al., 2020; Yang et al., 2021) have been proposed within the top-down framework, which first samples candidate segments from a video, then integrates the sentence representation with each video segment individually and evaluates their matching relationship. Some of them (Anne Hendricks et al., 2017; Gao et al., 2017) use sliding windows as proposals and then compare each proposal with the input query in a joint multi-modal embedding space. To improve the quality of the proposals, (Zhang et al., 2019; Yuan et al., 2019a) pre-cut the video at each frame with multiple pre-defined temporal scales, and directly integrate sentence information with fine-grained video clips for scoring. (Zhang et al., 2020b) further builds a 2D temporal map to construct all possible segment candidates by treating each frame as a start or end boundary, and matches their semantics with the query information. Although these methods achieve great performance, they are severely limited by the heavy computation of proposal matching/ranking and are sensitive to the quality of the pre-defined proposals.
Recently, many methods (Rodriguez et al., 2020; Yuan et al., 2019b; Mun et al., 2020; Zeng et al., 2020; Zhang et al., 2020a; Nan et al., 2021) utilize the bottom-up framework to overcome the above drawbacks. They do not rely on segment proposals and directly select the starting and ending frames by leveraging cross-modal interactions between video and query. Specifically, they predict two probabilities at each frame, indicating whether the frame is the start or end frame of the ground-truth video segment. Although these methods perform segment localization more efficiently, they lose the segment-level interaction, and the redundant regression on background frames may introduce disturbing noise into the boundary decision, leading to worse localization performance than top-down methods.
In this paper, we propose to preserve the segment-level interaction while speeding up the localization efficiency. Specifically, we design a binary classification module on the entire video to filter out the background responses, which helps the model focus more on the discriminative frames. At the same time, we replace the pre-defined proposals with the generated ones and utilize a proposal graph for refinement.

Figure 2: Overall architecture of APGN. (a) Given a video and a query, we first encode and interact them to obtain query-guided video features. (b) Then, along with regressing boundaries on each frame, we perform foreground-background classification to identify the foreground frames, whose corresponding predicted boundaries are further taken as the generated segment proposals. (c) We further encode each proposal and refine them using a graph convolutional network. (d) At last, we predict the confidence score and boundary offset for each proposal.

Overview
Given an untrimmed video V and a sentence query Q, the TSLV task aims to localize the start and end timestamps (τ_s, τ_e) of the specific video segment referred to by the sentence query. We focus on addressing this task by adaptively generating proposals. To this end, we propose a binary classification module to filter out the redundant responses on background frames. Then, each foreground frame with its regressed start-end boundaries is taken as a generated segment proposal. In this way, the number of generated proposals is much smaller than the number of pre-defined ones, making the model more efficient. Besides, a proposal graph is further developed to refine the proposal features by learning their higher-level interactions. Finally, a confidence score and boundary offsets are predicted for each proposal. Figure 2 illustrates the overall architecture of our APGN.

Feature Encoders
Video encoder. Given a video V, we represent it as V = {v_t}_{t=1}^T, where v_t is the t-th frame and T is the length of the entire video. We first extract the features with a pre-trained network, and then employ a self-attention (Vaswani et al., 2017) module to capture the long-range dependencies among video frames. We also utilize a Bi-GRU (Chung et al., 2014) to learn the sequential characteristics. The final video features are denoted as V = {v_t}_{t=1}^T ∈ R^{T×D}.

Query encoder. Given a query Q = {q_n}_{n=1}^N, where q_n is the n-th word and N is the length of the query, and following previous works (Zhang et al., 2019; Zeng et al., 2020), we first generate the word-level embeddings using GloVe (Pennington et al., 2014), and likewise employ a self-attention module and a Bi-GRU layer to further encode the query features as Q = {q_n}_{n=1}^N ∈ R^{N×D}.

Video-Query interaction. After obtaining the encoded features V, Q, we utilize a co-attention mechanism (Lu et al., 2019) to capture the cross-modal interactions between video and query features. Specifically, we first calculate the similarity scores between V and Q as:

S = V W_S Q^T ∈ R^{T×N},  (1)

where W_S ∈ R^{D×D} projects the query features into the same latent space as the video. Then, we compute two attention weights as:

A = S_r Q ∈ R^{T×D},  B = S_r S_c^T V ∈ R^{T×D},  (2)

where S_r and S_c are the row- and column-wise softmax results of S, respectively. We compose the final query-guided video representation by learning its sequential features as follows:

v_t' = BiGRU([v_t; a_t; v_t ⊙ a_t; v_t ⊙ b_t]),  (3)

where [;] is the concatenation operation, and ⊙ is the element-wise multiplication.
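The co-attention step described above can be sketched as follows. This is a minimal NumPy illustration under our own naming; the Bi-GRU that consumes the concatenated features, and the composition of the attended features, follow the paper's textual description rather than a released implementation:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(V, Q, W_S):
    """Query-guided video features via co-attention.

    V: (T, D) video features, Q: (N, D) query features,
    W_S: (D, D) projection of the query into the video space.
    Returns the (T, 4D) concatenated representation that the full
    model would feed into a Bi-GRU.
    """
    S = V @ W_S @ Q.T          # (T, N) similarity scores
    S_r = softmax(S, axis=1)   # row-wise softmax
    S_c = softmax(S, axis=0)   # column-wise softmax
    A = S_r @ Q                # (T, D) query-attended features
    B = S_r @ S_c.T @ V        # (T, D) video-attended features
    return np.concatenate([V, A, V * A, V * B], axis=1)
```

The concatenation keeps the raw video features alongside the attended ones, so the downstream recurrent layer can weigh raw and cross-modal evidence per frame.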

Proposal Generation
Given the query-guided video features V', we aim to generate a proposal tuple (t, l_s^t, l_e^t) for each foreground frame v_t, where l_s^t and l_e^t denote the distances from frame v_t to the starting and ending segment boundaries, respectively. To this end, we first perform binary classification on all frames to distinguish foreground from background frames, and then treat the foreground ones as positive samples and regress the segment boundaries on these frames to form the generated proposals.

Foreground-background classification. In the TSLV task, most videos are more than two minutes long, while the lengths of the annotated target segments range from only several seconds to one minute (e.g., on the ActivityNet Captions dataset). Therefore, the background frames introduce much noise that may disturb accurate segment localization. To alleviate this, we first classify the background frames and filter out their responses in the subsequent regression. Supervised by the foreground/background annotations, we design a binary classification module with three fully-connected (FC) layers to predict the foreground probability p_t of each video frame. Considering the unbalanced foreground/background distribution, we formulate a balanced binary cross-entropy loss as:

L_cls = -(1/T) Σ_{t=1}^{T} ((T_back/T) y_t log(p_t) + (T_fore/T) (1 - y_t) log(1 - p_t)),  (4)

where y_t ∈ {0, 1} is the ground-truth label of frame t, and T_fore and T_back are the numbers of foreground and background frames, respectively. T is the total number of video frames. Therefore, we can differentiate between foreground and background frames during both training and testing.

Boundary regression. With the query-guided video representation V' and the predicted 0-1 binary sequence, we then design a boundary regression module to predict the distance from each foreground frame to the start (or end) frame of the video segment corresponding to the query. We implement this module with three 1D convolution layers with two output channels.
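The balanced classification loss can be sketched as below. This is an illustrative NumPy version; the reweighting scheme (each term scaled by the opposite class's frequency) is our reading of "balanced", and the paper's exact coefficients may differ:

```python
import numpy as np

def balanced_bce(p, y, eps=1e-8):
    """Class-balanced binary cross-entropy over T frames.

    p: (T,) predicted foreground probabilities, y: (T,) 0/1 labels.
    The foreground term is weighted by the background fraction and
    vice versa, so scarce foreground frames are not drowned out.
    """
    T = len(y)
    n_fore = max(y.sum(), 1)          # guard against empty classes
    n_back = max(T - y.sum(), 1)
    pos = (n_back / T) * y * np.log(p + eps)
    neg = (n_fore / T) * (1 - y) * np.log(1 - p + eps)
    return -(pos + neg).sum() / T
```

With a heavily imbalanced label vector, a uniform BCE would be dominated by the background term; this weighting keeps the two classes on comparable footing.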
Given the predicted distance pair (l_s^t, l_e^t) and the ground-truth distances (g_s^t, g_e^t), we define the regression loss as:

L_reg = (1/T_fore) Σ_{t=1}^{T_fore} (1 - IoU((t, l_s^t, l_e^t), (t, g_s^t, g_e^t))),  (5)

where IoU(·) computes the Intersection over Union (IoU) score between the predicted segment (t - l_s^t, t + l_e^t) and its ground-truth counterpart (t - g_s^t, t + g_e^t). After that, we represent the generated proposals as tuples {(t, l_s^t, l_e^t)}_{t=1}^{T_fore} based on the regression results of the foreground frames.
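The per-frame IoU regression term can be sketched as follows (hypothetical helper names; a minimal Python version of Eq. (5) for a single foreground frame):

```python
def segment_iou(s1, e1, s2, e2):
    """IoU of the 1-D segments [s1, e1] and [s2, e2]."""
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = max(e1, e2) - min(s1, s2)
    return inter / union if union > 0 else 0.0

def iou_regression_loss(t, l_s, l_e, g_s, g_e):
    """1 - IoU for one foreground frame t.

    (l_s, l_e): predicted distances from frame t to start/end,
    (g_s, g_e): ground-truth distances. The predicted segment
    [t - l_s, t + l_e] is compared to [t - g_s, t + g_e].
    """
    return 1.0 - segment_iou(t - l_s, t + l_e, t - g_s, t + g_e)
```

Because both segments are anchored at the same frame t, they always overlap, so the loss stays in [0, 1) and provides a smooth training signal.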

Proposal Consolidation
So far, we have generated a number of proposals significantly smaller than the pre-defined sets used in the existing top-down framework, making the final scoring and ranking process much more efficient. To further refine the proposal features for more accurate segment localization, we explicitly model higher-order interactions between the generated proposals to learn their relations. As shown in Figure 3, proposal 1 and proposal 2 contain the same semantics of "blue" and "hops"; we need to model their positional distance to distinguish them and refine their features for a better understanding of the phrase "second time". Also, for proposals that are local neighbors (proposals 2 and 3), we have to learn their semantic distance to refine their representations. Therefore, in our APGN, we first encode each proposal feature with both a positional embedding and frame-wise semantic features, and then define a graph convolutional network (GCN) over the proposals for refinement.

Proposal encoder. For each proposal tuple (t, l_s^t, l_e^t), we represent its segment boundary as (t - l_s^t, t + l_e^t). Before aggregating the features of the frames contained within this segment boundary, we first concatenate a position embedding emb_t^pos to each frame-wise feature v_t, in order to inject position information for frame t as follows:

v̂_t = [v_t; emb_t^pos],  (6)

where emb_t^pos denotes the position embedding of the t-th position and d is its dimension. We follow (Vaswani et al., 2017) and use sine and cosine functions of different frequencies to compose the position embeddings:

emb_t^pos[2j] = sin(t / 10000^{2j/d}),  emb_t^pos[2j + 1] = cos(t / 10000^{2j/d}),  (7)

where 2j and 2j + 1 are the even and odd indices of the position embedding. In this way, each dimension of the positional encoding corresponds to a sinusoid, allowing the model to easily learn to attend to absolute positions.
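The sinusoidal position embeddings of Eq. (7) can be computed as in the following NumPy sketch (assuming an even embedding dimension d):

```python
import numpy as np

def sinusoidal_position_embedding(T, d):
    """Sinusoidal position embeddings (Vaswani et al., 2017).

    Returns a (T, d) matrix where even dimensions use sin and odd
    dimensions use cos, each at frequency 1 / 10000^(2j/d).
    Assumes d is even.
    """
    emb = np.zeros((T, d))
    pos = np.arange(T)[:, None]         # (T, 1) positions
    j = np.arange(0, d, 2)[None, :]     # even dimension indices
    freq = 1.0 / (10000.0 ** (j / d))
    emb[:, 0::2] = np.sin(pos * freq)
    emb[:, 1::2] = np.cos(pos * freq)
    return emb
```

Each column is a sinusoid of a distinct wavelength, so relative frame offsets correspond to fixed linear transformations of the embedding.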
Given the frame features {v̂_t}_{t=1}^{T_fore} and a proposal segment (t - l_s^t, t + l_e^t), we encode the feature vector p_t of the t-th proposal by aggregating the features of the frames contained in the segment as:

p_t = MLP_2(Pool({MLP_1(v̂_i)}_{i=t-l_s^t}^{t+l_e^t})),  (8)

where each MLP has two FC layers and Pool(·) denotes max-pooling. The frames of each proposal are independently processed by MLP_1 before being pooled (channel-wise) into a single feature vector and passed to MLP_2, where information from different frames is further combined. Thus, we can represent the encoded proposal feature as p_t ∈ R^{1×(D+d)}.

Proposal graph. We construct a graph over the proposal features {p_t}_{t=1}^{T_fore}, where each node of the graph is a proposal associated with both positional and semantic features. We fully connect all node pairs, and define the relation between each proposal pair (p_t, p_{t'}) for edge convolution (Wang et al., 2018) as:

e_{(t,t')} = θ_1 (p_{t'} - p_t) + θ_2 p_t,  (9)

where θ_1 and θ_2 are learnable parameters. We update each proposal feature p_t to p̃_t as follows:

p̃_t = max_{t'} ReLU(e_{(t,t')}).  (10)

This GCN module consists of k stacked graph convolutional layers. After the above proposal consolidation with the graph, we obtain the refined proposal features.
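A single edge-convolution layer over the fully connected proposal graph might look like the following NumPy sketch. The max aggregation follows the edge-convolution formulation of Wang et al. (2018); the paper's exact aggregation and nonlinearity placement are assumptions of this sketch:

```python
import numpy as np

def edge_conv_layer(P, theta1, theta2):
    """One edge-convolution layer on a fully connected graph.

    P: (M, D) proposal features; theta1, theta2: (D_out, D) weights.
    For every ordered pair (t, t') the edge feature
    theta1 @ (p_t' - p_t) + theta2 @ p_t is computed, and node t is
    updated by max-aggregating its ReLU'd edges.
    """
    diff = P[None, :, :] - P[:, None, :]                 # (M, M, D): p_t' - p_t
    edges = diff @ theta1.T + (P @ theta2.T)[:, None, :]  # (M, M, D_out)
    return np.maximum(edges, 0.0).max(axis=1)             # (M, D_out)
```

The difference term p_t' - p_t captures the relation between proposals, while the theta2 term preserves each node's own semantics, matching Eq. (9).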

Localization Head
After proposal consolidation, we feed the refined features P̃ = {p̃_t}_{t=1}^{T_fore} into two separate heads to predict their confidence scores and boundary offsets for proposal ranking and refinement. Specifically, we employ two MLPs on each feature p̃_t as:

r_t = σ(MLP_r(p̃_t)),  (δ_s^t, δ_e^t) = MLP_o(p̃_t),  (11)

where r_t ∈ (0, 1) is the confidence score and (δ_s^t, δ_e^t) are the offsets. Therefore, the final predicted segment of proposal t can be represented as (t - l_s^t + δ_s^t, t + l_e^t + δ_e^t). To learn the confidence scoring rule, we first compute the IoU score o_t between each proposal segment and the ground truth (τ_s, τ_e), and then adopt the alignment loss:

L_align = -(1/T_fore) Σ_{t=1}^{T_fore} (o_t log(r_t) + (1 - o_t) log(1 - r_t)).  (12)

Given the ground-truth boundary offsets (δ̂_s^t, δ̂_e^t) of proposal t, we also fine-tune its predicted offsets with a boundary loss:

L_b = (1/T_fore) Σ_{t=1}^{T_fore} (SL_1(δ_s^t - δ̂_s^t) + SL_1(δ_e^t - δ̂_e^t)),  (13)

where SL_1(·) denotes the smooth L1 loss function.
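The proposal refinement and the smooth L1 penalty used in the boundary loss can be sketched as follows (illustrative helper names, not the paper's implementation; the smooth L1 uses the standard threshold of 1):

```python
import numpy as np

def sigmoid(x):
    """Squash a score logit into (0, 1), as for the confidence r_t."""
    return 1.0 / (1.0 + np.exp(-x))

def smooth_l1(x):
    """Smooth L1: quadratic inside |x| < 1, linear outside."""
    x = abs(x)
    return 0.5 * x * x if x < 1 else x - 0.5

def refine_proposal(t, l_s, l_e, delta_s, delta_e):
    """Apply predicted offsets to proposal (t, l_s, l_e).

    Returns the refined segment (t - l_s + delta_s, t + l_e + delta_e).
    """
    return t - l_s + delta_s, t + l_e + delta_e
```

The smooth L1 term keeps gradients bounded for badly misplaced boundaries while remaining quadratic (and hence gentle) near the correct offsets.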
At last, our APGN model is trained end-to-end from scratch using the multi-task loss:

L = λ_1 L_cls + λ_2 L_reg + λ_3 L_align + λ_4 L_b,  (14)

where λ_1, λ_2, λ_3, λ_4 balance the four terms.

Datasets

TACoS. This dataset (Regneri et al., 2013) collects 127 long videos, mainly about cooking scenarios, and thus lacks diversity. We use the same split as (Gao et al., 2017), with 10,146, 4,589, and 4,083 sentence-video pairs for training, validation, and testing, respectively.

Charades-STA. This dataset (Gao et al., 2017) consists of 9,848 videos of daily indoor activities. There are 12,408 sentence-video pairs for training and 3,720 pairs for testing.

Evaluation metric. Following (Zhang et al., 2019; Zeng et al., 2020), we adopt "R@n, IoU=m" as our evaluation metric, defined as the percentage of queries for which at least one of the top-n selected moments has an IoU larger than m with the ground truth.

Implementation Details
Following (Zhang et al., 2020b; Zeng et al., 2020), for the video input we apply a pre-trained C3D network on all three datasets to obtain embedded features. We also extract I3D (Carreira and Zisserman, 2017) features. The initial learning rate is set to 0.0001 and is divided by 10 when the loss reaches a plateau. λ_1, λ_2, λ_3, λ_4 in the loss function are set to 0.1, 1, 1, and 1, respectively, decided according to the magnitudes of the corresponding loss terms.
Though TACoS suffers from similar kitchen backgrounds and cooking objects across its videos, it is worth noting that our APGN still achieves significant improvements. On the Charades-STA dataset, for fair comparison with other methods, we perform experiments with the same features (i.e., VGG, C3D, and I3D) reported in their papers. APGN reaches the highest results over all evaluation metrics.

Comparison on efficiency. We compare the efficiency of our APGN with previous methods on a single Nvidia TITAN XP GPU on the TACoS dataset. As shown in Table 4, we achieve much faster processing speed with relatively fewer learnable parameters. The reason is twofold: first, APGN does not process overlapped sliding windows as CTRL does and generates fewer proposals than pre-defined methods such as 2DTAN and CMIN, and is thus more efficient; second, APGN applies neither the many convolution layers of 2DTAN nor the multi-level feature fusion modules of DRN for cross-modal interaction, and thus has fewer parameters.

Ablation Study
Main ablation. As shown in Table 5, we verify the contribution of each component of our model. Starting from the backbone model (Figure 2 (a)), we first implement the baseline model ① by directly adding the top-down localization head (Figure 2 (d)). In this model, we adopt pre-defined proposals as in (Zhang et al., 2019). After adding the binary classification module in ②, we find that the classification module effectively filters out the redundant pre-defined proposals on the large number of background frames. When further applying adaptive proposal generation in ③, the generated proposals perform better than the pre-defined ones of ②. Note that in ③ we directly encode proposal-wise features by max-pooling, and the classification module also contributes by filtering out negative generated proposals. To capture more fine-grained semantics for proposal refinement, we introduce a proposal encoder (model ④) for discriminative feature aggregation and a proposal graph (model ⑤) for proposal-wise feature interaction. Although each of them alone brings only a 1-3% improvement, the performance increases significantly when both are utilized (model ⑥).
Investigation on the video/query encoder. To investigate whether a Transformer (Vaswani et al., 2017) can boost our APGN, we replace the GRU in the video/query encoder with a simple Transformer and observe some improvement. However, it brings a larger model size and lower speed.

Effect of the balanced loss. In the binary classification module, we formulate the typical loss function into a balanced one. As shown in Table 7, the model w/ balanced loss improves greatly (2.04%, 1.51%) over the w/o variant, which demonstrates the importance of accounting for the unbalanced distribution in the classification process.

Investigation on the proposal encoder. In the proposal encoder, we discard the positional embedding as w/o position, and replace the max-pooling with mean-pooling as w/ mean pooling. From Table 8, we observe that the positional embedding helps to learn temporal distance (boost of 2.46%, 1.95%), and max-pooling aggregates more discriminative features (boost of 1.49%, 0.78%) than mean-pooling.

Investigation on the proposal graph. In Table 9, we also analyze the proposal graph. Compared to the w/ edge convolution model (Wang et al., 2018), w/ edge attention directly utilizes co-attention (Lu et al., 2016) to compute the similarity of each node pair and updates them by weighted summation, which performs worse than the former.

Number of graph layers. As shown in Table 9, the model achieves the best result with 2 graph layers, and the performance drops as the number of layers grows. Our analysis is that more graph layers cause the over-smoothing problem (Li et al., 2018), since propagation between nodes accumulates.

Plug-and-play. Our proposed adaptive proposal generation can serve as a plug-and-play component for existing methods.
As shown in Table 10, for top-down methods, we keep their feature encoders and video-query interaction, and add our proposal generation and proposal consolidation before their localization heads. For bottom-up methods, we first replace their regression heads with our proposal generation process and then add the proposal consolidation process. The results show that our proposal generation and proposal consolidation bring large improvements to both types of methods.

Qualitative Results
To qualitatively validate the effectiveness of our APGN, we display two typical examples in Figure 4. It is challenging to accurately localize the semantics of "for a second time" in the first video, because there are two separate segments in which the same object, the "girl in the blue dress", performs the same activity, "hops". For comparison, the previous method DRN fails to understand the meaning of the phrase "second time" and grounds both segment parts. By contrast, our method can distinguish these two segments in the temporal dimension thanks to the positional embedding in the developed proposal graph, and thus achieves more accurate localization results. Furthermore, we also display the foreground/background class of each frame in this video. With the help of the proposal consolidation module, the segment proposals of the "first time" are filtered out, and all of the final top-10 ranked positive frames fall in the target segment.

Conclusion
In this paper, we introduce APGN, a new method for temporal sentence localization in videos. Our core idea is to adaptively generate discriminative proposals to achieve both effective and efficient localization. Specifically, we first introduce binary classification before the boundary regression to identify background frames, which helps filter out their noisy responses. Then, the regressed boundaries on the predicted foreground frames are taken as segment proposals, which avoids the large number of poor-quality proposals among the pre-defined ones in the top-down framework. We further learn higher-level feature interactions between the generated proposals for refinement via a graph convolutional network. Our framework achieves state-of-the-art performance on three challenging benchmarks, demonstrating the effectiveness of the proposed APGN.