Natural Language Video Localization with Learnable Moment Proposals

Given an untrimmed video and a natural language query, Natural Language Video Localization (NLVL) aims to identify the video moment described by the query. Existing methods for this task can be roughly grouped into two categories: 1) propose-and-rank models first define a set of hand-designed moment candidates and then select the best-matching one; 2) proposal-free models directly predict the two temporal boundaries of the referential moment from the frames. Currently, almost all propose-and-rank methods perform worse than their proposal-free counterparts. In this paper, we argue that the performance of propose-and-rank models is underestimated due to their predefined proposal designs: 1) hand-designed rules can hardly guarantee complete coverage of the targeted segments; 2) densely sampled candidate moments cause redundant computation and degrade the performance of the ranking process. To this end, we propose a novel model termed LPNet (Learnable Proposal Network for NLVL) with a fixed set of learnable moment proposals. The positions and lengths of these proposals are dynamically adjusted during training. Moreover, we propose a boundary-aware loss that leverages frame-level information and further improves performance. Extensive experiments on two challenging NLVL benchmarks demonstrate the effectiveness of LPNet over existing state-of-the-art methods.


Introduction
Natural Language Video Localization (NLVL), also known as video grounding or video moment localization, has received unprecedented attention in both the CV and NLP communities (Gao et al., 2017; Hendricks et al., 2017). As shown in Figure 1, NLVL aims to localize the video segment relevant to the query by locating its start and end timestamps in an untrimmed video. It is challenging since it requires not only understanding the video and sentence content but also finding precise temporal boundaries. Moreover, NLVL is helpful to numerous downstream video understanding tasks, e.g., content retrieval (Shao et al., 2018), relation detection (Gao et al., 2021), and VQA (Lei et al., 2018; Ye et al., 2017).

* Long Chen is the corresponding author. The work started when Long Chen was at Zhejiang University. Source code is available at https://github.com/xiaoneil/LPNet/.

Figure 1: An illustrative example of NLVL. Given a video and the query "The lady spins the stick using her neck.", NLVL localizes the video segment corresponding to the query with the start timestamp (66.79s) and the end timestamp (85.34s).
Currently, state-of-the-art NLVL methods can be roughly grouped into two categories according to how the video segments are detected, namely propose-and-rank and proposal-free methods. The idea of the propose-and-rank approach (Gao et al., 2017; Hendricks et al., 2018; Liu et al., 2018b; Chen et al., 2018; Ge et al., 2019; Xu et al., 2019; Zhang et al., 2019) is intuitive and follows the same spirit as anchor-based object detectors, e.g., Faster R-CNN (Ren et al., 2015). These methods first define a series of manually-designed temporal bounding boxes as moment proposals. Then, they match each candidate with the sentence in a common feature space and compute matching scores for all candidates, thereby reformulating the localization problem as a ranking problem. However, these methods suffer from two inherent drawbacks due to the predefined manner: 1) Even with elaborately designed hyperparameters (e.g., temporal scales and sample rates), hand-designed rules can hardly guarantee complete coverage of the targeted video segments and consequently tend to produce inaccurate boundaries. 2) A vast number of proposals are required to achieve high recall, which causes redundant computation and degrades the results of the ranking process.
Another line of work, the proposal-free approach (Chen and Jiang, 2019; Lu et al., 2019; Zhang et al., 2020a), mitigates these defects. Instead of predefining a series of temporal proposals, these methods directly predict the start and end boundaries or regress the locations of the query-related video segments. Benefiting from this design, proposal-free methods get rid of placing superfluous temporal anchors, i.e., they are more computation-efficient. Furthermore, without fixing the positions and lengths of moment proposals, these methods can flexibly adapt to video segments of diverse lengths. Compared to propose-and-rank methods, however, proposal-free methods have two main limitations: 1) They overlook the rich information between the start and end boundaries because it is hard for them to model segment-level interactions. 2) They always suffer from a severe imbalance between positive and negative training samples.
Up to now, almost all propose-and-rank methods have shown inferior performance to their proposal-free counterparts. We argue that the performance of propose-and-rank methods is underestimated due to the current predefined designs. In this paper, we propose a novel propose-and-rank model with learnable moment proposals, termed LPNet. Without fixed dense proposals, only a sparse set of proposals is required to obtain decent performance. In addition, there is no need to worry about the design of hyperparameters, because the proposals adapt to targeted segments with diverse positions and lengths. Meanwhile, as a propose-and-rank method, LPNet also avoids the defects of the proposal-free approach.
Specifically, LPNet places a fixed set of learnable temporal proposals represented by 2-d coordinates indicating the centers and lengths of video segments. These proposals are used to extract visual features of Moments of Interest (MoI). In order to model the relative relations among candidates, we introduce a module that makes the candidates interact with each other via the self-attention mechanism. An individual classifier then predicts the matching score between these proposals and the sentence query. During training, the coordinates of the proposal with the maximum score are adjusted by a dynamic adjustor at each iteration. After sufficient iterations, the learned moment proposals statistically represent the prior distribution of ground-truth segments in the dataset. In addition, we empirically find that propose-and-rank models always obtain sub-optimal results without frame-level supervision. We therefore propose a boundary-aware predictor that regularizes the model to utilize frame-level information, which further boosts the grounding performance.
We demonstrate the effectiveness of our model on two challenging NLVL benchmarks (Charades-STA and ActivityNet Captions) through extensive ablative studies. In particular, our model achieves new state-of-the-art performance across both datasets and evaluation metrics.

Related Work
Natural Language Video Localization. The NLVL task was first introduced in (Hendricks et al., 2017; Gao et al., 2017). Existing methods can be roughly grouped into two categories, namely propose-and-rank and proposal-free methods.
The propose-and-rank approaches (Gao et al., 2017; Hendricks et al., 2017, 2018; Liu et al., 2018b,a; Xu et al., 2019; Zhang et al., 2019) solve the NLVL task by matching predefined video moment proposals (e.g., generated in a sliding-window manner) with the language query and choosing the best-matching video segment as the final result. Gao et al. (2017) proposed the CTRL model, which takes video moments predefined through sliding windows as input, jointly models the text query and video clips, and outputs alignment scores and action boundary regression results for the candidate clips. Hendricks et al. (2017) proposed MCN, which effectively localizes language queries in videos by integrating local and global video features over time. To improve the performance of the propose-and-rank approach, some works devote themselves to improving the quality of the proposals. Xu et al. (2019) injected text features early when generating clip proposals to eliminate unlikely clips and thus speed up processing and boost performance. Zhang et al. (2019) proposed to explicitly model moment-wise temporal relations as a structured graph and devised an iterative graph adjustment network to jointly learn the best structure in an end-to-end manner. Other works mainly focus on designing more effective multi-modal interaction networks.

Figure 2: The architecture of LPNet for NLVL. The feature extractor transforms the input video and language query into the feature space. The feature encoder further refines the video and language features and produces the multi-modal feature. A set of learnable proposal boxes is introduced, which can be updated by the dynamic adjustor during training. The interactive rating layer scores each candidate generated by the proposal boxes, and the candidate with the highest score is the final prediction. The boundary-aware predictor takes the multi-modal feature as input and predicts the distributions of the start and end timestamps, an auxiliary task that regularizes the model for better performance.
Liu et al. (2018b) designed a language-temporal attention network to learn word attention based on the temporal context information in the video. Liu et al. (2018a) designed a memory attention model to dynamically compute the visual attention over the query and its context information. However, these models are sensitive to the heuristic rules. Proposal-free approaches (Lu et al., 2019; Chen et al., 2018; Zhang et al., 2020a) directly predict, for each frame, the probability of being a boundary frame of the ground-truth video segment, or directly regress the segment location. Some methods directly regress the temporal coordinates from the global attention outputs. Zhang et al. (2020a) regarded the NLVL task as a span-based QA problem by treating the input video as a text passage and directly classified the start and end points. To further improve performance, some works focus on alleviating the imbalance between positive and negative samples. Lu et al. (2019), among others, regarded all frames falling in the ground-truth segment as foreground, and each foreground frame regresses the distances from its location to the two boundaries.
There are also other works that solve the NLVL task with reinforcement learning, formulating the selection of start and end timestamps as a sequential decision-making process. Some concurrent NLVL works also borrow the design of Vision Transformers (Dosovitskiy et al., 2021) to explore Transformer-based NLVL models (Cao et al., 2021).

End-to-End Object Detection. The development of NLVL is inspired by the success of object detection methods. Object detection aims to obtain a tight bounding box and a class label for each object, and detectors can be categorized into anchor-based and anchor-free approaches. Traditional anchor-based models (Ren et al., 2015; Dai et al., 2016) have dominated this area for many years; they place a series of anchor boxes uniformly and perform classification and regression to determine the position and class of each object. Anchor-free models (Law and Deng, 2018; Duan et al., 2019), promoted by the development of keypoint detection, are becoming prosperous; they directly predict keypoints and group them together to determine objects. Recently, end-to-end object detectors based on sparse candidates have drawn a large amount of attention. DETR (Carion et al., 2020) utilizes a sparse set of object queries to interact with the global feature. Benefiting from bipartite matching between predictions and ground truths, DETR can discard hand-designed components such as non-maximum suppression.

Approach
We formally define the NLVL task as follows. Given an untrimmed video $V = \{f_t\}_{t=1}^{T}$ and a language query $Q = \{w_m\}_{m=1}^{M}$, where $T$ and $M$ are the numbers of video frames and query words, NLVL needs to predict the start and end timestamps $(t^s, t^e)$ of the video segment described by the language query $Q$. For each video, we extract its visual features $V = \{v_t\}_{t=1}^{T}$ with a pre-trained 3D ConvNet. For each query, we initialize the word features $Q = \{w_m\}_{m=1}^{M}$ using GloVe embeddings (Pennington et al., 2014).
The overall architecture of the proposed LPNet is shown in Figure 2. In this section, we first introduce each component of LPNet in turn, and then describe the training and inference stages in detail.

Feature Encoder
Embedding Encoder Layer. Following previous works (Lu et al., 2019), we use a feature encoder similar to that of QANet. The embedding encoder layer consists of multiple components, as shown on the left of Figure 3. The inputs of this layer are the visual features $V \in \mathbb{R}^{T \times d_v}$ and the text features $Q \in \mathbb{R}^{M \times d_q}$. We project them into the same dimension and feed them into the embedding encoder layer respectively to integrate contextual information. The outputs of this layer, $V' \in \mathbb{R}^{T \times d}$ and $Q' \in \mathbb{R}^{M \times d}$, are refined visual and text features that encode the interactions within each modality.

Cross-Modal Attention Layer. This layer calculates vision-to-language and language-to-vision attention weights, and then encodes the multi-modal feature. As shown on the right of Figure 3, it first computes a similarity matrix $S \in \mathbb{R}^{T \times M}$, where the element $S_{ij}$ indicates the similarity between frame $f_i$ and word $w_j$. Two attention weights $A$ and $B$ are then computed from $S_{row}$ and $S_{col}$, the row-wise and column-wise normalizations of $S$. Finally, we model the interaction between the video and the query with element-wise multiplication $\odot$ and concatenation $[\cdot]$, followed by a feed-forward layer (FFN). The output of this layer, $V^q$, encodes the visual features with query-guided attention.
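The attention equations are not written out above; as a reference, the following is the standard QANet-style context-query attention that the description matches, stated in our notation (whether LPNet uses exactly this similarity and fusion form is an assumption):

$$A = S_{row}\, Q' \in \mathbb{R}^{T \times d}, \qquad B = S_{row}\, S_{col}^{\top}\, V' \in \mathbb{R}^{T \times d},$$
$$V^{q} = \mathrm{FFN}\big(\,[\,V';\ A;\ V' \odot A;\ V' \odot B\,]\,\big) \in \mathbb{R}^{T \times d}.$$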

Propose-and-Rank Module
Learnable Moment Proposals. Different from previous manually-designed temporal anchors, our proposal boxes are learnable during training. We define the number of proposals as $N$. The proposal boxes are represented by 2-d parameters ranging from 0 to 1, randomly initialized, which denote the normalized center coordinates and lengths. These parameters ($N \times 2$ in total) are dynamically adjusted through back-propagation. In order to model the implicit relations among proposals, following (Sun et al., 2021), we attach proposal features $P \in \mathbb{R}^{N \times d}$ to the proposal boxes. A multi-head self-attention (MHSA) mechanism is applied to the proposal features to reason about interactions among proposals: $P' = \mathrm{MHSA}(P)$.

Interactive Rating Layer. Given the $N$ moment proposal boxes for video $V$, we capture the candidate features $C$ from the visual feature $V^q$ in Eq. (2).
The generated video segment candidates have different lengths in the temporal dimension, hence we transform the candidate features into an identical length using temporal RoIAlign, yielding $C_i \in \mathbb{R}^{l \times d}$ for the $i$-th candidate. We then interact the candidate feature with its corresponding proposal feature $P_i \in \mathbb{R}^{d}$ to encode richer information following (Sun et al., 2021), where $W_p \in \mathbb{R}^{d \times d}$ and $W_c \in \mathbb{R}^{ld \times d}$ are learnable weights and Flatten is an operation that flattens a matrix into a one-dimensional vector. We also obtain the sentence-level query feature $Q \in \mathbb{R}^{d}$ by weighted pooling over the word-level features $Q'$. We then fuse them by concatenation $[\cdot]$ followed by a feed-forward layer. Taking the multi-modal interactive features $\{F_i\}$ as input, the rating layer predicts the matching score between each video segment candidate and the language query. It consists of two feed-forward layers followed by ReLU and Sigmoid activations respectively, and outputs $\hat{s}_i$, the predicted matching score of the $i$-th candidate. We argue that the matching scores should be positively associated with the temporal IoU scores between candidates and the ground-truth moment. Therefore, we use the IoU scores as ground-truth labels to supervise training. As a result, the matching score rating problem turns into an IoU regression problem.
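For concreteness, a minimal NumPy sketch of this propose-and-rank pipeline (temporal RoIAlign, proposal-candidate interaction, and scoring) is given below. It is not the authors' implementation: the additive interaction, the weights `W_c`, `W_p`, `W_f`, `w_out`, and the sentence feature `q_sent` are simplified assumptions of how the described steps could be wired together.

```python
# Hedged sketch of the interactive rating layer (not the authors' code).
# Assumed shapes: V_q (T, d) query-guided frame features; boxes (N, 2) with
# normalized (center, length); P (N, d) proposal features; q_sent (d,) sentence feature.
import numpy as np

def temporal_roi_align(V_q, box, l=16):
    """Sample l frame features inside one proposal by linear interpolation."""
    T = V_q.shape[0]
    center, length = box
    start = np.clip(center - length / 2.0, 0.0, 1.0) * (T - 1)
    end = np.clip(center + length / 2.0, 0.0, 1.0) * (T - 1)
    pos = np.linspace(start, end, l)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = (pos - lo)[:, None]
    return (1.0 - frac) * V_q[lo] + frac * V_q[hi]            # (l, d)

def score_candidates(V_q, boxes, P, W_c, W_p, W_f, w_out, q_sent, l=16):
    """Return one matching score in [0, 1] per proposal (simplified interaction)."""
    scores = []
    for box, p in zip(boxes, P):
        C_i = temporal_roi_align(V_q, box, l)                 # (l, d) candidate feature
        c = C_i.reshape(-1) @ W_c + p @ W_p                   # assumed additive interaction
        F_i = np.concatenate([c, q_sent]) @ W_f               # fuse with the sentence query
        hidden = np.maximum(F_i, 0.0)                         # FFN + ReLU
        scores.append(1.0 / (1.0 + np.exp(-(hidden @ w_out))))  # FFN + Sigmoid
    return np.array(scores)                                   # (N,) predicted IoU scores
```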

Dynamic Adjustor
The dynamic update of our learnable proposals is performed by the dynamic adjustor. For the $N$ candidates generated above, we obtain $N$ matching scores from the previous layer. The update strategy is that the model only adjusts the proposal with the largest score for each incoming sample, i.e., the most certain update. Through multiple iterations, the learned proposals come to statistically represent the real distribution of the dataset. We adopt a temporal IoU loss to achieve this goal, where $\hat{b}_i$ is the bounding box of the best-matching candidate and $b_i$ is the ground-truth coordinates.
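As an illustration, the adjustment can be written as the common $1 - \mathrm{IoU}$ loss applied only to the top-scoring proposal; the sketch below assumes that form, since the exact expression of Eq. (7) is not reproduced above.

```python
import numpy as np

def temporal_iou(box_a, box_b):
    """IoU of two 1-D segments given as normalized (center, length)."""
    s_a, e_a = box_a[0] - box_a[1] / 2.0, box_a[0] + box_a[1] / 2.0
    s_b, e_b = box_b[0] - box_b[1] / 2.0, box_b[0] + box_b[1] / 2.0
    inter = max(0.0, min(e_a, e_b) - max(s_a, s_b))
    union = (e_a - s_a) + (e_b - s_b) - inter
    return inter / union if union > 0 else 0.0

def iou_loss_most_certain(boxes, scores, gt_box):
    """Adjust only the highest-scoring proposal: assumed loss = 1 - IoU(best box, GT)."""
    best = int(np.argmax(scores))
    return 1.0 - temporal_iou(boxes[best], gt_box), best
```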

Boundary-aware Predictor
We adopt a boundary-aware predictor to further boost performance. LPNet is still essentially a propose-and-rank approach, which makes it hard to model boundary information. By depicting the video as a series of segments, propose-and-rank methods break down the natural structure of videos and thus cause sub-optimal results. Instead of adding an additional module to explicitly incorporate boundary information, we argue that merely utilizing a boundary-aware loss can significantly improve model performance. The boundary-aware predictor takes the frame-level multi-modal feature as input. A bidirectional LSTM and two feed-forward layers are used to predict the distributions of the start and end timestamps, where $H^s$ and $H^e$ are the hidden states of $\mathrm{LSTM}_s$ and $\mathrm{LSTM}_e$, respectively, and $L^s$ and $L^e$ denote the logits of the start and end boundaries computed by the two feed-forward layers. In order to avoid introducing the noise caused by label uncertainty, following (Opazo et al., 2020), we relax the ground-truth labels near the start and end points and adopt the KL divergence to fit the distributions.
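A rough NumPy sketch of this step follows. The label-relaxation radius and the exact relaxation scheme here are hypothetical simplifications; the paper follows Opazo et al. (2020) for this part.

```python
import numpy as np

def relaxed_label(T, boundary_idx, radius=2):
    """Spread the ground-truth mass over frames near the annotated boundary
    (the radius is a hypothetical choice, not the paper's setting)."""
    y = np.zeros(T)
    lo, hi = max(0, boundary_idx - radius), min(T, boundary_idx + radius + 1)
    y[lo:hi] = 1.0
    return y / y.sum()

def kl_boundary_loss(logits, target, eps=1e-12):
    """KL(target || softmax(logits)) for one boundary distribution."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(np.sum(target * (np.log(target + eps) - np.log(p + eps))))
```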

Training and Inference
Training Objectives. Each training sample consists of an untrimmed video, a language query, and the corresponding ground-truth video segment. For each segment candidate, we compute the temporal IoU between the candidate and the ground-truth segment as the matching score. For each video frame, two class labels are assigned indicating whether or not the frame is the start or end boundary. We use soft labels to avoid label uncertainty. There are two loss functions, one for the propose-and-rank module and one for the boundary-aware predictor. The matching regression loss uses $f_{\mathrm{MSE}}$, an L2 loss function, with $s^{\mathrm{IoU}}$ denoting the ground-truth temporal IoU scores. The boundary-aware loss uses $D_{KL}$, the Kullback-Leibler divergence, where $Y^s$ and $Y^e$ are the ground-truth relaxed labels for the start and end boundaries, respectively, and $P^s$ and $P^e$ are obtained from $L^s$ and $L^e$ via SoftMax. The final loss is a multi-task loss combining $\mathcal{L}_{KL}$ and $\mathcal{L}_{reg}$, where $\lambda$ is a weight that balances the two losses (a plausible written-out form is sketched below). During training, we directly use the ground-truth matching score to decide which proposal to adjust with $\mathcal{L}_{IoU}$ in Eq. (7). Specifically, $\mathcal{L}_{IoU}$ is used to update the parameters of the proposal boxes, while $\mathcal{L}$ is used to update the rest of the network.

Inference. Given a video, a language query, and the set of learned proposal boxes, we forward them through the network and obtain $N$ segment candidates with their corresponding matching scores $\hat{s}$. We then rank $\hat{s}$ and select the candidate with the highest score as the final result.
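The loss equations referenced above are not written out; one plausible form consistent with the description is the following, where which term $\lambda$ scales and the argument order of the KL divergence are assumptions:

$$\mathcal{L}_{reg} = \frac{1}{N}\sum_{i=1}^{N}\big(\hat{s}_i - s_i^{\mathrm{IoU}}\big)^2, \qquad \mathcal{L}_{KL} = D_{KL}\big(Y^{s}\,\|\,P^{s}\big) + D_{KL}\big(Y^{e}\,\|\,P^{e}\big),$$
$$\mathcal{L} = \mathcal{L}_{KL} + \lambda\,\mathcal{L}_{reg}.$$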

Datasets
We evaluate our LPNet on two public benchmarks: 1) Charades-STA (Gao et al., 2017): It is built on Charades and contains 6,672 videos of daily indoor activities. Charades-STA contains 16,128 sentence-moment pairs in total, where 12,408 pairs are for training and 3,720 pairs for testing. The average duration of the videos is 30.59s and the average duration of the video segments is 8.22s.
2) ActivityNet Captions (Krishna et al., 2017): It contains around 20k open-domain videos for the video grounding task. We follow the commonly used split, which consists of 37,421 sentence-moment pairs for training and 17,505 for testing. The average duration of the videos is 117.61s and the average length of the video segments is 36.18s.

Evaluation Metrics
Following prior works, we adopt "R@n, IoU=θ" and "mIoU" as evaluation metrics. Specifically, "R@n, IoU=θ" represents the percentage of testing samples for which at least one of the top-n results has an IoU with the ground truth larger than θ. "mIoU" is the average IoU with the ground truth over all testing samples. In our experiments, we set n = 1 and θ ∈ {0.3, 0.5, 0.7}.
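For concreteness, with n = 1 the two metrics reduce to computing the IoU of the single top-ranked prediction against the ground truth, as in the sketch below (whether the threshold comparison is strict is a minor convention):

```python
import numpy as np

def evaluate_top1(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """preds, gts: lists of (start, end) in seconds for each test sample.
    Returns "R@1, IoU=theta" for each threshold and the mean IoU ("mIoU")."""
    ious = []
    for (ps, pe), (gs, ge) in zip(preds, gts):
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = (pe - ps) + (ge - gs) - inter
        ious.append(inter / union if union > 0 else 0.0)
    ious = np.asarray(ious)
    recall = {theta: float((ious >= theta).mean()) for theta in thresholds}
    return recall, float(ious.mean())
```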

Implementation
We down-sample frames for each video and extract visual features using the C3D (Tran et al., 2015) network pretrained on Sports-1M. Then we reduce the features to 500 dimensions by PCA. For language queries, we initialize each word with a 300-d GloVe vector, and all word embeddings are fixed during training. The dimension of the intermediate layers in LPNet is set to 256. The number of convolution blocks in the embedding encoder is 4 and the kernel size is set to 7. The temporal RoIAlign length l is set to 16. To avoid elaborate design, the number of learnable proposals is uniformly set to 300 on both datasets. For both datasets, we train the model for 100 epochs. Dropout and early stopping are adopted to prevent overfitting. We implement LPNet in TensorFlow. The λ in Eq. (12) is set to 100. The whole framework is trained with the Adam optimizer with a learning rate of 0.0001.
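For convenience, the settings stated above can be gathered into a single configuration (a summary of the stated hyperparameters, not the released configuration file):

```python
# Summary of the implementation details stated above (not the released config).
config = {
    "video_feature": "C3D pretrained on Sports-1M, PCA-reduced to 500-d",
    "word_embedding": "GloVe 300-d, frozen during training",
    "hidden_dim": 256,
    "encoder_conv_blocks": 4,
    "encoder_kernel_size": 7,
    "roi_align_length": 16,
    "num_learnable_proposals": 300,
    "epochs": 100,
    "lambda_loss_weight": 100,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
}
```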

Comparisons with the State-of-the-Arts
Settings. We compare the proposed LPNet with several state-of-the-art NLVL methods on the two datasets. These methods are grouped from the viewpoint of propose-and-rank versus proposal-free approaches; the propose-and-rank baselines include CTRL, among others.

The results on the two benchmarks are reported in Table 1 and Table 2. We can observe that our LPNet achieves new state-of-the-art performance over almost all metrics and benchmarks. Table 1 summarizes the results on Charades-STA. For a fair comparison with methods using different features, we use both C3D and I3D (Carreira and Zisserman, 2017) video features, and we report VSLNet with C3D features (500-d by PCA) from BPNet. We observe that LPNet works well even under stricter metrics, e.g., LPNet achieves a significant 2.28-point absolute improvement at IoU=0.7 over the second-best result with I3D features, which demonstrates the effectiveness of our model. Notably, DEBUG and VSLNet utilize a backbone similar to ours, adopted from QANet. DEBUG is a regression-based method and VSLNet is a classification-based method, both belonging to the proposal-free approach. The results show that our model not only surpasses a multitude of propose-and-rank methods by a large margin, but also exceeds these proposal-free methods.

Table 1: Performance (%) of "R@1, IoU=θ" and "mIoU" compared with the state-of-the-art NLVL models on Charades-STA.

Figure 4: The distribution of learnable proposals during the training process, which gradually gets closer to the ground-truth distribution of the samples. Horizontal and vertical axes represent the normalized center coordinate and half length of the proposals. We initialized the maximum length of proposals on Charades-STA as 0.5 according to its characteristics.

Table 2 summarizes the results with C3D features on ActivityNet Captions, whose videos are longer on average. Our model outperforms almost all other methods. Compared with 2D-TAN, our LPNet achieves a significant 4.48-point absolute improvement at IoU=0.3 but is slightly lower at IoU=0.7. This may be because 2D-TAN enumerates many more candidates. The qualitative results of LPNet are illustrated in Figure 4; we can observe that LPNet produces precise query-related moments.

Ablation Studies
In this section, we conduct ablative experiments with different variants to better investigate our approach.
Number of Learnable Proposals. The number of proposals is a key factor in propose-and-rank models. We change the number of proposals of our model on the Charades-STA dataset and show its impact on performance in Table 3. The results show that LPNet is able to achieve impressive performance with only a small number of proposals. It should be noted that we simply place 300 learnable proposals on both datasets in Table 1 and Table 2 to avoid artificial design. However, a smaller number (100) of proposals achieves better results on Charades-STA.

With vs. Without Boundary-aware Loss. From Table 4, we find that there is a huge improvement when the boundary-aware loss is applied. The main reason is that the KL-divergence loss utilizes frame-level information to regularize the training of the model and forces the model to consider the video as a whole.
With vs. Without Multi-head Self-attention. Comparing the first two rows in Table 4, we observe that applying the multi-head self-attention mechanism to the proposal features improves performance. This operation successfully learns the latent relations among the proposals, which are helpful for the localization task. However, when the boundary-aware loss has already been applied (last two rows in Table 4), the results are very close. This may indicate that the boundary-aware loss makes a similar kind of contribution to the model.

Conclusions
In this paper, we present a novel propose-and-rank model with learnable moment proposals for NLVL. Compared to existing propose-and-rank methods with predefined temporal boxes, our model improves performance significantly because 1) it disengages from hand-designed rules for bounding boxes and can thus produce more accurate temporal boundaries; 2) the sparsely sampled candidates relieve the pressure on the subsequent ranking process; and 3) the boundary-aware loss regularizes the model to avoid sub-optimal solutions. In the future, we plan to explore more effective ways to learn better proposals and extend this idea to other tasks.