Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

Abstract
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationships among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.

Introduction
Temporal sentence grounding (TSG) is an important yet challenging task in natural language processing, which has drawn increasing attention over the last few years due to its vast potential applications in information retrieval (Dong et al., 2019; Yang et al., 2020) and human-computer interaction (Singha et al., 2018). It aims to ground the most relevant video segment according to a given sentence query. As shown in Figure 1 (a), video and query information need to be deeply incorporated to distinguish the fine-grained details of adjacent frames for determining accurate boundary timestamps.

* Equal contributions. † Corresponding author.

Figure 1: (a) An example of the temporal sentence grounding task. (b) All existing TSG methods generally utilize a downsampling process to evenly extract a fixed number of frames from a long video. However, the new target segment is obtained by a rounding operation and may introduce boundary bias since some original boundary frames are lost. (c) We propose a siamese sampling strategy that extracts additional adjacent frames to enrich and refine the information of the sampled frames, generating a more accurate boundary for the new segment.
Previous TSG methods (Gao et al., 2017; Chen et al., 2018; Zhang et al., 2019a,b, 2020b; Liu et al., 2018a,b, 2021a) generally follow an encoding-then-interaction framework that first extracts both video and query features and then conducts multi-modal interactions for reasoning. Since many videos are overlong while the corresponding target segments are short, these methods simply utilize a sparse sampling strategy, shown in Figure 1 (b), which samples a fixed number of frames from each video to reconstruct a shorter video, and then learn frame-query relations for segment inference. We argue that this learning paradigm suffers from two obvious limitations: 1) Boundary-bias: Each video has a query-related segment, which refers to two specific frames as its start and end timestamps. The traditional sparse downsampling strategy extracts frames from videos at a fixed interval. A rounding operation is then applied to map the annotated segment to the sampled frames by keeping the same proportional length in both the original and new videos. As a result, the ground-truth boundary frames may be filtered out and query-irrelevant frames will be regarded as the actual boundaries, generating wrong labels for subsequent training. 2) Reasoning-bias: The query-irrelevant boundary frames in the newly reconstructed segment also lead to incorrect frame-query interaction and reasoning in the training process, reducing the generalization ability of the model.
To alleviate these two issues, a straightforward idea is to filter out the sampled boundary frames in the new segment if they are query-irrelevant. However, this destroys the true segment length when we transfer the downsampled segment back to the original one during inference. Another straightforward idea is to directly keep the appropriate segment length (as float values) in the newly reconstructed video and then reason about the query content in the new boundary to determine what percentage of this boundary is correct. However, the query-irrelevant boundaries lack sufficient query-related information for boundary reasoning. Based on the above considerations, we aim to extract additional frames adjacent to the sampled frames to enrich and refine their information, supplementing the consecutive visual semantics. In this way, the new boundary frames are semantically well correlated with their original adjacent boundaries. Based on the refined boundary frames, we can keep and learn the appropriate segment length of the downsampled video for query reasoning. Moreover, the other inner frames are also enriched by their neighbors, capturing more consecutive visual appearances for fully understanding the entire activity. Therefore, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for the temporal sentence grounding task to generate additional contextual frames that enrich and refine the new boundaries. Specifically, we treat the sparsely sampled video frames as anchor frames, and additionally extract several frames adjacent to each anchor frame as siamese frames for semantic sharing and enriching. A siamese knowledge aggregation module is designed to explore internal relationships and aggregate contextual information among these frames. Then, a siamese reasoning module supplements the fine-grained contexts of the siamese frames into the anchor frames to enrich their semantics.
In this way, the query-related information is added into the new boundaries, and thus we can utilize an appropriate float value to represent the new segment length for query reasoning, addressing both boundary- and reasoning-bias. Moreover, the other sampled frames are also equipped with more consecutive visual semantics from their original neighbors, which further benefits a more fine-grained learning process.
Our contributions are summarized as follows: • We propose a novel SSRN model that sparsely extracts multiple relevant frames from the original video to enrich the anchor frames for more accurate boundary prediction. To the best of our knowledge, we are the first to identify and address both boundary-bias and reasoning-bias in the TSG task.
• We propose an effective siamese aggregation and reasoning method to correlate and integrate the contextual information of siamese frames to refine the anchor frames.
• Extensive experiments are conducted on three challenging public benchmarks, including ActivityNet Captions, TACoS, and Charades-STA, demonstrating the effectiveness of our proposed SSRN method.

Related Work
Temporal sentence grounding (TSG) is a recently introduced task (Gao et al., 2017; Anne Hendricks et al., 2017) that aims to localize the most relevant video segment in a video given a sentence description. All existing methods follow an encoding-then-interaction framework that first extracts video/query features and then conducts multi-modal interactions for segment inference.
Based on the interacted multi-modal features, traditional methods follow a propose-and-rank paradigm to make predictions. Most of them (Anne Hendricks et al., 2017; Liu et al., 2018a,b; Chen et al., 2018; Ge et al., 2019; Zhang et al., 2019a; Qu et al., 2020; Liu et al., 2020a, 2021a,c) typically utilize a proposal-based grounding head that first generates multiple candidate segments as proposals, and then ranks them according to their similarity with the query semantics to select the best matching one. Some of them (Gao et al., 2017; Anne Hendricks et al., 2017) directly utilize multi-scale sliding windows to produce the proposals and subsequently integrate the query with segment representations via a matrix operation. To improve the quality of the proposals, the latest works (Zhang et al., 2019b; Cao et al., 2021; Liu et al., 2020b, 2021b, 2022b) integrate sentence information with each fine-grained video clip unit, and predict the scores of candidate segments by gradually merging the fused feature sequence over time.
Recently, some proposal-free works (Yuan et al., 2019b; Rodriguez et al., 2020; Mun et al., 2020; Zeng et al., 2020; Zhang et al., 2020a; Nan et al., 2021) directly predict the temporal locations of the target segment without generating complex proposals. These works directly select the starting and ending frames by leveraging cross-modal interactions between video and query. Specifically, they either regress the start/end timestamps based on the entire video representation (Yuan et al., 2019b; Mun et al., 2020), or predict at each frame whether it is a start or end boundary (Rodriguez et al., 2020; Zeng et al., 2020; Zhang et al., 2020a).
Although the above two types of methods have achieved great performance, their video sampling strategy in the encoding part is flawed and can lead to both boundary and reasoning bias. Specifically, the boundary bias is defined as the incorrect boundary of the new segment reconstructed by sparse video sampling. The reasoning bias is defined as the incorrect correlation learning between the query-irrelevant frames and the query. In this paper, we aim to reduce the above biases by proposing a new siamese sampling and reasoning strategy to enrich the sampled frames and further refine the reconstructed segment boundary.

The Proposed Method
Given an untrimmed video and a sentence query, we represent the video as V with T frames. Similarly, the query with N words is denoted as Q. Temporal sentence grounding (TSG) aims to localize a segment (τ_s, τ_e) in video V, starting at timestamp τ_s and ending at timestamp τ_e, which corresponds to the same semantics as query Q.
The overall architecture of the proposed Siamese Sampling and Reasoning Network (SSRN) is illustrated in Figure 2. The SSRN framework contains four main components: (1) Siamese sampling and encoding: We sparsely downsample each long video into anchor frames, and a new siamese sampling strategy additionally samples their adjacent frames as siamese frames. A video/query encoder then extracts visual/query features from all sampled video frames and the query sentence, respectively. (2) Multi-modal interaction: After that, we interact the query features with the visual features for cross-modal interaction. (3) Multi-modal reasoning: Next, to supplement the knowledge of the siamese frames into the anchor frames, a siamese knowledge aggregation module is developed to determine how much information from the siamese frames should be injected into the anchor ones. Then, a reasoning module is utilized to enrich the anchor frames with the aggregated semantic knowledge. In this way, the contexts of both the new boundaries and the other sparse frames are enriched and can better represent the full and consecutive visual semantics. (4) Grounding heads with soft labels: At last, we employ grounding heads with soft labels to predict more accurate boundaries via float values that keep the appropriate segment length. We illustrate the details of each component in the following subsections.

Siamese Sampling and Encoding
Given the dense video input V, previous works generally downsample each video into a new video of fixed length to address the problem of overlong videos. Considering the existing boundary-bias, we propose a siamese sampling strategy to additionally extract contextual adjacent frames near each sampled frame to enrich its query-related information for better determining the accurate new boundary. Here, we call the downsampled frames and their contextual frames anchor frames and siamese frames, respectively. Specifically, as shown in Figure 1 (c), following previous works, we directly construct the anchor video V_a by sparsely and evenly sampling M frames from the dense video of length T (T is usually much greater than M). The siamese videos are then captured at different beginning indices in the original video, next to the frames of the anchor video. The same sampling interval is used for all frames.

Figure 2: Given a dense video, the anchor frames and siamese frames are first extracted by sparse sampling and siamese sampling, respectively. Then a video/query encoder and a multi-modal interaction module are utilized to generate multi-modal features. Next, a siamese knowledge aggregation module is proposed to model the contextual relationship between anchor frames and siamese frames from the same video. After that, the siamese knowledge reasoning module exploits the siamese knowledge to enrich the information of the anchor frames for more accurate boundary prediction. At last, in the grounding heads, we utilize a soft label to learn more fine-grained float-valued boundaries in addition to the rounded ones.
After siamese sampling, we can obtain multiple siamese videos with the same length and similar global semantics as the anchor video. We denote the new siamese videos as {V_{s,k}}_{k=1}^{K}, where K is the number of siamese samples.
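As an illustration, the anchor/siamese index computation described above can be sketched as follows. This is a minimal sketch: the paper does not spell out the exact offset scheme, so shifting each siamese video by k frames (k = 1..K) is our assumption.

```python
import numpy as np

def siamese_sample_indices(T, M, K):
    """Sketch of siamese sampling: M evenly spaced anchor indices from a
    T-frame video, plus K siamese videos whose frames sit next to the
    anchor frames (shifted by k frames, k = 1..K; this offset scheme is
    an assumption for illustration)."""
    interval = T / M  # shared sampling interval for all videos
    anchor = np.floor(np.arange(M) * interval).astype(int)
    # Each siamese video reuses the anchor indices shifted by k frames,
    # clipped to stay inside the original video.
    siamese = [np.clip(anchor + k, 0, T - 1) for k in range(1, K + 1)]
    return anchor, siamese
```

With T = 1000, M = 200, and K = 4, each siamese video is a shifted copy of the anchor index sequence, so every anchor frame gains K adjacent neighbors sharing similar global semantics.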
Since we utilize the sampling strategy to process the dense video frames, the start/end time of the target segment in the original video sequence needs to be accurately mapped to the corresponding boundaries in the new video sequence of M frames. Following almost all previous TSG methods (Zhang et al., 2019b, 2020a; Liu et al., 2021a), the new start/end index is generally calculated by τ̂_{s(e)} = round(τ_{s(e)}/T × M), where round(·) denotes the rounding operator. During inference, the predicted segment boundary index can easily be converted to the corresponding time in the dense video via τ_{s(e)} = τ̂_{s(e)}/M × T. However, the rounding operation may produce boundary bias: the new boundary frames may not be semantically correlated with the query. Therefore, we further generate a soft label τ̃_{s(e)} = τ_{s(e)}/T × M, which keeps the float result, as additional supervision to preserve the appropriate segment length during training.

Video encoder. For video encoding, we first extract frame features with a pre-trained C3D network (Tran et al., 2015), and then add a positional encoding (Vaswani et al., 2017) to provide positional knowledge. Such positional encoding plays a crucial role in distinguishing semantics at diverse temporal locations. Considering the sequential characteristics of videos, a Bi-GRU (Chung et al., 2014) is further applied to incorporate contextual information along the time series. We denote the extracted video features of the anchor video and siamese videos as V_a, {V_{s,k}}_{k=1}^{K} ∈ R^{M×D}, respectively.

Query encoder. For query encoding, we first extract word embeddings with the GloVe model (Pennington et al., 2014). We also apply positional encoding and a Bi-GRU to integrate the sequential information within the sentence. The final query feature is denoted as Q ∈ R^{N×D}.
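The coarse (rounded) and soft (float) boundary mapping described above can be sketched as follows. Note that `round` here stands in for the paper's rounding operator, whose exact tie-breaking behavior is not specified.

```python
def map_boundary(tau, T, M):
    """Map a boundary timestamp tau from a T-frame video onto a sampled
    sequence of M frames: the rounded index is the conventional coarse
    label (subject to boundary-bias), while the float value is the soft
    label that preserves the true proportional position."""
    coarse = round(tau / T * M)  # conventional rounded label
    soft = tau / T * M           # soft label keeps the float result
    return coarse, soft
```

For example, a boundary at frame 7 of a 100-frame video mapped onto M = 20 sampled frames gives a coarse label of 1 but a soft label of 1.4, so the rounded label silently discards 40% of a sampling interval.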

Multi-Modal Interaction
After obtaining the video features V_a, {V_{s,k}}_{k=1}^{K} and query feature Q, we utilize a co-attention mechanism (Lu et al., 2019) to capture the cross-modal interactions between them. Specifically, for each video feature V ∈ {V_a} ∪ {V_{s,k}}_{k=1}^{K}, we first calculate the similarity between V and Q as:

S = V W_S Q^⊤ ∈ R^{M×N},

where W_S ∈ R^{D×D} projects the query features into the same latent space as the video. Then, we compute two attention weights as:

A = S_r Q,   B = S_r S_c^⊤ V,

where S_r and S_c are the row- and column-wise softmax results of S, respectively. We compose the final query-guided video representation by learning its sequential features as follows:

F = Bi-GRU([V; A; V ⊙ A; V ⊙ B]),

where Bi-GRU(·) denotes the Bi-GRU layers, [;] is the concatenation operation, and ⊙ is the element-wise multiplication. The output F ∈ {F_a} ∪ {F_{s,k}}_{k=1}^{K} encodes visual features with query-guided attention.
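A minimal numpy sketch of this co-attention step, assuming the standard context-query attention form (the fused sequence [V; A; V⊙A; V⊙B] would be fed to a Bi-GRU, which we omit here):

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(V, Q, W_S):
    """Sketch of the frame-query co-attention.
    V: (M, D) video features; Q: (N, D) query features; W_S: (D, D).
    Returns the fused (M, 4D) sequence (Bi-GRU omitted in this sketch)."""
    S = V @ W_S @ Q.T          # (M, N) frame-word similarity
    S_r = softmax(S, axis=1)   # row-wise softmax over query words
    S_c = softmax(S, axis=0)   # column-wise softmax over frames
    A = S_r @ Q                # (M, D) query-to-video attention
    B = S_r @ S_c.T @ V        # (M, D) video-to-video attention via query
    return np.concatenate([V, A, V * A, V * B], axis=-1)
```

Each of the M frames ends up with a 4D-dimensional fused feature combining its own appearance with query-conditioned context.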

Multi-Modal Reasoning Strategy
Note that the query-irrelevant new boundary frames encoded in the anchor video feature F_a have insufficient query-guided visual information for later boundary prediction. To address this issue, we propose a new multi-modal reasoning strategy to enrich the query-related knowledge in the anchor features F_a by referring to the contextual information in the siamese features {F_{s,k}}_{k=1}^{K}. In detail, the multi-modal reasoning strategy consists of two components: a siamese knowledge aggregation module and a siamese knowledge reasoning module.

Siamese knowledge aggregation. Intuitively, features with close visual-query correlation are expected to generate more consistent predictions of segment probabilities. To this end, we utilize a siamese knowledge aggregation module to generate interdependent knowledge from the siamese features to the anchor ones, enriching the contexts of the anchor features and refining the prediction.
We propose to propagate and integrate knowledge between the query-guided visual features F_a and {F_{s,k}}_{k=1}^{K}. Specifically, we first obtain their semantic similarities by calculating their pairwise cosine similarity scores as:

C_{i,k} = (F_a^i · F_{s,k}^i) / (‖F_a^i‖_2 ‖F_{s,k}^i‖_2),

where C ∈ R^{M×K} is the interdependent similarity matrix, ‖·‖_2 is the l_2-norm, i ∈ {1, 2, ..., M} indexes the features, and k ∈ {1, 2, ..., K} indexes the siamese videos. Here, each anchor frame needs to be enriched by only its own siamese frames. We therefore employ a softmax function on each row of the similarity matrix C as:

C̃_{i,k} = exp(C_{i,k}) / Σ_{k'=1}^{K} exp(C_{i,k'}),

where the new matrix C̃ indicates the contextual affinities between each anchor feature and its corresponding siamese features.
Siamese knowledge reasoning. After that, we propose to adaptively propagate and merge the siamese knowledge into the anchor features to enrich the query-aware information. This is especially helpful when determining more accurate boundaries for the downsampled video. Specifically, the integration process can be formulated as:

F̃_a^i = Σ_{k=1}^{K} C̃_{i,k} F_{s,k}^i,

where F̃_a is the propagated semantic vector in the anchor video. In order to avoid over-propagation and the involvement of irrelevant noisy information, we further exploit a residual design with a learnable weight to enrich the anchor video as:

F̄_a = W_1 F_a + α W_2 F̃_a,

where W_1, W_2 ∈ R^{D×D} are projection matrices and the weighting factor α ∈ [0, 1] is a hyper-parameter. With the above formulations, the knowledge of the siamese samples within the same video can be propagated and integrated into the anchor one.
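The aggregation and reasoning steps can be sketched together as follows. The final residual fusion with projections W_1, W_2 and weight α is our reading of the text, since the exact equation is not reproduced in this copy.

```python
import numpy as np

def siamese_reason(F_a, F_s, W1, W2, alpha=0.5):
    """Sketch of siamese knowledge aggregation + reasoning.
    F_a: (M, D) anchor features; F_s: (K, M, D) siamese features;
    W1, W2: (D, D) projections; alpha in [0, 1].
    The residual fusion at the end is an assumed form for illustration."""
    # Per-position cosine similarity between each anchor frame and its
    # K siamese counterparts: (M, K)
    num = np.einsum('md,kmd->mk', F_a, F_s)
    den = (np.linalg.norm(F_a, axis=-1, keepdims=True)
           * np.linalg.norm(F_s, axis=-1).T + 1e-8)
    C = num / den
    # Row-wise softmax: affinity of each anchor frame to its siamese frames.
    C = np.exp(C) / np.exp(C).sum(axis=1, keepdims=True)
    # Aggregate siamese knowledge weighted by affinity: (M, D)
    F_agg = np.einsum('mk,kmd->md', C, F_s)
    # Residual fusion with weighting factor alpha (assumed form).
    return F_a @ W1 + alpha * (F_agg @ W2)
```

Each anchor frame thus absorbs a convex combination of its own siamese neighbors, with α controlling how much siamese knowledge is injected.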

Grounding Heads with Soft Label
For the final segment boundary prediction, we first follow the span predictor in (Zhang et al., 2020a) and utilize two stacked LSTMs with two corresponding feed-forward layers to predict the start/end scores of each frame. In detail, we feed the contextual multi-modal feature F_a ∈ R^{M×D} into this span predictor and apply the softmax function on its two outputs to produce the probability distributions P_s, P_e ∈ R^M of the start and end boundaries. We utilize the rounded boundaries τ̂_{s(e)} to generate the coarse label vectors Y_{s(e)} and supervise P_s, P_e as:

L_span = f_CE(P_s, Y_s) + f_CE(P_e, Y_e),

where f_CE represents the cross-entropy loss function. The predicted timestamps (τ̂_s, τ̂_e) are obtained from the maximum joint score of the start and end predictions:

(τ̂_s, τ̂_e) = arg max_{τ̂_s, τ̂_e} P_s(τ̂_s) P_e(τ̂_e),   (9)

where 0 ≤ τ̂_s ≤ τ̂_e ≤ M.
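The constrained argmax above — the highest-scoring (start, end) pair subject to start ≤ end — can be sketched as:

```python
import numpy as np

def predict_span(P_s, P_e):
    """Pick (start, end) maximizing P_s[s] * P_e[e] subject to s <= e."""
    score = np.outer(P_s, P_e)  # joint score of every (s, e) pair
    score = np.triu(score)      # zero out invalid pairs with s > e
    s, e = np.unravel_index(score.argmax(), score.shape)
    return int(s), int(e)
```

For instance, with P_s = [0.1, 0.7, 0.2] and P_e = [0.2, 0.1, 0.7], the best valid pair is (1, 2), even though frame 2 alone has a low start probability.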
Since the above predictions are coarse on the segment boundaries and suffer from boundary-bias, we further utilize a parallel prediction head on F_a to predict more fine-grained float boundaries on the downsampled boundary frames. Specifically, we utilize the float boundaries τ̃_{s(e)} to generate the soft labels Ỹ_{s(e)}, and F_a is fed into a single feed-forward layer to predict the float boundaries O_{s(e)}, supervised by our designed soft labels as follows:

L_soft = R_1(O_s, Ỹ_s) + R_1(O_e, Ỹ_e),

where R_1 is the smooth L1 loss. The final predicted segment is calculated by combining the coarse and float predictions.

Evaluation metrics. We adopt "R@n, IoU=m" as our evaluation metric, defined as the percentage of queries for which at least one of the top-n selected moments has an IoU larger than m with the ground truth; higher is better.

Implementation Details
For video encoding, we apply C3D (Tran et al., 2015) to encode the videos on all three datasets, and also extract I3D (Carreira and Zisserman, 2017) and VGG (Simonyan and Zisserman, 2014) features on the Charades-STA dataset for fair comparison with other methods. Following previous works, we set the length M of the sampled anchor video sequences to 200 for the ActivityNet Captions and TACoS datasets, and 64 for the Charades-STA dataset. As for sentence encoding, we utilize GloVe word2vec (Pennington et al., 2014) to embed each word into a 300-dimensional feature. The hidden state dimensions of the Bi-GRU and Bi-LSTM are set to 512. The number K of sampled siamese frames for each anchor frame is set to 4. We train our model with an Adam optimizer with learning rates of 8 × 10^-4, 3 × 10^-4, and 4 × 10^-4 for the ActivityNet Captions, TACoS, and Charades-STA datasets, respectively. The batch size is set to 64.
On the Charades-STA dataset, for fair comparison with other methods, we perform experiments with the same features (i.e., VGG, C3D, and I3D) reported in their papers. The results show that our SSRN reaches the highest results over all metrics.

Efficiency comparison. To compare the efficiency of our SSRN with previous methods, we make a fair comparison on a single Nvidia TITAN XP GPU on the TACoS dataset. As shown in Table 4, we achieve much faster processing speeds with a competitive model size.

Ablation Study
Effect of the siamese learning strategy. As shown in Table 5, we set the network without both siamese sampling/reasoning and soft-label training as the baseline (model ①). Compared with the baseline, model ③ additionally extracts siamese frames for contextual learning and clearly improves the accuracy. It directly utilizes an average operation to aggregate siamese knowledge and exploits concatenation for knowledge reasoning, which validates that multiple frames from the same video can bring strong knowledge to enhance the network. When further applying the siamese knowledge reasoning (SKR) module to model ③, model ④ performs better, demonstrating the effectiveness of our SKR module. When we further add the siamese knowledge aggregation (SKG) module, model ⑤ reaches a higher performance, which demonstrates the effectiveness of building the interdependent knowledge (i.e., siamese knowledge) for integrating the samples. It also shows that adaptive reasoning with our siamese knowledge is better than the purely average operation. We think that the siamese knowledge not only serves as a knowledge-routed representation, but also implicitly constrains the semantic consistency of frames in the space of frame-text features.

Figure 3: Visualization examples showing the benefits from the siamese frames. Due to the boundary-bias in the sparse sampling process, the previous VSLNet method filters out the true-positive boundary frames and fails to predict accurate boundaries. Instead, our siamese learning strategy supplements the query-related information of the adjacent frames into the ambiguous downsampled boundary frames for predicting more precise boundaries.

Effect of the usage of soft labels. We also investigate whether our soft label (float value) of the segment boundary contributes to the performance of our model. As shown in Table 5, directly applying soft-label learning to the baseline does not bring a significant performance improvement (model ②).
This is mainly because the boundary frame may be query-irrelevant, so its feature cannot be accurately matched with the query. Instead, comparing model ⑥ with model ⑤, model ⑥ enriches the boundary frames with siamese contexts and supplements them with the neighboring query-related visual information. Therefore, using the soft label in the training process brings a significant improvement.

Effect of the number of siamese frames. We compare our method with various numbers of siamese frames, as shown in Table 6. When increasing the siamese sample number K from 1 to 8, our method steadily improves the accuracy. Such improvement demonstrates that more siamese samples bring richer knowledge, from which our network benefits. Although the accuracy increases with the number of siamese frames, we observe that the improvement from 4 to 8 is slight. We attribute this to the saturation of knowledge, i.e., the model already has enough knowledge to learn the task on this dataset. Hence, purely increasing the number of siamese frames is of little use. To balance training time and accuracy, we set K = 4 in our final version.

Plug-and-Play. Our proposed siamese learning strategy is flexible and can be adopted by other TSG methods for anchor feature enhancement. As shown in Table 7, we directly apply the siamese learning strategy to existing modules for anchor feature enriching without using soft-label training. The results show that our siamese learning strategy can provide more contextual and fine-grained information for anchor feature encoding, bringing large improvements.

Qualitative Results
In Figure 3, we show two visualization examples to qualitatively analyze what kind of knowledge the siamese frames bring to the anchor frames. It is unavoidable to lose some visual content when sparsely sampling from the video. Especially for the boundary frames, which are easily filtered out by sampling, the visual content of the newly sampled boundary may lose query-relevant information (e.g., the brown words in the figure). However, we can obtain the absent content from their siamese frames due to the different sampling indices and durations. Hence, our siamese frames can enrich and supplement the sampled frames with more consecutive query-related visual semantics for fine-grained video comprehension, keeping the appropriate segment length of the sampled video for more accurate boundary prediction.

Conclusion
In this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) to alleviate the limitations of both boundary-bias and reasoning-bias in existing TSG methods. In addition to the original anchor frames, our model also samples a certain number of siamese frames from the same video to enrich and refine the visual semantics of the anchor frames. A soft label is further exploited to supervise the enhanced anchor features for predicting more accurate segment boundaries. Experimental results show both the effectiveness and efficiency of our SSRN on three challenging datasets.

Limitations
This work analyzes an interesting problem of how to learn from the video itself to address the limitation of boundary-bias in temporal sentence grounding. Since our method targets the issue of long-video sampling, it may not be helpful for short-video processing, though it can still improve contextual representation learning for short videos. Besides, our sampled siamese frames bring extra overhead (e.g., computation, memory, and parameters) during training and testing. Therefore, a lighter way to ease the siamese knowledge extraction is a promising future direction.