Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding

A key to temporal sentence grounding (TSG) lies in learning an effective alignment between vision and language features extracted from an untrimmed video and a sentence description. Existing methods mainly leverage vanilla soft attention to perform the alignment in a single-step process. However, such single-step attention is insufficient in practice, since complicated relations between and within modalities are usually obtained through multi-step reasoning. In this paper, we propose an Iterative Alignment Network (IA-Net) for the TSG task, which iteratively interacts inter- and intra-modal features over multiple steps for more accurate grounding. Specifically, during the iterative reasoning process, we pad the multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs, and enhance the basic co-attention mechanism into a parallel form. To further calibrate the misaligned attention produced at each reasoning step, we also devise a calibration module following each attention module to refine the alignment knowledge. With such an iterative alignment scheme, our IA-Net robustly captures the fine-grained relations between the vision and language domains step-by-step, progressively reasoning about the temporal boundaries. Extensive experiments conducted on three challenging benchmarks demonstrate that our proposed model outperforms the state-of-the-art methods.


Introduction
Temporal localization is an important topic of visual understanding in computer vision. Several related tasks have been proposed for different scenarios, such as video summarization (Chu et al., 2015), temporal action localization (Shou et al., 2016, 2017), and temporal sentence grounding (Gao et al., 2017; Anne Hendricks et al., 2017). Among them, temporal sentence grounding is the most challenging task due to the complexity of its multi-modal interactions and its complicated context information. Given an untrimmed video, it aims to determine the segment boundaries, including start and end timestamps, that contain the activity of interest according to a given sentence description.

* Corresponding author.

[Figure 1: Illustration of our motivation. Upper: Previous methods are mainly based on a single-step interaction with attention, which is insufficient to reason about the complicated multi-modal relations and may thus lead to semantic misalignment (sentence query: "All four are once again talking in front of the camera"; ground truth: [173.49s, 212.67s]). Bottom: We develop an iterative network with an improved inter- and intra-attention mechanism and a calibration module, which progressively aligns accurate semantics.]
Existing methods mainly focus on learning the multi-modal interaction through a single-step attention mechanism. Most approaches (Yuan et al., 2019b; Chen and Jiang, 2019; Zhang et al., 2019a; Rodriguez et al., 2020; Chen et al., 2020; Liu et al., 2020a,b) utilize a simple co-attention mechanism to learn the inter-modality relations between word-frame pairs for aligning the semantic information. Besides, some approaches (Zhang et al., 2019b) employ single-step self-attention to explore the contextual information within each modality to correlate relevant frames or words.
Although these methods achieve promising results, they are severely limited by two issues. 1) These single-step methods only consider the inter- or intra-modal relations once, which is insufficient for learning the complicated multi-modal interactions that require multi-step reasoning. Moreover, the inter-modal misalignment or the wrong intra-modal attention caused by such single-step attention directly degrades the quality of the predicted boundaries. As shown in Figure 1, the task targets localizing the query "All four are once again talking in front of the camera" in the video. It is hard for single-step methods to directly pay sufficient attention to the phrase "once again", easily leading to the misalignment problem and a wrong grounding result. 2) The nowhere-to-attend problem commonly occurs in the TSG task: background frames do not match any word in the sentence, and basic attention may generate wrong attention weights in these cases.
In this paper, we develop a novel Iterative Alignment Network (IA-Net) for temporal sentence grounding, which addresses the above problems in an end-to-end framework with multi-step reasoning. Specifically, we introduce an iterative matching scheme that progressively explores both inter- and intra-modal relations with an improved attention-based inter- and intra-modal interaction module. In this module, we first pad the multi-modal features with learnable parameters to tackle the nowhere-to-attend problem, and enhance the basic co-attention mechanism into a parallel form that provides multiple attended features for better capturing the complicated inter- and intra-modal relations. Then, to refine and calibrate the misaligned attention produced in early reasoning steps, we develop a calibration module following each attention module to refine the alignment knowledge during the iterative process. By stacking multiple such improved interaction modules, our IA-Net iteratively reasons about the complicated relations between the vision and language features step-by-step, yielding more accurate segment boundaries.
Our main contributions are three-fold:
• We propose an iterative framework for temporal sentence grounding to progressively align the complicated semantics between vision and language.
• We formulate the proposed iterative matching method with an improved co-attention mechanism that utilizes learnable paddings to address the nowhere-to-attend problem with deep latent clues, and a calibration module that refines and calibrates the alignment knowledge of inter- and intra-modal relations during the reasoning process.
• Extensive experiments are performed to examine the effectiveness of the proposed IA-Net on three datasets (ActivityNet Captions, TACoS, and Charades-STA), on which we achieve state-of-the-art performance.

Related Work
Temporal sentence grounding. Temporal sentence grounding (TSG) is a recently introduced task (Gao et al., 2017; Anne Hendricks et al., 2017). Formally, given an untrimmed video and a natural-language sentence query, the task aims to identify the start and end timestamps of the specific video segment that contains the activities of interest semantically corresponding to the query. To interact video and sentence features, some works align the semantics of the video with the language via a recurrent neural network (RNN). (Chen et al., 2018) design a recurrent module to temporally capture the evolving fine-grained frame-by-word interactions between video and sentence. (Zhang et al.) propose to apply a bidirectional GRU instead of a plain RNN for alignment. However, these RNNs cannot align the semantics well in this task. As attention has proven its effectiveness for contextual correlation mining, a number of works align relevant visual features with the query text via an attention module. (Liu et al., 2018a) design a memory attention mechanism on the query sentence to emphasize the visual features mentioned in the sentence. (Wang et al., 2020) use soft attention on moment features based on the sentence feature. Others adopt a simple multiplication operation for visual and language feature fusion. Moreover, the visual-textual co-attention module is widely utilized to model the cross-modal interaction (Liu et al., 2018b; Yuan et al., 2019b; Chen et al., 2020; Rodriguez et al., 2020; Qu et al., 2020; Nan et al., 2021), performing effectively and efficiently in most challenging scenes. There are also some works (Zhang et al., 2019b) that explore intra-modal context with self-attention. Although these methods have made great progress in TSG, they are severely limited by such single-step attention mechanisms. Motivated by this, we introduce an iterative alignment scheme to explore fine-grained inter- and intra-modal relations.
For complicated correlation capturing, we pad the multi-modal features with learnable parameters and enhance the co-attention with multiple heads. For semantic misalignment, we additionally develop a calibration module to refine the alignment knowledge. Such an iterative process helps our model align semantics more accurately.

Attention mechanism. Attention has achieved great success in various tasks, such as image classification, machine translation, and visual question answering. (Vaswani et al., 2017) propose the Transformer to capture long-term dependencies with a multi-headed architecture. Although it attracts great interest from the multi-modal retrieval community, it only considers sentence-guided attention on video frames or video-guided attention on sentence words, with complex computation. In contrast, the co-attention mechanism (Lu et al., 2016; Xiong et al., 2016) jointly reasons about frame and word attention with light weights, which is more suitable for the real-world temporal sentence grounding task. In this paper, we consider the nowhere-to-attend cases in the TSG task, where a frame/word is irrelevant to the whole sentence/video, and address them by utilizing learnable paddings during attention-map generation. We also improve the basic co-attention mechanism into a parallel, Transformer-like form that provides multiple latent attended features for better correlation mining.

The Proposed Method
The TSG task considered in this paper is defined as follows. Given an untrimmed reference video $V$ and a sentence query $Q$, we need to predict the start and end timestamps $(\tau_s, \tau_e)$ such that the segment in $V$ from time $\tau_s$ to $\tau_e$ shares the same semantics as $Q$.
In this section, we introduce our framework IA-Net as shown in Figure 2. Our model consists of three main components: the video and query encoders, the iterative inter- and intra-modal interaction, and the segment localizer. The video and sentence are first fed to the encoders for extracting multi-modal features. We then iteratively interact their features for semantic alignment. Specifically, in each iterative step, we utilize one co-attention to align the inter-modal semantics and another co-attention to correlate intra-modal instances within each modality. We improve the basic co-attention mechanism into a parallel form, and devise two calibration modules following the inter- and intra-attention to refine and calibrate the knowledge of cross-modal alignment and self-modal correlation during the iterative interaction process. Finally, we utilize a segment localizer to ground the segment boundaries.

Video and Query Encoders
Video encoder. For video encoding, we first extract clip-wise features with a pre-trained C3D network (Tran et al., 2015), and then add a positional encoding (Vaswani et al., 2017) to inject positional knowledge. Considering the sequential characteristics of video, a bi-directional GRU (Chung et al., 2014) is further utilized to incorporate the contextual information over time. The output of this encoder is $V = \{v_t\}_{t=1}^{T} \in \mathbb{R}^{T \times D}$, which encodes the context of the video.
Query encoder. For query encoding, we first extract the word embeddings with the GloVe word2vec model (Pennington et al., 2014), and likewise use a positional encoding and a bi-directional GRU to integrate the sequential information. The final representation of the input sentence is denoted as $Q = \{q_n\}_{n=1}^{N} \in \mathbb{R}^{N \times D}$.
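The shape flow of the two encoders can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the `positional_encoding` helper follows the standard sinusoidal form of Vaswani et al. (2017), and the random arrays stand in for the pre-trained C3D features and GloVe embeddings (the Bi-GRU that would follow is omitted).

```python
import numpy as np

def positional_encoding(length, dim):
    """Sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(length)[:, None]           # (length, 1)
    i = np.arange(dim)[None, :]                # (1, dim)
    angle = pos / np.power(10000, (2 * (i // 2)) / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angle[:, 0::2])       # even channels: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])       # odd channels: cosine
    return pe

# Hypothetical sizes matching the paper's settings: T=200 clips, D=512 channels.
T, N, D = 200, 20, 512
clip_feats = np.random.randn(T, D)             # stand-in for C3D clip features
word_feats = np.random.randn(N, D)             # stand-in for projected GloVe embeddings

V = clip_feats + positional_encoding(T, D)     # a Bi-GRU would follow in the full model
Q = word_feats + positional_encoding(N, D)
```

The positional encoding is added element-wise, so both modalities keep their original shapes, $T \times D$ and $N \times D$.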

Improved Inter-modal Interaction
The improved inter-modal interaction module is based on a co-attention mechanism that captures the importance of each pair of visual clip and word features. To tackle the nowhere-to-attend problem and calibrate misaligned knowledge, we improve the co-attention into a parallel form with learnable paddings and devise a calibration module that follows it. Details are shown in Figure 3.
Nowhere-to-attend and parallel attention. Previous co-attention based works in TSG (Yuan et al., 2019b; Chen et al., 2020) compute the attention maps by directly calculating the inner product between $V$ and $Q$. However, it often happens during the creation of an attention map that there is no particular frame or word the model should attend to, especially for background frames that do not match any word in the sentence. This leads to wrong attention on the mismatched frame-word pairs. To handle such cases, we add $K$ elements to both the sentence words and the video clips to additionally serve the no-attention instances. In detail, we incorporate two learnable padding matrices and concatenate them to the clip and word features, obtaining the padded features $\widetilde{V} \in \mathbb{R}^{(T+K) \times D}$ and $\widetilde{Q} \in \mathbb{R}^{(N+K) \times D}$.

Besides, we also enhance the co-attention with multiple attention heads to capture complicated relations in different latent spaces, and use their average as the attention result. To generate $H$ attention maps, we first linearly project the $D$-dimensional features of $\widetilde{V}, \widetilde{Q}$ into multiple lower $D'$-dimensional spaces, where $D' = D/H$. Taking the $h$-th head ($h \in \{1, \dots, H\}$) as an example:
$$\widetilde{V}^h = \mathrm{Linear}(\widetilde{V}; \Theta_V^h), \quad \widetilde{Q}^h = \mathrm{Linear}(\widetilde{Q}; \Theta_Q^h),$$
where $\mathrm{Linear}(\cdot)$ denotes a fully-connected layer with parameters $\Theta$. Then, we compute the attention map by inner product with row-wise normalization:
$$A^h = \mathrm{softmax}\big(\widetilde{V}^h (\widetilde{Q}^h)^{\top}\big).$$
We take the average fusion of the multiple attended features, which is equivalent to averaging the $H$ attention maps:
$$A = \frac{1}{H} \sum_{h=1}^{H} A^h.$$
At last, we obtain the $V$- and $Q$-grounded alignment features $M_V, M_Q$, in which each element captures the related semantics shared by the whole $Q$ or $V$ with each $v_t$ or $q_n$:
$$M_V = A \widetilde{Q}, \quad M_Q = A^{\top} \widetilde{V}.$$

Calibration module. After receiving the alignment features $M_V, M_Q$ and the multi-modal features $V, Q$, to refine the alignment knowledge for the next interaction step, we update each modality's features by dynamically aggregating them with the corresponding alignment features through a gate function. In detail, we first generate a fusion feature $R$ for each modality to enhance its semantics:
$$R_V = \tanh(W_1 V + U_1 M_V + b_1), \quad R_Q = \tanh(W_2 Q + U_2 M_Q + b_2),$$
where $W$, $U$, $b$ are learnable parameters.
To select the discriminative information and filter out incorrect information, a gating weight is computed as:
$$G_V = \sigma(W_3 V + U_3 R_V + b_3), \quad G_Q = \sigma(W_4 Q + U_4 R_Q + b_4),$$
where $\sigma$ is the sigmoid function. At last, the calibrated output of the current inter-modal interaction module is obtained by:
$$\widehat{V} = G_V \odot R_V + (1 - G_V) \odot V, \quad \widehat{Q} = G_Q \odot R_Q + (1 - G_Q) \odot Q,$$
where $\odot$ denotes element-wise multiplication. The developed gate mechanism makes two main contributions: 1) The information of each modality can be refined by itself and by the enhanced semantic features shared through $R$. This helps filter out trivial information in $V, Q$ and calibrate misaligned attention by re-considering the individually shared semantics. 2) The alignment features $M_V, M_Q$ summarize the contexts of each instance in the cross-modal features $Q, V$, respectively. After the gating process, the contextual information maintained in $\widehat{V}, \widehat{Q}$ also helps determine the shared semantics in the subsequent attention procedure. This progressively enhances the interaction among inter-modal features and thus benefits representation learning.
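The padded parallel co-attention and the calibration gate described above can be sketched in NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: the random projection matrices stand in for the learnable `Linear` layers and gate parameters, the exact parameterization of the gate is our simplification, and only the video-side calibration is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, N, D, K, H = 6, 4, 8, 2, 2          # toy sizes; D must be divisible by H
Dh = D // H                            # per-head dimension D' = D/H

V = rng.standard_normal((T, D))
Q = rng.standard_normal((N, D))
pad_V = rng.standard_normal((K, D))    # learnable "nowhere-to-attend" paddings
pad_Q = rng.standard_normal((K, D))
Vp = np.concatenate([V, pad_V], 0)     # (T+K, D)
Qp = np.concatenate([Q, pad_Q], 0)     # (N+K, D)

# H parallel heads: project to lower dims, inner product, row-wise softmax.
maps = []
for h in range(H):
    Wv = rng.standard_normal((D, Dh)) / np.sqrt(D)   # stand-in for Linear(.;Theta)
    Wq = rng.standard_normal((D, Dh)) / np.sqrt(D)
    maps.append(softmax((Vp @ Wv) @ (Qp @ Wq).T, axis=-1))
A = np.mean(maps, 0)                   # average the H attention maps

M_V = (A @ Qp)[:T]                     # V-grounded alignment features (drop pads)
M_Q = (A.T @ Vp)[:N]                   # Q-grounded alignment features

# Calibration gate (video side only; parameter shapes are assumptions).
W = rng.standard_normal((D, D)) / np.sqrt(D)
U = rng.standard_normal((D, D)) / np.sqrt(D)
R_V = np.tanh(V @ W + M_V @ U)                      # fused feature
G_V = 1 / (1 + np.exp(-(V @ W + R_V @ U)))          # sigmoid gating weight
V_hat = G_V * R_V + (1 - G_V) * V                   # calibrated output
```

Because the padding rows take part in the softmax, a background clip can place its attention mass on the pads instead of being forced onto an unrelated word.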

Improved Intra-modal Interaction
The output visual clip and sentence word features of the inter-modal interaction module have encoded the cross-modal relations between clips and words. With such contextual cross-modal information, we apply intra-attention within each modality to correlate the relevant instances for composing the scene meaning. Different from the inter-attention, there is no nowhere-to-attend problem here, since the video or sentence has strong temporal relations within itself. A calibration module is also utilized for self-relation refinement, as shown in Figure 4.
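A minimal sketch of the intra-attention on the clip features, again in NumPy and not the authors' code: since no nowhere-to-attend case arises, the self-attention is computed without padding rows, and the same style of gated calibration as in the inter-modal module would then fuse the attended context back into the features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
T, D = 6, 8
V = rng.standard_normal((T, D))        # clip features from the inter-modal module

# Intra-attention: clips attend to clips, with no padding elements needed.
A = softmax(V @ V.T, axis=-1)          # (T, T) self-attention map
S_V = A @ V                            # context-aggregated clip features
# A gated calibration step (as in the inter-modal module) would fuse S_V with V.
```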

Iterative Alignment with the Improved Inter- and Intra-modal Interaction Block

In this section, we introduce how to integrate the improved inter- and intra-modal interaction modules to enable iterative alignment for temporal sentence grounding. The inter-modal interaction aggregates features from the other modality to update the clip and word features according to the cross-modal relations. The clip and word features are then updated again with information from within the same modality via the intra-modal interaction. We use one inter-modal interaction module followed by one intra-modal interaction module to form a basic improved interaction block (IIB) in our IA-Net framework:
$$(\widehat{V}^{l}, \widehat{Q}^{l}) = \mathrm{IIB}^{l}(\widehat{V}^{l-1}, \widehat{Q}^{l-1}),$$
where $l$ is the block number. Multiple blocks can be further stacked thanks to the calibration module for alignment refinement, helping to reason about the accurate segment boundaries.
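The stacking of interaction blocks can be sketched as the following loop. All the helpers here (`inter_attend`, `intra_attend`, `calibrate`) are deliberately simplified stand-ins for the modules above, not the paper's exact parameterization; the point is only the iteration structure: inter-attention, calibration, intra-attention, calibration, repeated per block.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_attend(V, Q):
    """Cross-modal co-attention producing alignment features M_V, M_Q."""
    A = softmax(V @ Q.T, axis=-1)
    return A @ Q, A.T @ V

def intra_attend(X):
    """Self-attention within one modality."""
    return softmax(X @ X.T, axis=-1) @ X

def calibrate(X, M):
    """Parameter-free stand-in for the gated calibration module."""
    G = 1 / (1 + np.exp(-(X + M)))              # gating weight
    return G * np.tanh(X + M) + (1 - G) * X     # gated blend with the input

rng = np.random.default_rng(2)
V, Q = rng.standard_normal((6, 8)), rng.standard_normal((4, 8))
for l in range(3):                     # 3 stacked blocks worked best in the ablation
    M_V, M_Q = inter_attend(V, Q)                      # inter-modal interaction
    V, Q = calibrate(V, M_V), calibrate(Q, M_Q)        # calibration
    V, Q = calibrate(V, intra_attend(V)), calibrate(Q, intra_attend(Q))  # intra
```

Because each block consumes and emits features of the same shape, the blocks compose freely, which is what makes the multi-step reasoning possible.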


Segment Localizer and Loss Function
After multiple interaction blocks, we utilize a cosine similarity function (Mithun et al., 2019) between $\widehat{V}$ and $\widehat{Q}$ to generate a new video-aware sentence representation $\widehat{Q}'$ that has the same temporal dimension $T$ as $\widehat{V}$. We fuse the two modal features into $\{f_t\}_{t=1}^{T}$. To predict the target video segment, similar to prior work, we pre-define multi-size candidate moments $\{(\hat{\tau}_s, \hat{\tau}_e)\}$ at each time $t$, and adopt multiple fully-connected (FC) layers on the features $f_t$ to produce the confidence scores $\{cs\}$ of all windows and to predict the corresponding temporal offsets $\{(\hat{\delta}_s, \hat{\delta}_e)\}$. The final predicted moments at time $t$ are $\{(\hat{\tau}_s + \hat{\delta}_s, \hat{\tau}_e + \hat{\delta}_e)\}$.
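The candidate-and-offset scheme above can be sketched as follows. This is a toy NumPy illustration with assumed window sizes and random stand-ins for the FC confidence scores and offsets; the real model learns both from the fused features.

```python
import numpy as np

T = 10                                 # toy number of time steps
window_sizes = [2, 4]                  # toy analogue of the per-dataset kernel sizes

# Pre-define multi-size candidate moments anchored at each time step t.
candidates = []
for t in range(T):
    for w in window_sizes:
        candidates.append((t, min(t + w, T)))

rng = np.random.default_rng(3)
scores = rng.random(len(candidates))                       # stand-in for FC scores
offsets = rng.standard_normal((len(candidates), 2)) * 0.1  # (delta_s, delta_e)

# The highest-scoring window, refined by its predicted boundary offsets.
best = int(np.argmax(scores))
s, e = candidates[best]
pred = (s + offsets[best, 0], e + offsets[best, 1])
```

At test time the same scores are used to rank all candidates, so the top-n moments fall out of a single forward pass.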
Training. We first compute the Intersection over Union (IoU) score $o$ between each candidate moment $(\hat{\tau}_s, \hat{\tau}_e)$ and the ground truth $(\tau_s, \tau_e)$. If $o$ is larger than a threshold $\lambda$, the moment is viewed as a positive sample, otherwise as a negative sample. We thus obtain $N_{pos}$ positive samples and $N_{neg}$ negative samples, $N_{total}$ in total. We adopt an alignment loss to align the predicted confidence scores with the IoU:
$$\mathcal{L}_{align} = -\frac{1}{N_{total}} \sum \big( o \log(cs) + (1 - o) \log(1 - cs) \big).$$
We also devise a boundary loss over the $N_{pos}$ positive samples to promote precise start and end points:
$$\mathcal{L}_{b} = \frac{1}{N_{pos}} \sum \big( S(\hat{\delta}_s - \delta_s) + S(\hat{\delta}_e - \delta_e) \big),$$
where $S$ is the smooth L1 function. We use $\alpha$ to balance the alignment and boundary losses:
$$\mathcal{L} = \mathcal{L}_{align} + \alpha \mathcal{L}_{b}.$$
Testing. We rank all candidate moments according to their predicted confidence scores, and then select the "Top-n" moments as the prediction.

Datasets

TACoS. TACoS (Regneri et al., 2013) is widely used for the TSG task and contains 127 videos. The videos are collected from cooking scenarios and thus lack diversity; they are around 7 minutes long on average. We use the same split as (Gao et al., 2017), which includes 10146, 4589, and 4083 query-segment pairs for training, validation, and testing.
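The IoU-based sample labeling and the alignment loss described in the Training paragraph above can be sketched as follows (a minimal NumPy version; the candidate moments and confidence scores here are made-up stand-ins, and the binary-cross-entropy form of the alignment loss is our reading of the text).

```python
import numpy as np

def temporal_iou(cand, gt):
    """Intersection over Union of two 1-D temporal segments."""
    (s1, e1), (s2, e2) = cand, gt
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = max(e1, e2) - min(s1, s2)
    return inter / union if union > 0 else 0.0

gt = (173.49, 212.67)                             # ground-truth segment (Fig. 1)
cands = [(170.0, 210.0), (0.0, 50.0), (175.0, 215.0)]   # toy candidate moments
ious = np.array([temporal_iou(c, gt) for c in cands])

lam = 0.45                                        # high-score threshold from the paper
pos = ious > lam                                  # positive vs. negative samples

cs = np.array([0.8, 0.2, 0.9])                    # stand-in confidence scores
# Alignment loss: cross-entropy between scores and IoU targets (assumed form).
L_align = -np.mean(ious * np.log(cs) + (1 - ious) * np.log(1 - cs))
```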
Charades-STA. Charades-STA is built on the Charades dataset (Sigurdsson et al., 2016), which focuses on indoor activities. In total, there are 12408 and 3720 moment-query pairs in the training and testing sets respectively.

Experimental Settings
Evaluation Metric. Following previous works (Gao et al., 2017), we adopt "R@n, IoU=m" as our evaluation metric, defined as the percentage of queries for which at least one of the top-n selected moments has an IoU larger than m.

Implementation Details. We define 16 consecutive frames as a clip, with each clip overlapping 8 frames with its neighbors, and apply C3D (Tran et al., 2015) to encode the videos on ActivityNet Captions and TACoS, and I3D (Carreira and Zisserman, 2017) on Charades-STA. We set the length of the video feature sequences to 200 for the ActivityNet Captions and TACoS datasets, and 64 for the Charades-STA dataset. For sentence encoding, we utilize GloVe word2vec (Pennington et al., 2014) to embed each word into 300-dimensional features. The hidden-state dimension of the Bi-GRU networks is set to 512. During segment localization, we adopt convolution kernel sizes of [16, 32, 64, 96, 128, 160, 192] for ActivityNet Captions, [8, 16, 32, 64, 128] for TACoS, and [16, 24, 32, 40] for Charades-STA, with strides of 0.5, 0.125, and 0.125, respectively. We set the high-score threshold $\lambda$ to 0.45, and the balance hyper-parameter $\alpha$ to 0.001 for ActivityNet Captions and 0.005 for TACoS and Charades-STA. We train our model with the Adam optimizer with learning rates of $8 \times 10^{-4}$, $3 \times 10^{-4}$, and $4 \times 10^{-4}$ for ActivityNet Captions, TACoS, and Charades-STA, respectively.
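The "R@n, IoU=m" metric described above can be implemented in a few lines; a sketch following the definition in the text (IoU strictly larger than m), with `recall_at_n` as our own helper name:

```python
def recall_at_n(ranked_preds, gt, n, m):
    """R@n, IoU=m: does any of the top-n ranked predictions exceed IoU m?"""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0
    return any(iou(p, gt) > m for p in ranked_preds[:n])

# Toy usage: two predictions ranked by confidence against one ground truth.
preds = [(170.0, 210.0), (0.0, 50.0)]
hit = recall_at_n(preds, (173.49, 212.67), n=1, m=0.5)
```

Averaging this boolean over all test queries gives the reported percentage for each (n, m) pair.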
Analysis. As shown in Tables 1 and 2, we compare our IA-Net with all the above methods on the three datasets. IA-Net performs among the best in various scenarios on all three benchmarks across different criteria, ranking first or second in all cases. On ActivityNet Captions, we outperform DRN by 3.59% and 12.82% on the strict metrics "R@1, IoU=0.7" and "R@5, IoU=0.7", and bring 1.41% and 1.16% improvements over 2DTAN. On TACoS, the cooking activities take place in the same kitchen scene with only slightly varied cooking objects, so localizing such fine-grained activities is hard. Compared to the top-ranked method 2DTAN, our model still achieves the best results on "R@1, IoU=0.5" and "R@5, IoU=0.5", which validates that IA-Net localizes moment boundaries more precisely. On Charades-STA, we outperform SCDM by 6.85%, 4.48%, 15.35%, and 3.96% on all metrics. The reasons our model outperforms the competing models are two-fold. First, compared to methods like GDP and CMIN, which utilize basic attention to interact multi-modal features, our method provides an improved attention mechanism that addresses the "nowhere-to-attend" problem, together with a calibration module that refines and calibrates the alignment knowledge. Second, previous works all adopt a single interaction process with no tolerance for attention mistakes. Thanks to the calibration module used throughout the framework, our IA-Net can stack multiple interaction blocks to progressively reason towards accurate segment boundaries.

Model Efficiency Comparison
To further investigate the efficiency of our IA-Net, we conduct a comparison on the TACoS dataset with other released methods. All experiments are run on one NVIDIA TITAN-XP GPU. As shown in Table 3, "Run-Time" denotes the average time to localize one sentence in a given video, and "Model Size" denotes the number of parameters. Our IA-Net achieves the fastest run-time with a relatively smaller model size. Since CTRL and ACRN need to sample candidate segments with various sliding windows, they require a quite time-consuming matching procedure. 2DTAN adopts a convolutional architecture to generate a large 2D temporal map, which entails a large number of parameters across the convolution layers. Compared to them, our IA-Net is lightweight, with only the parameters of its linear layers, leading to a relatively smaller model size and faster inference than 2DTAN.

Ablation Study
We perform extensive ablation studies on the ActivityNet Captions dataset; the results are shown in Table 4. How to choose the padding size? We investigate the performance for different padding sizes K, which were originally introduced to deal with the "nowhere-to-attend" problem. Although K = 1 already allows a frame or word to attend to nothing, such a limited latent space cannot capture the complicated relations across different videos and sentences. As shown in Table 4, we find that K > 1 improves performance to a certain extent, and K = 3 yields the best results. How many attention heads are needed? Multiple attention maps guide the model to attend to different relationships in several latent spaces. Following (Vaswani et al., 2017), we experiment with 1, 2, 4, and 8 attention maps, and find that H = 4 achieves the best performance.
Choice of the number of stacked interaction blocks. Our improved interaction block contains a calibration module for refining the alignment knowledge. The results in Table 4 indicate that using more than 3 blocks brings no further improvement.

Qualitative Results
We visualize the fused attention maps in Figure 5 (a). For frame-to-word attention, it can be observed that the first interaction block fails to focus on the word "first". With the help of the calibration module, the attention is progressively calibrated onto the word "first" in the following blocks. Besides, the third frame pays more attention to the padding elements, since it does not match any word in the query, which indicates that our IA-Net addresses the "nowhere-to-attend" problem well. For frame-to-frame attention, our model correlates the relevant frames more precisely in deeper blocks.
The qualitative results are shown in Figure 5 (b).

Conclusion
In this paper, we have studied the problem of temporal sentence grounding and proposed a novel Iterative Alignment Network (IA-Net) in an end-to-end fashion. The core of our network is the multi-step reasoning process with the improved inter- and intra-modal interaction module, which is designed in two aspects: 1) we pad the multi-modal features with learnable parameters to capture more complicated correlations in a deep latent space; 2) we develop a calibration module to refine and calibrate the alignment knowledge from early steps. By stacking multiple such interaction modules, our IA-Net progressively captures the fine-grained interactions between the two modalities, providing more accurate video segment boundaries. Extensive experiments on three challenging benchmarks demonstrate the effectiveness and efficiency of the proposed IA-Net.