Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manually annotating temporal boundary labels, we are dedicated to the weakly supervised setting, where only video-level descriptions are provided for training. Most existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework. However, the temporal structure of the video as well as the complicated semantics in the sentence are lost during the learning. In this work, we propose a novel candidate-free framework for weakly supervised TLG: the Fine-grained Semantic Alignment Network (FSAN). Instead of viewing the sentence and candidate moments as wholes, FSAN learns token-by-clip cross-modal semantic alignment with an iterative cross-modal interaction module, generates a fine-grained cross-modal semantic alignment map, and performs grounding directly on top of the map. Extensive experiments are conducted on two widely-used benchmarks, ActivityNet-Captions and DiDeMo, where our FSAN achieves state-of-the-art performance.


Introduction
Given an untrimmed video and a natural language sentence, Temporal Language Grounding (TLG) aims to localize the temporal boundaries of the video segment described by the referred sentence. TLG is a challenging problem of great importance in various multimedia applications, e.g., video retrieval (Shao et al., 2018), visual question answering (Tapaswi et al., 2016; Antol et al., 2015; Yu et al., 2020), and visual reasoning (Yang et al., 2018). Since its first proposal (Gao et al., 2017; Hendricks et al., 2017), tremendous progress has been made on this problem (Wu and Han, 2018; Chen et al., 2018; Ge et al., 2019; Yuan et al., 2018; Zhang et al., 2019a; Wang et al., 2019; Zhang et al., 2020b; Ning et al., 2021). Despite the achievements with supervised learning, the temporal boundaries for every sentence query need to be manually annotated for training, which is expensive, time-consuming, and potentially noisy. On the other hand, it is much easier to collect a large amount of video-level descriptions without detailed temporal annotations, since video-level descriptions naturally appear together with videos on the Internet (e.g., YouTube). To this end, some prior works are dedicated to the weakly supervised setting, where only video-level descriptions are provided, without temporal labels.

Figure 1: Illustration of fine-grained semantic alignment map for temporal language grounding.
Most previous weakly supervised methods follow a Multiple Instance Learning (MIL) paradigm, which samples matched and non-matched video-sentence pairs and learns a matching classifier to implicitly learn the cross-modal alignment. However, during matching classification, the input sentence is often treated as a single feature query, neglecting its complicated linguistic semantics. VLANet (Ma et al., 2020) treats tokens in the input sentence separately and performs cross-modal attention on token-moment pairs, where the moment candidates are carefully selected by a surrogate proposal selection module to reduce computation cost. SCN (Lin et al., 2020) proposes to generate and select moment candidates and performs semantic completion for the sentence to rank the selected candidates. Nevertheless, the generation and selection of moment candidates also involves high computational cost. In addition, the moment candidates are considered separately, while the temporal structure of the video is also important for grounding. Figure 1 shows an example of localizing the query "person takes a phone off a desk" in the given video. If the model views the sentence as a whole and performs matching classification, it is hard to learn inconspicuous words like "off" during training. However, such neglected words may play important roles in determining the temporal boundaries of the described moment.
In this paper, we propose a novel framework named Fine-grained Semantic Alignment Network (FSAN) for weakly supervised temporal language grounding. The core idea of FSAN is to learn token-by-clip cross-modal semantic alignment, represented as a token-clip map, and to ground the sentence in the video directly based on it. Specifically, given an untrimmed video and a description sentence, we first extract their features with a visual encoder and a textual encoder independently. Then, an Iterative Cross-modal Interaction Module is devised to learn the correspondence between visual and linguistic representations. To make temporal predictions for grounding, we further devise a semantic alignment-based grounding module. Based on the learned cross-modal interacted features, a token-by-clip semantic alignment map is generated, where the (i, j)-th element of the map indicates the relevance between the i-th token in the sentence and the j-th clip in the video. Finally, an alignment-based grounding module predicts the grounding result corresponding to the input sentence.
Instead of aggregating sentence semantics into one representation and generating video moment candidates, FSAN learns a fine-grained cross-modal alignment map that helps to retain both the temporal structure among video clips and the complicated semantics of the sentence. Furthermore, the grounding module in FSAN makes predictions mainly based on the cross-modal alignment map, which alleviates the computation cost of generating candidate moment representations. We demonstrate the effectiveness of the proposed method on two widely-used benchmarks: ActivityNet-Captions (Krishna et al., 2017) and DiDeMo (Hendricks et al., 2017), where state-of-the-art performance is achieved by FSAN.

Temporal Language Grounding
Temporal language grounding was proposed (Gao et al., 2017; Hendricks et al., 2017) as a new and challenging task that requires deep interaction between the visual and linguistic modalities. Previous methods have explored this task in a fully supervised setting (Gao et al., 2017; Hendricks et al., 2017; Chen et al., 2018; Ge et al., 2019; Xu et al., 2019; Chen and Jiang, 2019; Yuan et al., 2018; Zhang et al., 2019b,a; Lu et al., 2019). Most of them follow a two-stage paradigm: generating candidate moments with sliding windows and subsequently matching them against the language query. Reinforcement learning has also been leveraged for temporal language grounding (Wang et al., 2019).
Despite the boom of fully supervised methods, it is very time-consuming and labor-intensive to annotate temporal boundaries for a large number of videos. Moreover, due to inconsistency among annotators, temporal labels are often ambiguous for models to learn. To alleviate the cost of fine-grained annotation, the weakly supervised setting has been explored recently (Mithun et al., 2019; Lin et al., 2020; Ma et al., 2020; Zhang et al., 2020c). TGA (Mithun et al., 2019) maps video candidate features and query features into a latent space to learn cross-modal similarity. In (Ma et al., 2020), a video-language attention network is proposed to learn cross-modal alignment between language tokens and video segment candidates. Differently, our FSAN avoids generating candidates altogether and learns fine-grained token-by-clip semantic alignment.

Transformer in Language and Vision
Since it was first proposed for machine translation (Vaswani et al., 2017), the transformer has become a prevailing architecture in NLP. The basic block of the transformer is the multi-head attention module, which aggregates information from the whole input in both the encoder and the decoder. Transformers demonstrate superior performance in language model pretraining methods (Devlin et al., 2019; Wang et al., 2021b), etc. Compared to CNNs, the attention mechanism captures more global dependencies; therefore, transformers also show great performance in low-level tasks (Chen et al., 2020a). Transformers have also proved effective in the multi-modal area, including multi-modal representations (Zhang et al., 2020a; Tan and Bansal, 2019; Su et al., 2020; Sun et al., 2019) and applications (Shi et al., 2020; Ju et al., 2020; Liang et al., 2020). Inspired by this success, we devise an iterative cross-modal interaction module mainly based on the multi-head attention mechanism.

Our Approach
Given an untrimmed video and a text-sentence query, a temporal grounding model aims to localize the most relevant moment in the video, represented by its beginning and ending timestamps. In this paper, we consider the weakly supervised setting, i.e., for each video V, a textual query S is provided for training. The query sentence describes a specific moment in the video, yet the temporal boundaries are not provided for training. In the inference stage, the weakly trained model is required to predict the beginning and ending timestamps of the video moment that corresponds to the input sentence S. We present a novel framework named Fine-grained Semantic Alignment Network (FSAN) for the temporal language grounding problem. As shown in Figure 2, given a video and a text query, we first encode them separately. The resulting representations then interact with each other through an iterative cross-modal interaction module. The outputs are used to learn a Semantic Alignment Map (SAP) between the two modalities. Finally, the SAP is fed into an alignment-based grounding module to predict scores for all possible moments.
In the following subsections, we will first introduce the visual and language encoder, then describe the Iterative Cross-Modal Interaction Module. Finally, we will elaborate on the semantic alignment map and the grounding module based on it.

Input Representation
Language Encoder. We use a standard transformer encoder (Vaswani et al., 2017) to extract the semantic information of the input query sentence S. Each token in the input query is first embedded using GloVe (Pennington et al., 2014). The resulting vectors are mapped to dimension $d_s$ by a linear layer and fed into a transformer encoder to obtain context-aware token features $S = \{w_i\}_{i=1}^{N_s}$, where $N_s$ is the number of tokens and $w_i \in \mathbb{R}^{d_s}$ denotes the feature of the i-th token in the sentence.
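As a concrete illustration, the embedding-and-projection step can be sketched as follows; the vocabulary size, dimensions, and random tables are stand-ins for the pretrained GloVe vectors and learned weights, and the subsequent transformer encoder is omitted:

```python
import numpy as np

# Hypothetical sketch of the language-branch input pipeline: each token id
# is looked up in a GloVe-style table (300-d) and linearly projected to d_s.
# All values here are illustrative assumptions, not the paper's weights.
rng = np.random.default_rng(0)
vocab_size, glove_dim, d_s = 1000, 300, 512

glove_table = rng.normal(size=(vocab_size, glove_dim))     # pretrained in practice
W_proj = rng.normal(size=(glove_dim, d_s)) / np.sqrt(glove_dim)

token_ids = np.array([3, 17, 42, 7])                       # a 4-token query
embedded = glove_table[token_ids]                          # (N_s, 300)
projected = embedded @ W_proj                              # (N_s, d_s)
print(projected.shape)                                     # (4, 512)
```

In the full model, `projected` would then pass through the transformer encoder to yield the context-aware token features.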
Video Encoder. For the input videos, we extract visual features using a pretrained feature extractor and then apply temporal pooling on the frame features to divide the video into $N_v$ clips. Hence the video can be represented by $V = \{v_j\}_{j=1}^{N_v}$, where $v_j \in \mathbb{R}^{d_v}$ denotes the feature of the j-th video clip, and $d_v = d_s$ is the dimension of the visual features. Experimental results show that the computation cost is considerably reduced by the temporal pooling.
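The clip-level pooling can be sketched as below; `temporal_pool` is a hypothetical helper, and the frame count, clip count, and feature dimension are illustrative:

```python
import numpy as np

# A minimal sketch of the temporal pooling step: T frame-level features are
# averaged into N_v fixed-length clip features. np.array_split handles the
# case where T is not divisible by N_v.
def temporal_pool(frame_feats, n_clips):
    """frame_feats: (T, d_v) -> (n_clips, d_v) by mean pooling."""
    chunks = np.array_split(frame_feats, n_clips, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

frames = np.random.default_rng(0).normal(size=(200, 500))  # e.g. C3D features
clips = temporal_pool(frames, n_clips=64)
print(clips.shape)  # (64, 500)
```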

Iterative Cross-Modal Interaction Module
Inspired by the great success of transformer encoders on vision-language pretraining (Tan and Bansal, 2019), we devise an Iterative Cross-modal Interaction Module (ICIM) to learn the semantic relevance between visual and textual representations. The module is composed of a stack of 6 layers, and each layer consists of cross-modal attention, inner-modal attention, and feed-forward sub-layers. The core component for cross-modal interaction is multi-head attention, which is also vital in the transformer structure. Formally, given two sequences of d-dimensional features $X = [x_1, \cdots, x_{N_x}]$ and $Y = [y_1, \cdots, y_{N_y}]$, multi-head attention is calculated as:

$A_i = \mathrm{softmax}\big((X W_i^Q)(Y W_i^K)^\top / \sqrt{d/m}\big)\, Y W_i^V, \qquad MA(X, Y) = [A_1; \cdots; A_m]\, W^O,$

where $A_i$ is the output of the i-th of $m$ attention heads. The final output $MA(X, Y)$ of multi-head attention has the same dimension as the input X.
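A minimal NumPy sketch of this cross-attention, with random matrices standing in for the learned projections (the head count and feature dimensions here are assumptions for illustration):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Y, m=4, seed=0):
    """Sketch of MA(X, Y): queries come from X, keys/values from Y.
    Random weights are stand-ins for learned parameters."""
    Nx, d = X.shape
    dk = d // m
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(m):
        Wq, Wk, Wv = (rng.normal(size=(d, dk)) / np.sqrt(d) for _ in range(3))
        Q, K, V = X @ Wq, Y @ Wk, Y @ Wv
        A = softmax(Q @ K.T / np.sqrt(dk)) @ V        # one head: (Nx, dk)
        heads.append(A)
    Wo = rng.normal(size=(d, d)) / np.sqrt(d)
    # Concatenated heads projected back: same shape as the input X.
    return np.concatenate(heads, axis=1) @ Wo

rng = np.random.default_rng(1)
S = rng.normal(size=(8, 64))    # 8 token features
V = rng.normal(size=(16, 64))   # 16 clip features
out = multi_head_attention(S, V)
print(out.shape)  # (8, 64)
```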
For the textual representation $S \in \mathbb{R}^{N_s \times d_s}$ and visual representation $V \in \mathbb{R}^{N_v \times d_v}$ input to the iterative cross-modal interaction module, we first adopt cross-modal attention, i.e.,

$\tilde{S} = FFN\big(LN(S + MA(S, V))\big), \qquad \tilde{V} = FFN\big(LN(V + MA(V, S))\big),$

where $LN$ denotes layer normalization and $FFN$ denotes the feed-forward layer. To retain the temporal structure of the video and the grammar of the sentence, we add learnable positional encodings to the input of each modality. Through the above attention operation, the features of each modality can freely interact with the other modality to learn a fine-grained semantic alignment.
To model the inner-modal context after cross-modal interaction, we further apply inner-modal attention, which is similar to the calculation in Equations (3) and (4), except that the multi-head attention is applied only to the single-modal representation, i.e., self-attention on single-modal features. After 6 iterations of cross-modal interaction and inner-modal modeling, the enhanced features S and V are fed into the subsequent module to predict a cross-modal semantic alignment map, and grounding is performed based on it.
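One ICIM layer for the visual stream might be sketched as follows, assuming single-head attention for brevity and the usual residual/layer-norm arrangement; the exact sub-layer ordering and weight shapes inside FSAN are assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(Q_in, KV_in, d):
    # Simplified single-head attention without learned projections.
    return softmax(Q_in @ KV_in.T / np.sqrt(d)) @ KV_in

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def icim_layer_visual(V, S, W1, W2):
    """One illustrative ICIM layer acting on clip features V."""
    d = V.shape[1]
    V = layer_norm(V + attn(V, S, d))               # cross-modal: clips attend to tokens
    V = layer_norm(V + attn(V, V, d))               # inner-modal self-attention
    V = layer_norm(V + np.maximum(V @ W1, 0) @ W2)  # feed-forward sub-layer
    return V

rng = np.random.default_rng(0)
d = 64
V = rng.normal(size=(16, d)); S = rng.normal(size=(8, d))
W1 = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
W2 = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)
out = icim_layer_visual(V, S, W1, W2)
print(out.shape)  # (16, 64)
```

The full ICIM stacks 6 such layers for each modality, with an analogous layer updating the token features.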

Semantic Alignment Map
After iterative cross-modal and inner-modal attention in ICIM, the correspondence between each pair of textual tokens and video clips can be fully explored. Therefore, a token-by-clip Semantic Alignment Map (SAP) $P$ of size $N_s \times N_v$ can be learned for temporal grounding. Formally,

$P_{ij} = \mathrm{Norm}(W_s w_i) \cdot \mathrm{Norm}(W_v v_j),$

where $W_s \in \mathbb{R}^{d_l \times d_s}$ and $W_v \in \mathbb{R}^{d_l \times d_v}$ are learnable parameters, $\cdot$ denotes dot product, and $\mathrm{Norm}(\cdot)$ denotes L2-normalization.
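The map computation can be sketched directly; `semantic_alignment_map` is a hypothetical helper, and the dimensions and random weights are illustrative:

```python
import numpy as np

# Sketch of the SAP: token and clip features are linearly projected into a
# shared d_l-dim space, L2-normalized, and compared by dot product, giving
# a map P of shape (N_s, N_v) with entries in [-1, 1].
def semantic_alignment_map(S, V, W_s, W_v):
    Ts = S @ W_s.T                                   # (N_s, d_l)
    Tv = V @ W_v.T                                   # (N_v, d_l)
    Ts /= np.linalg.norm(Ts, axis=1, keepdims=True)  # L2-normalize rows
    Tv /= np.linalg.norm(Tv, axis=1, keepdims=True)
    return Ts @ Tv.T                                 # cosine relevance scores

rng = np.random.default_rng(0)
S = rng.normal(size=(8, 512)); V = rng.normal(size=(16, 512))
W_s = rng.normal(size=(256, 512)); W_v = rng.normal(size=(256, 512))
P = semantic_alignment_map(S, V, W_s, W_v)
print(P.shape)  # (8, 16)
```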
The value of the (i, j)-th element $P_{ij}$ of the SAP represents the relevance between the i-th textual token and the j-th video clip. However, the SAP can learn the cross-modal relationship only with supervision that indicates semantic alignment. To this end, we adopt a video-level matching loss, calculated as:

$L_{tri} = \max\big(0, \Delta - S(V, S) + S(V, S^-)\big),$

where $\Delta$ is a margin, $S^-$ is a non-matching description sentence randomly sampled from the dataset, and $S(V, S)$ is the matching score function between video V and sentence S, defined as:

$S(V, S) = \max_{s,e} SC_{s,e},$

which indicates the response between the most relevant moment and the input sentence; $SC_{s,e}$ is computed by the grounding module (Section 3.4). Some textual expressions may appear in both the positive description and the negative one, confusing the model during matching classification. To this end, we mask the repeated tokens on the semantic alignment map $P^-$ of the input video V and the sampled sentence $S^-$.
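A hedged sketch of this objective, where the margin value, the stand-in `score_best_moment` (a toy proxy for the grounding module's best segment score), and the token lists are assumptions:

```python
import numpy as np

def score_best_moment(P):
    """Toy stand-in for max over (s, e) of a segment score on map P:
    here simply the best mean map response over all segments."""
    n_v = P.shape[1]
    best = -np.inf
    for s in range(n_v):
        for e in range(s, n_v):
            best = max(best, P[:, s:e + 1].mean())
    return best

def matching_loss(P_pos, P_neg, margin=0.5):
    # Triplet-style margin loss over matched vs. non-matched pairs.
    return max(0.0, margin - score_best_moment(P_pos) + score_best_moment(P_neg))

def mask_repeated_tokens(P_neg, pos_tokens, neg_tokens):
    """Zero out rows of the negative map whose token also appears in the
    positive description, as described in the text."""
    P_neg = P_neg.copy()
    for i, tok in enumerate(neg_tokens):
        if tok in set(pos_tokens):
            P_neg[i] = 0.0
    return P_neg

rng = np.random.default_rng(0)
P_pos, P_neg = rng.uniform(-1, 1, (4, 6)), rng.uniform(-1, 1, (4, 6))
P_neg = mask_repeated_tokens(P_neg, ["person", "takes", "a", "phone"],
                             ["a", "dog", "runs", "outside"])
loss = matching_loss(P_pos, P_neg)
print(loss >= 0.0)  # True
```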

Alignment-Based Grounding Module
The elements of the semantic alignment map indicate the relevance between video clips and textual tokens, which leads to the idea of a fine-grained alignment-based grounding module. The core idea is that if a specific clip $v_j$ is part of the described moment $V_{s,e}$, where $V_{s,e}$ denotes the video segment from the s-th to the e-th clip, the semantics of $v_j$ tend to be highly relevant to all tokens in the description; if $v_j$ lies outside the correct moment, at least one token in the query is irrelevant to it. Therefore, we score all possible temporal segments formed by video clips using the relevance scores of clips both inside and outside the segment. Considering the clips in the segment as positive clips and those outside it as negative clips, the relevance score between $V_{s,e}$ and the query is defined as:

$SC_{s,e} = \frac{1}{N_s} \sum_{i=1}^{N_s} \Big( \frac{1}{e-s+1} \sum_{j \in [s,e]} P_{ij} - \frac{1}{|\overline{[s,e]}|} \sum_{j \in \overline{[s,e]}} P_{ij} \Big),$

where $\overline{[s,e]}$ denotes the aggregation of negative clips, i.e., the complementary set of $V_{s,e}$. The higher the average score among positive clips for each token, the more likely that $V_{s,e}$ is relevant to the query; the lower the average response of negative clips, the less likely that $V_{s,e}$ is redundant. Through this two-fold filtering, the moment that is more relevant to all tokens in the query is given a higher score, and is therefore more likely to be proposed as the grounding result. Although the contrastive loss in Equation (6) enables the model to learn cross-modal semantic alignment, temporal discrimination, which is vital for grounding, cannot be learned under coarse video-level supervision. To provide fine-level temporal supervision for the fine-grained cross-modal alignment, we further devise a novel two-fold loss on the semantic alignment map $P$, including an inner-sample loss and an outer-sample loss. Specifically, the inner-sample loss aims to enhance the grounded moment on the fine-level alignment map $P$.
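Under the assumption that per-token inside/outside averages combine as described, the scoring and grounding step can be sketched as:

```python
import numpy as np

def segment_score(P, s, e):
    """P: (N_s, N_v) alignment map; segment covers clips s..e inclusive.
    Per token: mean relevance inside the segment minus mean relevance
    outside it; the result is averaged over tokens."""
    inside = P[:, s:e + 1].mean(axis=1)
    mask = np.ones(P.shape[1], dtype=bool)
    mask[s:e + 1] = False
    outside = P[:, mask].mean(axis=1) if mask.any() else np.zeros(P.shape[0])
    return (inside - outside).mean()

def ground(P):
    """Score every (s, e) segment and return the best one."""
    n_v = P.shape[1]
    scored = {(s, e): segment_score(P, s, e)
              for s in range(n_v) for e in range(s, n_v)}
    return max(scored, key=scored.get), scored

# A map where all tokens clearly align with clips 2..4 should ground there.
P = np.full((3, 6), -0.5)
P[:, 2:5] = 0.9
best, scores = ground(P)
print(best)  # (2, 4)
```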
We promote the response among clips in all possible segments, weighted by the confidence of the prediction. At the beginning of training, the model is uncertain about the grounding results; therefore, the weight $SC_{s,e}$ varies little among moment options. As training continues, the model can easily assign higher scores to positive moments, so the loss weight becomes larger for positive moments and smaller for negative ones. Therefore, the model does not deviate much from the correct solution.
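Since the text gives only this qualitative description, the following confidence-weighted form of the inner-sample term is purely an illustrative assumption:

```python
import numpy as np

# Heavily hedged sketch: in-segment responses are promoted, with each
# candidate segment weighted by its current confidence SC_{s,e}. This
# particular (1 - mean response) form is an assumption, not the paper's
# exact loss.
def inner_sample_loss(P, segment_scores):
    """P: (N_s, N_v) map; segment_scores: {(s, e): SC_{s,e}}."""
    loss = 0.0
    for (s, e), sc in segment_scores.items():
        # Push up responses of clips inside each candidate segment,
        # weighted by confidence in that segment.
        loss += sc * (1.0 - P[:, s:e + 1].mean())
    return loss / len(segment_scores)

P = np.clip(np.random.default_rng(0).normal(0.2, 0.3, (4, 6)), -1, 1)
scores = {(1, 3): 0.8, (0, 5): 0.1}   # e.g. from the grounding module
l_in = inner_sample_loss(P, scores)
print(l_in)
```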

Training and Inference
The overall training objective is an aggregation of the aforementioned losses:

$L = \lambda_1 L_{tri} + \lambda_2 L_{inner} + \lambda_3 L_{outer},$

where $\lambda_*$ are hyper-parameters satisfying $\lambda_1 + \lambda_2 + \lambda_3 = 1$. During inference, the moment with the highest score $SC_{s,e}$ is selected as the grounding result.
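The combination and inference step can be sketched as below; the loss values and segment scores are dummies, while the equal 1/3 weights follow the implementation details reported later:

```python
# Sketch of combining the three weighted losses and selecting the grounded
# moment at inference time.
def total_loss(l_tri, l_inner, l_outer, lambdas=(1/3, 1/3, 1/3)):
    assert abs(sum(lambdas) - 1.0) < 1e-9   # weights sum to one
    return lambdas[0] * l_tri + lambdas[1] * l_inner + lambdas[2] * l_outer

def infer(segment_scores):
    """Pick the (s, e) segment with the highest SC_{s,e}."""
    return max(segment_scores, key=segment_scores.get)

print(total_loss(0.9, 0.3, 0.6))                        # ~0.6 (mean of the three)
print(infer({(0, 2): 0.1, (3, 5): 0.7, (1, 4): 0.4}))   # (3, 5)
```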

DiDeMo.
The Distinct Describable Moments (DiDeMo) dataset was first proposed in (Hendricks et al., 2017). Charades-STA (Gao et al., 2017) contains about 6k videos whose contents are mainly indoor activities. However, as shown in Table 1, compared to other datasets, Charades-STA is limited in terms of total video amount, number of video-sentence pairs, and vocabulary size. The vocabulary size is critical for enriching the linguistic semantics, hence the semantic diversity of the dataset is limited. Evaluation Metrics. We follow the settings of previous methods (Mithun et al., 2019; Lin et al., 2020). For the ActivityNet-Captions dataset, we report results for intersection-over-union (IoU) ∈ {0.5, 0.3, 0.1} and Recall@{1, 5}. On the DiDeMo dataset, considering the limited number of candidates (21) and the variance among different annotators, we measure performance with Rank@1, Rank@5, and mean intersection over union (mIoU). Here Rank@k means the percentage of samples where the ground-truth moments labeled by different annotators are on average ranked higher than k. Following (Hendricks et al., 2017), we discard the worst-ranked ground-truth label to reduce the influence of outliers.
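These metrics can be sketched as follows for R@1 at a given IoU threshold; the segments and thresholds below are toy values:

```python
# Sketch of the evaluation metrics: temporal IoU between a predicted and a
# ground-truth segment, and "R@1, IoU >= m" over a small set of samples.
def temporal_iou(pred, gt):
    """Segments are (start, end) pairs, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, iou_thresh):
    hits = sum(temporal_iou(p, g) >= iou_thresh for p, g in zip(preds, gts))
    return hits / len(preds)

preds = [(2.0, 8.0), (0.0, 5.0), (10.0, 20.0)]
gts   = [(3.0, 9.0), (6.0, 9.0), (11.0, 19.0)]
print(round(recall_at_1(preds, gts, 0.5), 2))  # 0.67
```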

Implementation Details
For fair comparison, we utilize the same released visual features as previous methods (Mithun et al., 2019; Lin et al., 2020). For videos in ActivityNet-Captions, we adopt C3D (Tran et al., 2015) features; for DiDeMo, we adopt VGG (Simonyan and Zisserman, 2014) features. Note that we report the performance of baseline models using the same features as ours. The dimension of these features is reduced from 4096 to 500 using PCA. The loss weights used to train FSAN are all set equally to 1/3. The hidden dimension is set to 512 for all datasets. We adopt the Adam algorithm with an initial learning rate of 0.0001. The batch size is set to 128 for all datasets, and the dropout rate is set to 0.1. FSAN is implemented in PyTorch and trained on one RTX 3090 GPU.

Comparisons with State-of-the-art Methods
We compare the proposed FSAN with multiple baselines, including recently published fully supervised methods and weakly supervised methods such as TGA (Mithun et al., 2019), WSLLN, SCN (Lin et al., 2020), and VLANet (Ma et al., 2020). Experiments on DiDeMo. Table 2 illustrates the performance comparison on the DiDeMo dataset. It can be observed that FSAN outperforms TGA and WSLLN on all three metrics. Compared to VLANet, FSAN performs better overall, except on R@5. This may be due to the surrogate proposal selection module introduced in VLANet (Ma et al., 2020), which in fact performs a two-stage candidate selection and removes temporally overlapped candidates. Experiments on ActivityNet-Captions. On ActivityNet-Captions, SCN suffers from its limited number of candidates.

Ablation Study
To investigate the importance of each component in FSAN, we conduct ablation experiments. Results are shown in Table 4, and we give detailed discussions in the following subsections. Note that mean intersection-over-union (mIoU), which calculates the average IoU of rank-1 predictions, is not reported in the previous subsection; however, we report and compare mIoU among FSAN variants in this section. Impact of Grounding Module. To validate the effectiveness of the alignment-based grounding module, we devise a common yet competitive prediction layer on top of the visual-branch output of ICIM. Specifically, we apply attention pooling on the text-aware visual features V, then apply a three-layer MLP to predict matching scores for all possible temporal segments. Results are shown in the 1st row of Table 4. It can be observed that without the grounding module based on the semantic alignment map, the performance drops rapidly on the strict IoU metrics (IoU = 0.5, 0.3), which demonstrates the temporal-precision improvement brought by the token-by-clip alignment map.
Impact of Loss on SAP. The 2nd row of Table 4 shows the result without the inner-sample and outer-sample losses. Under this setting, the grounding performance drops on all metrics compared to the full FSAN. The explanation is that without the two losses refining the SAP, FSAN is trained only by the video-level matching loss $L_{tri}$. Hence the model can learn video-level coarse semantic alignment while neglecting token-wise sentence semantics as well as the temporal structure of the video.
Impact of ICIM. We study the roles of cross-modal attention and inner-modal attention in the iterative cross-modal interaction module. It can be observed from the 3rd and 4th rows of Table 4 that removing either attention degrades performance, since the two jointly model the temporal structure among video clips, the sentence structure among tokens, and the token-by-clip cross-modal semantic alignment.

Figure 3: Visualization of the semantic alignment map and grounding results by FSAN. The rows and columns of the map correspond to text tokens and video clips, respectively. The green boxes on the map refer to the ground-truth temporal region, while the red boxes refer to the output of FSAN.

Analysis and Visualization
We also visualize some examples of the grounding results of FSAN on the ActivityNet-Captions dataset, as shown in Figure 3. For each video-sentence pair, we visualize the token-clip semantic alignment map as well as the ground-truth and predicted temporal boundaries.
In the first example, the description sentence is long and complicated, consisting of three sequential activities (stop, raise, and exercise up and down). FSAN achieves a high IoU (0.88) on this difficult case, which indicates its strong ability to learn fine-grained semantics from both the visual and linguistic modalities. The second example shows the ability of FSAN not only to detect objects and their actions in video, but also to understand abstract descriptions of the video (credits, text). Better understanding of abstract descriptions is one of the key points for TLG to develop beyond action localization. In addition, in the second example, the main action shave is occluded in some frames, which is challenging for grounding. Although the occlusion is reflected in the visualized alignment map, FSAN manages to locate the complete moment.
To further analyze the performance of FSAN, we plot how the performance varies as video length grows. As shown in Figure 4, the performance of FSAN is relatively stable as video length grows, with a slight downward trend. For example, the mIoU for the shortest videos (2-12s, 130 cases) and the longest videos (>230s, 140 cases) is 42.99 and 34.68, respectively.

Conclusion
In this paper, we present a novel framework for temporal language grounding, namely the Fine-grained Semantic Alignment Network (FSAN). To capture fine-level video-language semantic alignment, we devise an iterative cross-modal interaction module, which enables single-modal representations to interact with each other. Furthermore, we propose to perform temporal grounding based on a semantic alignment map, which removes the need to generate moment candidates. We conduct experiments on two widely-used benchmarks, ActivityNet-Captions and DiDeMo, and achieve state-of-the-art performance on both, demonstrating the effectiveness of the proposed FSAN.