DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.


Introduction
Temporal Language Grounding (TLG) is the task of determining the temporal boundaries of video moments that semantically correspond to a natural language query (Hendricks et al., 2018; Gao et al., 2021a). TLG is a complex and challenging task, since video processing demands understanding across multiple modalities, including image, text, and even audio. Nevertheless, TLG has received increasing attention in the CV and NLP communities because it supports myriad downstream tasks, e.g. VQA (Lei et al., 2018; Ye et al., 2017; Wang et al., 2019a), relation extraction (Gao et al., 2021b), and information retrieval (Ghosh et al., 2019).
Early methods for TLG fuse textual and visual features via concatenation with linear projection (Gao et al., 2017; Wang et al., 2019b) or similarity functions (Anne Hendricks et al., 2017; Hendricks et al., 2018). To further enhance localization performance, recent works divide the video into equal-length moments and employ the attention mechanism of the Transformer to learn the relations between video moments and the language query. For example, the Moment-DETR model (Lei et al., 2021) concatenates the visual moments and the textual tokens, then passes the concatenated sequence to a Transformer encoder to capture the alignment. The UMT model (Liu et al., 2022) adds an audio channel to the Transformer encoder to construct a unified architecture. However, for the TLG task, previous works have shown that such an attention-based approach is still insufficient to capture the rich semantic interaction between the text query and the video moments (Xu et al., 2023). As in Figure 2, the localized video moments hardly align with the query statement. Moreover, the attention mechanism in the Transformer encoder does not assume any prior knowledge about the input elements (Ma et al., 2022). For language localization, this design choice ignores the fact that video moments in temporal neighborhoods tend to exhibit closely related features. Therefore, this approach can lead to ineffective modeling of joint moment-query inputs. As evidence, Figure 1 shows the distribution of joint moment-query representations: the representations of target moment-query localizations and the remaining ones mingle together, making the grounding task more challenging.
To address these limitations, we dive deeper into refining the distribution of moment-query representations. In addition to supervised training to correctly localize the language in the video, we perform unsupervised training to explicitly maximize the likelihood of moment-query localizations. This helps the multimodal model focus on capturing the distribution of target moment-query pairs and distinguishing them from others.
As such, we propose to model the distribution of moment-query representations under the framework of the Energy-Based Model (EBM). In contrast to other probabilistic models such as normalizing flows (Rezende and Mohamed, 2015) or autoencoders (Kingma and Welling, 2013), the EBM framework allows us to directly integrate the video moment's salience score into the density function, which results in accurate modeling of moment-query representations. Our implementation develops into a contrastive divergence objective that minimizes the energy of relevant localizations while maximizing the energy of deviating ones. Accordingly, the framework needs negative samples to represent high-energy regions. We therefore adapt the Langevin dynamics equation to sample negative inputs directly from the EBM distribution. This approach is appropriate because the modeled distribution initially does not match the true distribution, so the generated samples are assured to be negative; as training progresses, the modeled distribution approximates the true one, and the Langevin equation produces hard negative samples.
In addition, we incorporate an inductive bias that captures local dependencies among the moment-query inputs. We propose DemaFormer, in which we equip the Transformer architecture with a Damped Exponential Moving Average (DEMA) computation. Technically, the computation applies exponentially decaying factors that incorporate information from adjacent inputs. We further introduce learnable damping coefficients that enable the model to absorb adjacent information to a sufficient degree while preserving distinction among inputs. Eventually, we combine the DEMA computation with the attention mechanism to construct the DemaFormer encoder and decoder modules.
To sum up, the contributions of our paper can be summarized as follows:
• We propose DemaFormer, a novel architecture for temporal language grounding. DemaFormer integrates exponential moving average with learnable damping coefficients into the attention mechanism to appropriately capture dependency patterns among video-language inputs.
• We propose a novel energy-based learning framework for temporal language grounding. The objective for the energy-based model is formulated as a contrastive divergence that assists the classical grounding loss in modeling moment-query representations.
• We conduct extensive experiments to demonstrate the superiority of our method over previous state-of-the-art baselines. Furthermore, we conduct comprehensive ablation studies to evaluate our proposed components and deliver meaningful insights.
Related Work
Temporal Language Grounding. TLG is the task of locating video moments relevant to a given language query. Early approaches use LSTMs to encode the language query and CNNs for visual clips, and then estimate cross-modal similarity scores (Anne Hendricks et al., 2017; Hendricks et al., 2018). Modern techniques leverage the attention mechanism and structured graph networks to learn the video-language relationship (Xiao et al., 2021; Gao et al., 2021a; Zhang et al., 2020a; Yuan et al., 2019). Recent works (Liu et al., 2022; Lei et al., 2021) apply Transformer components to eliminate hand-crafted pre-processing and post-processing steps and make the model end-to-end trainable.
Vision-Language Representation Learning.

Methodology
Our task is to localize moments in videos from natural language queries. Formally, given a language query q of L_q tokens and a video v composed of L_v equal-length input moments, where each moment is represented by a visual frame sampled from it, we aim to localize L_m time spans from v that are aligned with the query q, denoted as {(l_i, r_i)}_{i=1}^{L_m}, where each moment spans from l_i to r_i scaled by the video length, and L_m < L_v.
We first describe our proposed damped exponential moving average attention for modeling video-language inputs in Section 3.1, then the overall architecture in Section 3.2, and finally the training strategy empowered with energy-based modeling in Sections 3.3 and 3.4.

Damped Exponential Moving Average (DEMA) Attention
In this section, we consider the input to our DemaFormer encoder X_e and decoder X_d in Section 3.2 as a general input X = {x_i}_{i=1}^{L_X} of length L_X. We delineate the exponential moving average (EMA) with the damping influence applied to X as follows.

DEMA Computation. At first, we use a linear layer to map each input x_i to an intermediate representation g_i. Then, we estimate the current hidden state l_i as a combination of the previous hidden state l_{i-1} and the current intermediate input g_i, with weighting coefficients that decrease exponentially and are relaxed by damping coefficients:

l_i = α ⊙ g_i + (1 − α ⊙ δ) ⊙ l_{i−1},

where α ∈ (0, 1)^d denotes the weighting coefficients, δ ∈ (0, 1)^d the damping coefficients, and ⊙ the elementwise product. Both α and δ are randomly initialized and learnable during training. Subsequently, we project the hidden state l_i back to the original input space to obtain the DEMA output.

DEMA Attention. Given the input X, we compute the DEMA output as above and pass it through a SiLU non-linear layer to obtain Z, where SiLU denotes the self-gated activation function (Ramachandran et al., 2017; Elfwing et al., 2018). We experiment with other activation functions in Appendix D. Subsequently, we perform the attention operation, utilizing Z, which exhibits local dependencies, as the value tensor:

Z′ = softmax(QK^T / √d_K) Z,

where d_K denotes the dimension of K. Thereafter, we aggregate the original input X and the attention output Z′ in an adaptive manner.
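For concreteness, the following is a minimal PyTorch sketch of the DEMA attention described above. The module name DemaAttention, the parameter initialization, the query/key projections, and the sigmoid-gated aggregation of X and Z′ are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DemaAttention(nn.Module):
    """Sketch of DEMA attention: a damped EMA over the sequence produces the
    value tensor of a standard scaled dot-product attention."""

    def __init__(self, d_model: int):
        super().__init__()
        self.to_g = nn.Linear(d_model, d_model)      # x_i -> intermediate g_i
        self.from_l = nn.Linear(d_model, d_model)    # hidden l_i -> input space
        # alpha, delta are kept in (0, 1) via a sigmoid over free parameters
        self.alpha_logit = nn.Parameter(torch.randn(d_model))
        self.delta_logit = nn.Parameter(torch.randn(d_model))
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(2 * d_model, d_model)  # adaptive aggregation (assumed gating form)

    def dema(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        alpha = torch.sigmoid(self.alpha_logit)
        delta = torch.sigmoid(self.delta_logit)
        g = self.to_g(x)
        l = torch.zeros_like(g[:, 0])
        hidden = []
        for t in range(g.size(1)):                   # damped EMA recurrence
            l = alpha * g[:, t] + (1.0 - alpha * delta) * l
            hidden.append(l)
        return self.from_l(torch.stack(hidden, dim=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = F.silu(self.dema(x))                      # local-dependency value tensor
        q, k = self.q_proj(x), self.k_proj(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        z_prime = attn @ z
        gate = torch.sigmoid(self.gate(torch.cat([x, z_prime], dim=-1)))
        return gate * z_prime + (1.0 - gate) * x      # adaptive mix of input and attention output
```

The sequential recurrence is written as an explicit loop for clarity; in practice the damped EMA can be computed in parallel as a convolution with exponentially decaying kernels.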

Overall Architecture
Figure 3 illustrates our DemaFormer model for temporal language grounding. We explain the architecture in detail below.
Uni-modal Encoding. Given a video v consisting of L_v moments and a text query with L_q tokens, we employ pre-trained models to extract visual moment features and audio features A = {a_i}_{i=1}^{L_v}.

Audio-Dependent Video Encoding. Because audio signals carry heavy noise (Liu et al., 2022; Badamdorj et al., 2021), we only perform one layer of attention to fuse the audio information into the visual sequence; particularly, we compute cross-attention between the video and audio inputs to obtain audio-dependent video features.

DemaFormer Encoder. Inspired by (Lei et al., 2021), we concatenate the audio-dependent video tokens and the language tokens to form the input sequence X_e. We push X_e through the DemaFormer encoder of N_e encoder layers. Each encoder layer comprises a DEMA attention layer, a normalization layer, a ReLU non-linear layer, and a residual connection; we denote the input and the output of the i-th encoder layer as X^(i) and O_e^(i), respectively.
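The audio-video fusion step can be sketched as a single cross-attention layer. The module name, the use of nn.MultiheadAttention, and the residual-plus-LayerNorm form below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AudioDependentVideoEncoder(nn.Module):
    """One cross-attention layer fusing audio into the visual sequence
    (a sketch; projection sizes and the residual form are assumed)."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (batch, L_v, d_model), audio: (batch, L_v, d_model)
        fused, _ = self.cross_attn(query=video, key=audio, value=audio)
        return self.norm(video + fused)   # audio-dependent video features

# The encoder input is then the concatenation along the length axis:
# X_e = torch.cat([video_audio, query_tokens], dim=1)
```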
We take the output of the N e -th encoder layer as the final output O e of the DemaFormer encoder.
DemaFormer Decoder. The input to the decoder is the first L_v DemaFormer encoder outputs, i.e. X_d = {o_{e,i}}_{i=1}^{L_v}. The input sequence is forwarded through N_d decoder layers, each of which is composed of a DEMA attention layer, a normalization layer, a non-linear layer, and a residual connection. Analogous to the encoder, we take the N_d-th layer output as the final output O_d of the decoder.

Prediction Heads. For each output o_{d,i}, we designate four separate linear layers to predict the salience score ŝ_i, the center ĉ_i, the center offset ĉo_i, and the moment width ŵ_i. Thus, each candidate moment's temporal bound becomes [ĉ_i + ĉo_i − ŵ_i/2, ĉ_i + ĉo_i + ŵ_i/2]. At test time, we extract the top-L_m moments whose salience scores are the largest.
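A small sketch of the prediction heads and the span parameterization follows; the class name PredictionHeads and the single-video, unbatched interface are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Four linear heads per decoder output: salience, center, center offset, width
    (a sketch of the parameterization described above)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.salience = nn.Linear(d_model, 1)
        self.center = nn.Linear(d_model, 1)
        self.offset = nn.Linear(d_model, 1)
        self.width = nn.Linear(d_model, 1)

    def forward(self, o_d: torch.Tensor, top_k: int):
        # o_d: (L_v, d_model) decoder outputs for one video
        s = self.salience(o_d).squeeze(-1)
        c = self.center(o_d).squeeze(-1)
        co = self.offset(o_d).squeeze(-1)
        w = self.width(o_d).squeeze(-1)
        left = c + co - w / 2          # normalized start of each candidate moment
        right = c + co + w / 2         # normalized end of each candidate moment
        keep = s.topk(top_k).indices   # top-L_m moments by predicted salience
        return s[keep], left[keep], right[keep]
```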

Energy-Based Models for Modeling Moment-Query Representations
Given our joint video-language decoder outputs O_d = {o_{d,i}}_{i=1}^{L_v}, we designate the EBM to specify the density of O_d via the Boltzmann distribution:

p_θ(o_{d,i}) = exp(−E_θ(o_{d,i})) / Z_θ,

where E_θ denotes the energy function and Z_θ the normalizing constant. Inspired by (Du and Mordatch, 2019), we adopt Langevin dynamics to sample from the above distribution:

õ_{d,i}^{(k)} = õ_{d,i}^{(k−1)} − (γ/2) ∇_o E_θ(õ_{d,i}^{(k−1)}) + ε^{(k)},  ε^{(k)} ~ N(0, γ),   (24)

where γ is a hyperparameter specifying the variance of the noise. We iterate Eq. (24) for K steps and take õ_{d,i}^{(K)} as the sampling outcome. Our target is to better align the video-query representations O_d by minimizing their negative log-likelihood, i.e.

L_NLL(θ) = −E_{o_{d,i}} [log p_θ(o_{d,i})].
This can be achieved by differentiating L_NLL(θ) and optimizing the resulting contrastive divergence via the gradient

∂L_NLL/∂θ = E_{o^+_{d,i}} [∂E_θ(o^+_{d,i})/∂θ] − E_{o^−_{d,i}} [∂E_θ(o^−_{d,i})/∂θ],   (26)

whose detailed derivation can be found in Appendix A. Because the samples generated by Eq. (24) do not approximate the true distribution in the beginning but gradually converge to it, we take these samples as the negatives o^−_{d,i} and assign their energy values a decaying weight α with a minimum value of α_min. We take the moment-query inputs whose groundtruth salience scores are larger than a threshold ρ as the positive samples o^+_{d,i}. Moreover, because we maximize the salience scores while minimizing the energy values of the positive inputs (and vice versa for the negative inputs), we adopt a negative salience-energy relation, i.e. the energy of a localization is the negation of its predicted salience score. As such, θ denotes the DemaFormer parameters, and we obtain the final formulation of L_NLL, in which the weight on the negative-sample energies decays with the current training epoch n_epoch.
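The following Python sketch illustrates how negatives could be drawn with Langevin dynamics and how a contrastive-divergence-style loss could be formed. The helper names, the initialization of the chain from the current representations, the step-size/noise coupling, and the exponential decay schedule for the negative weight are all assumptions made for illustration, not our exact procedure.

```python
import torch

def langevin_negatives(o_pos, energy_fn, steps=100, gamma=0.01):
    """Run K Langevin steps to draw negative samples from the current EBM
    (a sketch; the chain is initialized from the given representations)."""
    o = o_pos.detach().clone()
    for _ in range(steps):
        o.requires_grad_(True)
        grad = torch.autograd.grad(energy_fn(o).sum(), o)[0]
        o = (o - 0.5 * gamma * grad
             + gamma ** 0.5 * torch.randn_like(o)).detach()
    return o

def ebm_nll_loss(energy_fn, o_pos, o_neg, epoch, alpha_min=0.1, decay=0.95):
    """Contrastive-divergence-style objective: push positive energies down and
    (epoch-decayed) negative energies up. The decay schedule is an assumption."""
    neg_weight = max(alpha_min, decay ** epoch)
    return energy_fn(o_pos).mean() - neg_weight * energy_fn(o_neg).mean()

# Energy is tied to the salience head, E_theta(o) = -salience(o),
# so lowering the energy of positives raises their predicted salience.
```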

Training Objective
From a video-language input, we obtain L_m predictions Ŷ = {(ŝ_i, ĉ_i, ĉo_i, ŵ_i)}_{i=1}^{L_m}. During training, L_m is the number of groundtruth localizations, while during testing L_m is selected based on validation. We define the matching loss L_match between the predictions and the groundtruth as

L_match = λ_1 L_s + λ_2 L_c + λ_3 L_w + λ_4 L_co,

where λ_{1,2,3,4} denote the hyperparameter weights for the salience, center, width, and offset losses, respectively. We jointly optimize the matching loss with the EBM negative log-likelihood (NLL) loss:

L = L_match + λ_NLL L_NLL,

where λ_NLL denotes the weight scaling the NLL loss.
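A compact sketch of the combined objective is given below. The L1 forms of the center, width, and offset terms follow the definitions in Appendix C; the binary cross-entropy form of the salience term and the default λ values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def matching_loss(pred, gold, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """L_match = l1*L_s + l2*L_c + l3*L_w + l4*L_co (weights are placeholders)."""
    l_s = F.binary_cross_entropy_with_logits(pred["salience"], gold["salience"])  # assumed salience loss form
    l_c = F.l1_loss(pred["center"], gold["center"])
    l_w = F.l1_loss(pred["width"], gold["width"])
    l_co = F.l1_loss(pred["offset"], gold["offset"])
    return lambdas[0] * l_s + lambdas[1] * l_c + lambdas[2] * l_w + lambdas[3] * l_co

def total_loss(pred, gold, nll, lambda_nll=0.1):
    """Joint objective: matching loss plus the weighted EBM NLL term."""
    return matching_loss(pred, gold) + lambda_nll * nll
```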

Datasets
We evaluate our methods on four benchmark datasets for the temporal language grounding task: QVHighlights, Charades-STA, YouTube Highlights, and TVSum.
QVHighlights is collected by (Lei et al., 2021) to span diverse content on 3 major topics: daily vlog, travel vlog, and news. There are 10,148 videos with 18,367 moments associated with 10,310 queries. We follow (Lei et al., 2018; Liu et al., 2022) to split the dataset into 70% train, 15% val, and 15% test portions.
Charades-STA (Gao et al., 2017) consists of videos about daily indoor activities. The dataset is split into 12,408 and 3,720 moment annotations for training and testing, respectively.

YouTube Highlights is prepared by (Sun et al., 2014) and comprises six categories, i.e. dog, gymnastics, parkour, skating, skiing, and surfing. In each category, we inherit the original training-testing split as the benchmark for the TLG task.
TVSum (Hong et al., 2020) is a video summarization dataset containing 10 event categories. We employ the video title as the language query and a training/testing split of 0.8/0.2 for the experiments.

Experimental Settings
Evaluation Metrics. Our metrics include Rank k@µ, mAP@µ, and Hit@1. Rank k@µ is the percentage of testing samples that have at least one correct localization among the top-k choices, where a localization is correct if its IoU with the groundtruth is larger than the threshold µ. In a similar manner, mAP@µ is the mean average precision of localizations whose IoU is larger than µ. Hit@1 computes the hit ratio for the moment with the highest predicted salience score in a video; we consider a moment hit if its groundtruth salience is larger than or equal to a threshold τ. Following previous works (Lei et al., 2021; Liu et al., 2022), we adopt Rank 1@µ with µ ∈ {0.5, 0.75} and Hit@1 with τ = 4 for the QVHighlights dataset. For the Charades-STA dataset, we use Rank k@µ with k ∈ {1, 5} and µ ∈ {0.5, 0.75}. We apply mAP for both the TVSum and YouTube Highlights datasets.

Implementation Details. For fair comparison with previous works (Liu et al., 2022; Lei et al., 2021), on QVHighlights we use SlowFast (Feichtenhofer et al., 2019) and CLIP (Radford et al., 2021) to obtain features for the video moments and the CLIP text encoder to obtain features for the language queries. For feature extraction on the Charades-STA dataset, we deploy VGG (Simonyan and Zisserman, 2014) and optical flow features for video moments and GloVe embeddings (Pennington et al., 2014) for language tokens. On YouTube Highlights and TVSum, we utilize the I3D model (Carreira and Zisserman, 2017) pre-trained on Kinetics 400 (Kay et al., 2017) to extract moment-level visual representations, and the CLIP text encoder to extract language representations. Furthermore, as in (Liu et al., 2022; Lei et al., 2021), for the QVHighlights dataset we also experiment with pre-training our architecture on noisy automatic speech recognition (ASR) captions before fine-tuning on the downstream training samples. For all audio features, we use the PANN model pre-trained on AudioSet (Gemmeke et al., 2017). We provide detailed hyperparameter settings in Appendix B.
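To make the evaluation protocol above concrete, the following is a small Python sketch of temporal IoU and Rank k@µ. The helper names temporal_iou and rank_k_at_mu and the strict-inequality threshold are our assumptions for illustration.

```python
def temporal_iou(pred, gold):
    """IoU between two (start, end) spans in seconds (or normalized time)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def rank_k_at_mu(all_preds, all_golds, k=1, mu=0.5):
    """Share of test samples whose top-k predictions contain at least one span
    with IoU above mu against some groundtruth span (a sketch)."""
    hits = 0
    for preds, golds in zip(all_preds, all_golds):   # preds sorted by salience
        if any(temporal_iou(p, g) > mu for p in preds[:k] for g in golds):
            hits += 1
    return 100.0 * hits / len(all_preds)
```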

Baselines
To evaluate the proposed methods, we compare our performance with a diverse set of baselines:
• UMT (Liu et al., 2022): a multi-modal transformer model that handles three modalities: audio, text, and video.
• Moment-DETR (Lei et al., 2021): a multi-modal transformer model that applies the original self-attention mechanism, encoding no human prior, and eliminates manually designed pre-processing and post-processing procedures.
• CLIP (Radford et al., 2021): a framework of visual CNN and textual transformer models trained with a contrastive objective.
• XML (Lei et al., 2020): a framework of visual ResNet and textual RoBERTa models with a late-fusion approach to fuse the visual and textual features. We include an additional variant, XML+, which is trained with the combination of our salience loss and the XML loss.
• Joint-VA (Badamdorj et al., 2021): an approach applying the attention mechanism to fuse multi-modal features and a sentinel technique to discount noisy signals.
• MINI-Net (Hong et al., 2020): a weakly supervised learning approach that trains a positive bag of query-relevant moments to possess higher scores than negative bags of query-irrelevant moments.
• LIM-S (Xiong et al., 2019): a TLG approach that leverages video duration as a weak supervision signal.
• DL-VHD (Xu et al., 2021): a framework applying dual learners to capture cross-category concepts and video moment highlight notions.

Table 3: Temporal language grounding results on the YouTube Highlights dataset.

Comparison with State-of-the-arts
We report the results of our DemaFormer and the baselines in Tables 1, 2, 3, and 4 on the QVHighlights, Charades-STA, YouTube Highlights, and TVSum datasets, respectively. As can be seen, our methods significantly outperform previous approaches.
QVHighlights. Compared with the previous best method UMT, our DemaFormer achieves at least 2% absolute improvement across all evaluation settings of Rank 1@µ, specifically 4.54% for µ = 0.5 and 2.01% for µ = 0.7. When pre-trained with the ASR captions, our method outperforms UMT by 2.59% of mAP on average and 2.64 points of Hit@1. These results demonstrate that our method can enhance TLG in diverse settings, including daily vlogs, travel vlogs, and news.

Charades-STA. We improve upon UMT by 3.28% in terms of Rank 1@0.5 and 2.53% in terms of Rank 5@0.5. Under the tighter µ = 0.7, we achieve a larger degree of enhancement, with 5.99% for Rank 1 and 5.18% for Rank 5. We hypothesize that this is because our energy-based modeling can focus on separating highly relevant localizations from other video moment candidates.
TVSum. In Table 4, we compare our model with other competitive approaches. Our architecture achieves the highest mAP scores across all categories as well as overall. In detail, we outperform the second-best UMT by up to 19.34% on the BT portion. Analogous to the QVHighlights experiments, this demonstrates that our framework can better model the video-language inputs in various contexts to polish the temporal language grounding performance.

YouTube Highlights. Similar to TVSum, our DemaFormer with the energy-based modeling approach outperforms prior competitive models across various subsets. Specifically, we gain mAP increases of 1.16% at minimum on the surfing portion and 6.07% at maximum on the dog portion. We attribute such improvement to the more effective modeling of the proposed DEMA computation in attention, since it exhibits the local dependencies of moment-query inputs needed for appropriate modeling in various contexts.

Ablation Studies
In this section, we study the impact of (1) Damped Exponential Moving Average (DEMA), (2) Energy-Based Modeling (EBM), (3) Langevin Sampling Steps, and (4) the Choice of Energy Functions.

With vs. Without DEMA. From Table 5, removing the damping factor results in a slight performance decrease, for example 1.04% and 0.26% in terms of Rank 1@0.7 on QVHighlights and Charades-STA, respectively. The main reason is that without the damping coefficient, the model cannot adjust the amount of information injected from adjacent input elements, which can become excessive and make video moments hard to distinguish. Moreover, we observe that completely eliminating the DEMA computation leads to a significant decrease, specifically up to 2.97% and 2.51% of Rank 1@0.5 on QVHighlights and Charades-STA, respectively, since the model no longer specifies the moment-query distribution effectively.
With vs. Without EBM. Investigating the last rows of Table 5, we observe that energy-based modeling substantially improves the grounding performance. Particularly, adding the EBM training objective brings enhancements of 2.67% on Charades-STA and 3.16% on QVHighlights in terms of Rank 1@0.5. This substantiates that the EBM can successfully capture the distribution of moment-query representations in which relevant localizations are separated from the irrelevant ones. We provide more illustrative analysis in Section 4.6 and Appendix F.

Langevin Sampling Steps. We investigate the impact of the number of sampling steps K in our Langevin equation (24) upon DemaFormer. Figure 4 shows that DemaFormer's performance increases as K grows. However, once K passes 100, the model performance converges with negligible fluctuation. We hypothesize that at K = 100 the added noise is sufficient to segregate target localizations from the rest.

Choice of Energy Functions. We experiment with replacing the salience-based energy with similarity functions computed in an elementwise and pooling-based manner (we provide the formulations in Appendix E), and evaluate the performance of the variants in Table 6. As the comparison shows, directly utilizing the salience score provides the most accurate localizations. This suggests that similarity functions do not fully capture the query-relevance concept.

Qualitative Analysis
We illustrate a prediction example from the QVHighlights dataset by our DemaFormer in Figures 1, 2, and 5. We observe that our model correctly localizes the target moments with respect to the user query. Our predicted salience scores also align with the groundtruth scores, which are measured by averaging the three annotated scores in the dataset.
In addition, we utilize t-SNE to visualize the moment-query representations of the example in Figure 1. We find that the representations of the target localizations stay separate from the remaining ones, whereas those from the UMT model mingle together. This explains the accurate localization of DemaFormer and verifies the effective modeling of the proposed DEMA mechanism combined with the energy-based modeling. We provide more examples in Appendix F.

Conclusion
In this paper, we propose DemaFormer, a novel neural architecture for the temporal language grounding (TLG) task. By leveraging the exponential moving average approach with a damping factor, DemaFormer is capable of incorporating local dependencies among moment-query localizations. Additionally, we propose an energy-based strategy to explicitly model the localization distribution. On four public benchmarks for the TLG task, our method is effective and outperforms state-of-the-art approaches by a significant margin.

Limitations
Our framework requires negative sampling via the Langevin dynamics equation. This incurs additional compute cost while training the language grounding model. Also, although we propose general methods to enhance the grounding performance, we have not studied their impact in cross-domain scenarios, where the model is trained on one domain (e.g. skiing videos) and tested on another (e.g. skating videos). We leave these gaps as future work to optimize our framework in more diverse contexts and use cases.

Acknowledgement
This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG3-PhD-2023-08-051T). Thong Nguyen is supported by a Google Ph.D. Fellowship in Natural Language Processing.
C Effect of the Center Offset Term

We define L_c = (1/L_m) Σ_{i=1}^{L_m} ||c_i − ĉ_i|| to be the center loss term, L_w = (1/L_m) Σ_{i=1}^{L_m} ||w_i − ŵ_i|| the width loss term, and L_co = (1/L_m) Σ_{i=1}^{L_m} ||co_i − ĉo_i|| the center offset loss term. Because the salience, center, and width terms are mandatory, we justify the necessity of the center offset term. As can be seen from Table 7, with the center offset term the localization scores increase from 54.29% to 58.25% of mAP@0.5, and from 40.97% to 43.94% of Rank 1@0.7. This demonstrates that the center offset term helps our DemaFormer architecture predict localizations more precisely.

D Choice of Activation Functions
In this appendix, we adopt different activation functions for our DEMA attention in Section 3.1 and compare their performance. In detail, we experiment with the Tanh (Dubey et al., 2022), ReLU (Dubey et al., 2022), and GELU (Hendrycks and Gimpel, 2016) functions. We report the temporal language grounding performance with these activation functions on the QVHighlights dataset in Table 8. We observe that DemaFormer exhibits negligible performance fluctuation. These results demonstrate the robustness of our proposed DemaFormer with respect to the choice of activation functions.

E Specification of Energy Functions
We provide the formulation of energy functions we experiment with in Table 6.

F More prediction examples
In this appendix, we present more predictions of our DemaFormer model in Figures 6, 7, and 8.

Figure 1: Visualization (t-SNE) of moment-query representations of an input example by the previous best UMT baseline and our DemaFormer model. The target localizations are from the QVHighlights dataset labels. Detailed input content is provided in Figure 2.

Figure 2: A TLG example. To produce the output, we form the union of overlapped temporal boundaries in the groundtruth and the models' localized moments. The UMT output is about countryside scenes, which hardly align with the language query.

Figure 3: Illustration of the proposed DemaFormer. Our architecture comprises an encoder of N_e layers and a decoder of N_d layers. We designate the first L_v encoder outputs as moment-query representations, which become the input to the DemaFormer decoder.

Figure 4: Effect of the number of Langevin sampling steps upon localization performance on the VU and GA portions of the TVSum dataset and the skating portion of the YouTube Highlights dataset.

Text query: "The woman wearing sunglasses crosses a small colorful bridge over the river."

Table 1: Temporal language grounding results on the QVHighlights dataset. "w/ PT" denotes pre-training with ASR captions.

Table 2: Temporal language grounding results on the Charades-STA dataset.

Table 5: Performance comparison on the QVHighlights and Charades-STA datasets in ablative experiments of the DEMA and EBM components of DemaFormer.

Table 6: Performance comparison on the Charades-STA dataset in ablative experiments of the energy function choices.

Table 8: Performance comparison on the QVHighlights dataset in ablative experiments of the activation functions.