Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval

Video corpus moment retrieval (VCMR) is the task of retrieving the most relevant video moment from a large video corpus using a natural language query. For narrative videos, e.g., dramas or movies, holistic understanding of temporal dynamics and multimodal reasoning is crucial. Previous works have shown promising results; however, they relied on expensive query annotations for VCMR, i.e., the corresponding moment intervals. To overcome this problem, we propose a self-supervised learning framework: Modal-specific Pseudo Query Generation Network (MPGN). First, MPGN selects candidate temporal moments via subtitle-based moment sampling. Then, it generates pseudo queries exploiting both visual and textual information from the selected temporal moments. Through the multimodal information in the pseudo queries, we show that MPGN successfully learns to localize video corpus moments without any explicit annotation. We validate the effectiveness of MPGN on the TVR dataset, showing competitive results compared with both supervised and unsupervised models.


Introduction
The increased interest in video understanding has gathered attention for solving related tasks such as video captioning (Krishna et al., 2017), video question answering (Tapaswi et al., 2016; Lei et al., 2018; Kim et al., 2018), and video retrieval (Xu et al., 2016) over the past few years. Video corpus moment retrieval (VCMR) (Escorcia et al., 2019) is one of the challenging video understanding tasks, in which a model should 1) search for a related video and 2) localize the corresponding moment given a query sentence in a large video corpus.
Prior works have shown promising performance in VCMR using supervised (Lei et al., 2020; Zhang et al., 2020, 2021), weakly-supervised (Yoon et al., 2021), and pre-training (Li et al., 2020; Zhou et al., 2021) methods. Despite such accomplishments, selecting a temporal moment in a video (start time, end time) and generating the corresponding query sentence to train such models require overwhelming amounts of human labor. To annotate these videos, humans first need to understand the diverse information in the video, and then select the candidate temporal moment and generate corresponding queries.
The challenges of VCMR are twofold: 1) Considering that a large number of human annotations are required, an efficient approach is needed to reduce the annotation cost. 2) Multimodal videos (e.g., dramas or movies) generally contain rich interactions between characters, which widely exist but have rarely been studied in the VCMR task. We introduce a novel framework to tackle these challenges: Modal-specific Pseudo Query Generation Network (MPGN). Our inspiration comes from previous works (Nam et al., 2021; Jiang et al., 2022; Changpinyo et al., 2022) that generate pseudo queries to solve their target task in an unsupervised manner. To design our framework, we consider two research questions: 1) What is a good way to select a temporal moment that can include characters' interactions? 2) What information should be considered when generating a pseudo query that sufficiently expresses the characters' interactions within the corresponding temporal moment?
First, we select a temporal moment based on where the topic of the subtitles shifts, considering the conversations between the characters in the target videos. Experimental results show that the proposed subtitle-based moment sampling method performs best among competitive strategies.
For pseudo query generation, our framework produces two modal-specific pseudo queries as follows: 1) Focusing on visual information, we extract descriptive captions from a pre-trained image captioning model and the character names from the subtitles for the corresponding video frames. Then, we apply a visual-related query prompt module to generate queries that bridge the appearing character names and the captions in videos. 2) Focusing on textual information, we exploit a pre-trained dialog summarization model to generate a textual query that cohesively captures the interactions among characters. Since raw subtitles are often noisy and informal, we summarize the corresponding subtitles in a temporal moment and use the summary as a pseudo query.
Our framework has several benefits. First, it exploits multimodal video information to generate visual and textual pseudo queries, reducing the annotation cost for both queries and the corresponding moment intervals. Second, it generates high-quality modal-specific pseudo queries and shows significant performance gains.
Our contributions can be summarized as follows: • To the best of our knowledge, we are the first to propose an unsupervised learning framework, MPGN, for the VCMR task.
• We propose the subtitle-based moment sampling method to define the temporal moment, and generate modal-specific pseudo queries exploiting both visual and textual information from the selected temporal moments.
• We experiment on the TVR benchmark to verify the effectiveness of our approach, and ablation studies validate each component of the proposed framework.
Related Work

Single Video Moment Retrieval
Single video moment retrieval (SVMR) aims to determine the temporal moments in a video that are related to given natural language queries. Previous works have made remarkable progress based on fully-supervised learning (Gao et al., 2017; Mun et al., 2020; Zeng et al., 2020). However, since the annotations for SVMR are expensive, there have been attempts (Ma et al., 2020; Lin et al., 2020; Mithun et al., 2019) to address the annotation cost in a weakly-supervised manner. Unfortunately, substantial annotation costs still remain. Consequently, Liu et al. (2022) proposed DSCNet, which performs SVMR without paired supervision. Although these SVMR approaches are successful, they are unsuitable for VCMR since they do not consider the huge computational cost involved in retrieving a video from the video corpus.

Video Corpus Moment Retrieval
Video corpus moment retrieval (VCMR) extends the number of video sources from a single video (SVMR) to a collection of untrimmed videos (Escorcia et al., 2019). Previous methods have addressed VCMR in a supervised manner (Escorcia et al., 2019; Lei et al., 2020; Zhang et al., 2020, 2021). However, these approaches require fully-annotated data (e.g., paired video queries and the corresponding interval timestamps). To alleviate this, Yoon et al. (2021) attempt to solve VCMR in a weakly-supervised setting where only paired videos and queries are available while the corresponding moment interval is unknown. While previous works require paired annotations for training, our framework does not require any annotation.

Pseudo Query Generation
Unsupervised image captioning methods (Laina et al., 2019; Feng et al., 2019) attempt to remove the dependency on paired image-sentence datasets. However, these methods are not readily applicable to our video corpus. The most similar work to ours is PSVL (Nam et al., 2021), which was proposed for the zero-shot SVMR task. They construct pseudo queries in a specific form consisting of a set of noun and verb words. However, narrative videos contain complex interactions between characters; it is inadequate to understand such videos with a limited set of noun and verb words. Unlike the previous methods, our framework can generate pseudo queries beyond these restrictions. In addition, we generate two pseudo queries that are specific to each modality.

Method
In this section, we introduce our framework, Modal-specific Pseudo Query Generation Network (MPGN), in detail (see Figure 2). Given a video and its subtitles, we first describe how MPGN selects the candidate temporal moments. Then, we describe how MPGN generates modal-specific pseudo queries: using the visual-related prompt module for visual information and dialog summarization for textual information. We denote the generated pseudo query from each modality as the visual pseudo query and the textual pseudo query, respectively. Finally, we show how the generated pseudo queries are used in the training stage.

Subtitle-based Moment Sampling
MPGN samples a target temporal moment from a video to generate the corresponding modal-specific pseudo queries. Previous works have proposed to sample the temporal moments by comparing the visual similarity between adjacent frames (Nam et al., 2021; Jain et al., 2020) or by sliding windows (Lin et al., 2020). However, such approaches are inappropriate for narrative videos since distinct and dissimilar visual frames can appear depending on the transitions of camera angles or speaking characters, even within a single conversation. Motivated by how humans understand narrative videos, we propose a subtitle-based moment sampling method that determines the start and end timestamps from the sampled subtitles.
We denote the list of subtitles in a target video as S = [s_1, s_2, ..., s_n], where n is the number of subtitles. We sample l consecutive subtitles from S to select the temporal moment. We empirically found that if the length of the candidate temporal moment is too short or too long, the generated pseudo queries are poor. Hence, we set a minimum number l_min and a maximum number l_max, and uniformly sample l from {l_min, ..., l_max}. After choosing l, we uniformly sample s_start from {s_1, ..., s_{n-l}}, and s_end is straightforwardly determined by s_start and l. Finally, the sampled subtitles are defined as S = {s_start, ..., s_end}.
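The sampling procedure above can be sketched in a few lines; the function and variable names are our own, assuming each subtitle carries its start and end timestamps:

```python
import random

def sample_moment(subtitles, l_min=2, l_max=5):
    """Sample l consecutive subtitles and return the moment's boundaries.

    `subtitles` is a list of (start_sec, end_sec, text) tuples; the names
    and tuple layout are illustrative assumptions, not the paper's code.
    """
    n = len(subtitles)
    # Uniformly sample the moment length l from {l_min, ..., l_max},
    # clamped so the moment fits inside the subtitle list.
    l = random.randint(l_min, min(l_max, n))
    # Uniformly sample the index of the first subtitle s_start.
    start_idx = random.randint(0, n - l)
    window = subtitles[start_idx:start_idx + l]
    # The temporal moment spans from the first sampled subtitle's
    # start time to the last sampled subtitle's end time.
    t_start = window[0][0]
    t_end = window[-1][1]
    return (t_start, t_end), window
```

Each sampled window then feeds both pseudo query generators described below.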

Generating Modal-specific Pseudo Query
In general, the story of narrative videos can be represented through the visual (e.g., action, place) and textual (e.g., dialog) information related to the characters. Although they share the goal of comprehending a specific situation in a narrative video, visual and textual information can offer different perspectives. For example, if two characters are having a conversation in a video, visual features can represent that the two characters are talking but cannot provide the details of the conversation. Meanwhile, textual features may provide specific details of the conversation, but not the characters' locations or actions. Therefore, we generate pseudo queries for both modalities so the model can comprehensively understand the situation from diverse perspectives.

Visual Pseudo Query Generation
Inspired by the success of prompt engineering in vision-language tasks (Radford et al., 2021; Yao et al., 2021; Jiang et al., 2022), we adopt a visual-related prompt module to generate visual pseudo queries. To express visual information in the temporal moment, it depicts the situation of the scene by focusing on the people who appear in the temporal moment. The proposed visual-related prompt module combines this visual information to generate a visual pseudo query.
For every sampled temporal moment in a video, let F be the set of frames and S the subtitles. First, we detect the speaker names in the subtitles as shown in Figure 2-(c), extract n unique character names C = {c_1, c_2, ..., c_n} from S, and generate a sentence with the characters' names according to the templates shown in Table 1. We empirically found that in the n > 1 case, the prompt "{Character's names} are talking together" performs better than the prompt "{Character's names} having a conversation". If we cannot identify any character name (n = 0) in the moment, we fill the character name with "Someone". Then, we employ a pre-trained image captioning model (Li et al., 2022) to generate an image caption for the middle frame of F. Finally, we concatenate these two sentences, the character-name template and the image caption, to generate the visual pseudo query (e.g., "Phoebe, Rachel, and Monica are talking together. A man is standing next to a woman in a living room.").
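The template logic above can be sketched as follows; this is a minimal illustration of the Table 1 templates, not the authors' code, and the function name is ours:

```python
def visual_pseudo_query(names, caption):
    """Combine detected character names with an image caption using
    the Table 1 templates; `caption` would come from the pre-trained
    image captioning model applied to the moment's middle frame."""
    if len(names) == 0:
        # No speaker detected in the moment (n = 0 case).
        prefix = "Someone is speaking."
    elif len(names) == 1:
        prefix = f"{names[0]} is speaking."
    else:
        # n > 1 case: "A, B, and C are talking together."
        joined = ", ".join(names[:-1]) + (
            "," if len(names) > 2 else "") + f" and {names[-1]}"
        prefix = f"{joined} are talking together."
    # Concatenate the name template with the image caption.
    return f"{prefix} {caption}"
```

For example, three detected names plus a caption yield a query like the one quoted above.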

Textual Pseudo Query Generation
We extract the semantic meaning from subtitles for the textual pseudo query. However, raw subtitles are informal and noisy for the model to infer from. Recently, Engin et al. (2021) showed remarkable progress in video question answering by using dialog summarization. They convert the dialog into text descriptions in several steps (per scene, whole episode) and use them to improve video-text representations in a supervised manner. Motivated by this, we denoise the subtitles by dialog summarization. To do this, we use the transformer-based BART-Large (Lewis et al., 2019) pre-trained on the SAMSum corpus (Gliwa et al., 2019). Finally, we obtain the textual pseudo queries, which capture the semantic meaning of the dialog, by applying this pre-trained language model to the subtitles S.
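A minimal sketch of this pipeline, assuming subtitles arrive as "Speaker: utterance" strings; the trivial stub summarizer below stands in for the BART-Large model fine-tuned on SAMSum (which in practice could be loaded through a Hugging Face summarization pipeline) so the sketch stays self-contained:

```python
def subtitles_to_dialog(subtitle_lines):
    """Format raw subtitle lines as 'Speaker: utterance' dialog turns,
    the input format expected by a SAMSum-style dialog summarizer.
    Splitting speaker and text on ':' is an assumption about the
    subtitle format, not the paper's exact preprocessing."""
    turns = []
    for line in subtitle_lines:
        speaker, _, text = line.partition(":")
        turns.append(f"{speaker.strip()}: {text.strip()}")
    return "\n".join(turns)

def summarize(dialog):
    """Placeholder for the BART-Large + SAMSum dialog summarizer;
    it only echoes the speakers to keep the example runnable."""
    speakers = {t.split(":", 1)[0] for t in dialog.split("\n")}
    return f"{' and '.join(sorted(speakers))} are having a conversation."
```

The output of the real summarizer for a moment's dialog is used directly as the textual pseudo query.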

Video-Language Model
Our video-language model consists of three components (see Appendix A.2 for details).
Evaluation Metrics We follow the settings of previous methods (Lei et al., 2020). We evaluate the models on the VCMR task as well as its two subtasks, VR and SVMR. For SVMR and VCMR, we use Recall@k with IoU=0.7 as the main evaluation metric. For VR, we report Recall@k as the evaluation metric.
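The evaluation protocol can be sketched as follows; the data layout (ranked lists of (video_id, start, end) triples) is our assumption, not the official TVR evaluation code:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, gt, k, iou_thresh=0.7):
    """Return 1.0 if any of the top-k predicted (video_id, start, end)
    moments hits the ground-truth video with IoU >= iou_thresh, else
    0.0. Averaged over all queries, this gives R@k."""
    for vid, start, end in predictions[:k]:
        if vid == gt[0] and temporal_iou((start, end), gt[1:]) >= iou_thresh:
            return 1.0
    return 0.0
```

For VR, the same check is applied to the video id only, ignoring the interval.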

Implementation Details
We extract 2048D ResNet-152 (He et al., 2016) and 2304D SlowFast (Feichtenhofer et al., 2019) features at 3 FPS and max-pool the frame features every 1.5 seconds. Each video feature is normalized by its L2-norm, and the two are concatenated for the final video feature. We extract the textual features via a 12-layer pre-trained RoBERTa (Liu et al., 2019). Note that we fine-tune RoBERTa with the MLM objective using only the subtitles in the TVR train split, excluding the queries. For subtitle-based moment sampling, we set l_min and l_max to 2 and 5, respectively. We sample 130K temporal moments, and each video has an average of 7 temporal moments. Each temporal moment has two pseudo queries; therefore, 260K pseudo queries are generated. We train our model in the unsupervised setting with 87K pseudo queries, the same size as the TVR train split, for 50 epochs with a batch size of 128. For the supervised setting, we train our model with the 260K pseudo queries and the annotated queries in the TVR train split for 70 epochs with the same batch size.
All our experiments are run on a single Quadro RTX 8000. Our video-language model is optimized with AdamW with an initial learning rate of 1.0 × 10^-4. The objective function of our model follows Zhang et al. (2021).
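The video feature pre-processing described above can be sketched with NumPy; the array shapes and function names are our assumptions:

```python
import numpy as np

def pool_and_fuse(resnet_feats, slowfast_feats, fps=3, window_sec=1.5):
    """Max-pool per-frame features over windows of `window_sec` seconds,
    L2-normalize each stream, and concatenate.

    resnet_feats:   (num_frames, 2048) array of ResNet-152 features
    slowfast_feats: (num_frames, 2304) array of SlowFast features
    returns:        (num_windows, 2048 + 2304) fused video features
    """
    win = int(round(fps * window_sec))          # frames per pooling window
    def pool(x):
        n = (len(x) // win) * win               # drop the ragged tail
        pooled = x[:n].reshape(-1, win, x.shape[1]).max(axis=1)
        norm = np.linalg.norm(pooled, axis=1, keepdims=True)
        return pooled / np.clip(norm, 1e-8, None)   # L2-normalize
    return np.concatenate([pool(resnet_feats), pool(slowfast_feats)], axis=1)
```

This is a sketch of the described pipeline, not the released feature-extraction code.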
Compared with the unsupervised and weakly-supervised methods, MPGN outperforms even the baselines with stronger supervision settings. Despite pre-training on large-scale video datasets, HERO showed low performance overall. This result shows that HERO relies heavily on fine-tuning, and using subtitles instead of queries in the pre-training stage may be inappropriate. We show how the performance degrades when we use subtitles as queries instead of our pseudo queries in Section 4.4. As previous studies (Lei et al., 2020; Yoon et al., 2021) mentioned, retrieval+re-ranking methods show low performance since they consist of models targeting subtasks of VCMR. WMRN shows the best performance among them, but it generates multi-scale proposals from a large video corpus to predict temporal moments. This wasteful strategy cannot handle VCMR efficiently.
Surprisingly, MPGN outperforms the current state-of-the-art methods in supervised settings. Although HERO and CUPID are pre-trained with a large amount of video-text pairs (136M), MPGN only uses 260K pseudo queries for training. Since we focus on generating meaningful pseudo supervision for VCMR in this paper, we do not study pre-training tasks or model architectures that could improve performance.

Ablation Study
To investigate the importance of each component in MPGN, we conduct extensive ablation experiments. For a fair comparison, we use the same amount of pseudo queries as the original supervision for all experiments. We give detailed discussions in the subsections.

Effect of Modal-specific Pseudo Query
To validate the effectiveness of modal-specific pseudo queries, we experiment with two baselines: 1) a model trained on only visual pseudo queries (VPQ), and 2) a model trained on only textual pseudo queries (TPQ). Our approach uses both (VPQ + TPQ) for training. The results in Table 3 show that using both modal-specific queries improves model performance across all metrics. We conclude that providing information from both modalities helps the model understand the video better.

Effect of Video-Language Model
We further compare our video-language model with other baseline models, XML (Lei et al., 2020) and ReLoCLNet (Zhang et al., 2021). We also report the performance of each model in both supervised and unsupervised settings in Table 4.
We confirm that the choice of video-language backbone contributes to the performance improvement and our MPGN framework is agnostic to the video-language backbones used for timestamp prediction.

Effect of Temporal Moment Sampling Method
We evaluate our temporal moment sampling method with various l_min and l_max and compare it with the feature-based temporal moment sampling method proposed by Nam et al. (2021) in Table 5. We see that using a single subtitle (l=1) shows the lowest performance in all metrics. From this result, we can safely say that a single subtitle cannot provide enough local temporal information for the model since it is grounded on a very short moment. Finally, we find the best performance when l_min and l_max are 2 and 5, respectively. We believe that uniformly sampled temporal moments with appropriate l_min and l_max give varying lengths compared to fixed-size temporal moments. The feature-based method computes all combinations of consecutive frame clusters and samples the temporal moment following a uniform distribution over them. Our approach not only shows better performance than the feature-based method but is also more efficient: to select the 100K temporal moments, we take 13.48s, whereas the feature-based method consumes 328.05s.

Comparison with Other Types of Pseudo Query
To validate the effectiveness of our pseudo queries, we use simplified sentences and dialog as the baseline methods for our experiments. Nam et al. (2021) proposed a simplified sentence that consists of nouns and verbs for pseudo query generation. For a fair comparison, we re-implemented their approach as faithfully as possible; the detailed procedure can be found in Appendix A.1. We also added a dialog baseline that uses subtitles as pseudo queries without applying dialog summarization. Note that all the methods generate pseudo queries from the same temporal moments. As shown in Table 6, our model easily surpasses the other baseline methods in all metrics. We believe that a simplified sentence leads to poor results on our dataset because not only does the sentence have no textual information, but it also loses a lot of useful information since it consists of only nouns and verbs. The dialog baseline achieves the lowest score across all metrics. As mentioned in Section 4.3, this implies that it is difficult for the model to understand the videos with raw subtitles. With these experiments, we demonstrate that our generated pseudo queries represent more meaningful information in videos than the other baseline methods.

Qualitative Analysis
In Figure 4, we provide two qualitative examples of moments predicted by our model and the ablation models. In the first case, our model successfully finds the temporal moment, but the others do not. This result shows that both modal-specific queries play an essential role in VCMR. In the second case, our model and VPQ localize the proper temporal moment, but TPQ fails. We hypothesize that the character name and visual information in the visual pseudo query help to find the temporal moment.
We visualize four generated pseudo queries on the TVR dataset in Figure 3. To validate the scalability of our approach, we show the pseudo queries generated on the DramaQA (Choi et al., 2021) dataset in Appendix A.3.

Conclusion
In this paper, we present a novel framework, the Modal-specific Pseudo Query Generation Network (MPGN), for video corpus moment retrieval in an unsupervised manner. Our framework uses a subtitle-based temporal moment sampling method in which the timestamps (start time, end time) are determined from the sampled subtitles. After that, we generate pseudo queries from the candidate temporal moments by using the visual-related prompt module and the dialog summarization transformer, respectively. Via pseudo queries containing the essential information of each modality, we improve our model's comprehension of local temporal information and semantic meaning in multimodal videos. We conduct a comprehensive ablation analysis to prove the effectiveness of our approach. For future work, we plan to extend the pseudo query generation method so that it can be applied to several video understanding tasks without using manual supervision.

Limitation
Our framework requires the subtitles to include the name of the speaker. Therefore, it is not directly applicable to videos where the speaker is not specified (e.g., YouTube videos). Also, as our framework utilizes verbal conversation between characters, it cannot guarantee performance on videos that do not include dialog (e.g., videos of cooking, sports, etc.). We hypothesize that there exists some domain discrepancy between video benchmarks. We leave extending our framework to diverse types of videos as future work.

A Appendix
We provide additional results omitted from the main paper due to the page limit.
A.1 Re-implementation of VerbBERT
Nam et al. (2021) proposed VerbBERT to predict verbs from contextual nouns. We collect the dataset from a corpus that describes a person's actions. Then we select only sentences that contain the word 'person' and extract only the nouns and verbs from those sentences. After this step, about 10,000 sentences remain.
For training, we randomly split the data into train and test sets with a ratio of 9:1. We fine-tune the pre-trained RoBERTa model on the above sentences with the MLM objective. After 20 epochs, there is no further improvement in perplexity, so we stop training, as more epochs might cause over-fitting (see Figure 5). For a given sentence "person [mask] bicycle", VerbBERT predicts '[mask]' as 'ride'. To generate a pseudo query (simplified sentence) for the TVR dataset, we predict the verb from the detected objects and replace 'person' with a character name. We visualize the generated simplified sentences in Figure 6.

A.2 Details of Our Video-Language Model
In this section, we provide more details about the video-language model.
Model Architecture The video encoder consists of a feed-forward network and three transformer blocks for visual and subtitle representations. We apply the multimodal processing module (Gao et al., 2017) to each output of the last transformer block. The query encoder has a feed-forward network, two transformer blocks, and modularized vectors. The modularized vectors decompose the query into two query vectors, each interacting with a visual or subtitle representation. Our localization module consists of two 1D-CNN layers with ReLU that predict the start and end probabilities, respectively.
Objective Function The overall training loss follows Zhang et al. (2021) and consists of 1) a video retrieval loss (L_vr), 2) a video moment retrieval loss (L_vmr), 3) a video contrastive loss (L_vcl), and 4) a frame contrastive loss (L_fcl).
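Written out, and following the ReLoCLNet formulation, the total loss combines these four terms; the weighting hyperparameters λ are our notation for how such terms are typically balanced, since the exact coefficients are set in Zhang et al. (2021):

```latex
\mathcal{L} \;=\; \mathcal{L}_{vr} \;+\; \mathcal{L}_{vmr}
            \;+\; \lambda_{1}\,\mathcal{L}_{vcl}
            \;+\; \lambda_{2}\,\mathcal{L}_{fcl}
```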

A.3 Scalability of MPGN
Unfortunately, since the TVR dataset is the only multimodal video dataset for VCMR, we cannot evaluate our framework on other benchmarks. To validate the scalability of our framework, we generate pseudo queries on another multimodal video dataset, DramaQA, which is built upon a TV drama and contains QA pairs for the video question answering task. We visualize the pseudo queries generated on the DramaQA dataset in Figure 7.

A.4 Statistics of Pseudo Query Dataset
In Figure 8, we show the distribution of temporal moment lengths (left) and the number of characters in temporal moments (right). The average length of a temporal moment is 12.3 seconds. Even accounting for cases where the character name is omitted from the subtitle, we assume most temporal moments include more than one character. Furthermore, visual pseudo queries and textual pseudo queries consist of an average of 14.9 words and 12.2 words, respectively.

A.5 Experiment on the Various Size of Pseudo Query Dataset
To investigate the quality of the pseudo queries, we train the model on pseudo query datasets of different scales. We construct these subsets such that larger subsets include the smaller ones. We report the AveR VCMR score (the average of R@1, R@5, and R@10 with IoU=0.5) as the metric to evaluate model performance.
As shown in Figure 9, the performance increases in proportion to the scale of the pseudo query dataset. However, we observed that there is no significant increase in performance after a certain point.

Method        VCMR R@1  R@10  R@100
VPQ (w/o s)   0.62      2.4   6.83
VPQ           0.73      2.57  7.61

Table 7: Ablation study on the effect of the speaker's name in the visual pseudo query. "(w/o s)" means trained with visual pseudo queries that do not contain the speaker's name.
A.6 Effect of Speaker's Name in Visual Pseudo Query
To investigate how the model performs without speaker information, we report an experiment with our model trained on only visual pseudo queries that do not contain the speaker's name.
As shown in Table 7, the presence of the speaker's name is helpful but not essential.

A.7 More Visualization of Modal-specific Pseudo Queries
We visualize generated pseudo queries on the TVR dataset in Figure 10. As we can see, most pseudo queries properly contain the multimodal information of the temporal moment in the video. However, in some cases, the speaker is missing from the subtitles, such as in Figure 10-(h). In these cases, it is difficult for a textual pseudo query to completely convey the textual information in the subtitles.

Figure 2 :
Figure 2: (a) Given a video and its aligned subtitles as input, our goal is to generate pseudo queries and train the model using them. Our framework consists of three stages: (b) define the temporal moments, (c) generate the modal-specific pseudo queries, and (d) use them for training our video-language model.
Case    Template
n = 0   Someone is speaking.
n = 1   c_1 is speaking.
n > 1   c_1 and c_2 are talking together. / c_1, c_2 and c_3 are talking together. / etc.
In Figure 3-(a), the visual pseudo query contains the speaker name and the caption of the scene, and the textual pseudo query describes well what Monica is saying. Although most subtitles are very short and less meaningful in Figure 3-(d), our framework still captures the meaning of the dialog.

Figure 3 :
Figure 3: Four visualization examples of pseudo queries on the TVR dataset. We show the candidate temporal moment and the modal-specific pseudo queries. All the pseudo queries except case (c) contain the character-centered context and describe the video moment well. (c) is a failure case due to a missing character name.

Figure 4 :
Figure 4: Visualization of the predictions of MPGN and the ablation models. "GT" means the ground-truth timestamp. The predictions of the models are presented below it. The models trained on visual pseudo queries and textual pseudo queries are called VPQ and TPQ, respectively. We denote the model trained on both pseudo queries as VPQ + TPQ.

Figure 7 :
Figure 7: Four visualization examples of pseudo queries on the DramaQA dataset. All of the generated pseudo queries sufficiently describe the temporal moment of the video. In Figure 7-(c), the textual pseudo query describes well the situation in which Haeyoung1 and Taejin's wedding is canceled.

Figure 8 :
Figure 8: Statistics of the generated pseudo queries.

Figure 9 :
Figure 9: AveR score on TVR according to various amounts of pseudo queries. 100% of the pseudo query dataset contains 138K queries for training (the 87K queries in the TVR dataset are equivalent to 80% of the pseudo query dataset).

Table 1 :
Templates for character names. c_n represents a character name in the set of character names C = {c_i}_{i=1}^{n}, and n represents the number of characters.
Note that we do not use any paired annotations (e.g., pairs of query sentences and temporal moments of a video) in the training stage.
Inference We directly use the annotated query sentences during the inference stage, without applying the visual-related prompt module, for a fair comparison.
The videos are collected from TV shows, and each video is, on average, 76.2 seconds long and includes subtitles. There are five queries per video, containing an average of 13.4 words. The average length of moments in the videos is 9.1 seconds. We follow the same split of the dataset as in TVR for fair comparisons. We re-emphasize that we do not use any annotations during the training stage.
Given the two modal-specific pseudo queries, we alternately train our model on them. At each training step, we randomly (uniformly) select one of the modal-specific pseudo queries. Our training strategy can be cast as data augmentation, encouraging the model to learn the multimodal information robustly.
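This alternating strategy amounts to a per-step uniform choice between the two queries; a minimal sketch, where the dict layout and key names are our own assumption:

```python
import random

def next_training_query(moment, p_visual=0.5):
    """At each training step, uniformly pick one of the two
    modal-specific pseudo queries for a sampled temporal moment.
    `moment` is assumed to hold both generated queries."""
    if random.random() < p_visual:
        return moment["visual_pseudo_query"]
    return moment["textual_pseudo_query"]
```

Over many steps, the model thus sees both query styles for the same moment, which is what gives the data-augmentation effect described above.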

Table 2 :
Performance comparison with various models and supervision levels. We conduct experiments on the TVR validation and test-public sets; "-" means that the result for the metric is not reported in the original paper.

Table 4 :
Ablation study on the effects of video-language models. ✓ indicates the dataset used to train the model (P = pseudo queries, T = training set of the TVR dataset).

Table 5 :
Ablation study on the effect of the temporal moment sampling method. If l_min and l_max are equal, we sample a fixed number of subtitles.

Table 6 :
Ablation study on the effect of pseudo query type.