mTVR: Multilingual Moment Retrieval in Videos

We introduce mTVR, a large-scale multilingual video moment retrieval dataset containing 218K English and Chinese queries from 21.8K TV show video clips. The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles. Compared to existing moment retrieval datasets, mTVR is multilingual, larger, and comes with diverse annotations. We further propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages, via encoder parameter sharing and language neighborhood constraints. We demonstrate the effectiveness of mXML on the newly collected mTVR dataset, where it outperforms strong monolingual baselines while using fewer parameters. We also provide detailed dataset analyses and model ablations. Data and code are publicly available at https://github.com/jayleicn/mTVRetrieval


Introduction
Figure 1: An mTVR example in the Video Corpus Moment Retrieval (VCMR) task. The ground-truth moment is shown in a green box. Colors in the query text indicate whether the words are more related to the video (orchid), the subtitle (salmon), or both (orange). The query and the subtitle text are presented in both English and Chinese. A video corpus typically contains thousands of videos; for brevity, we only show 3 videos here.

The number of videos available online is growing at an unprecedented speed. Recent work (Escorcia et al., 2019; Lei et al., 2020) introduced the Video Corpus Moment Retrieval (VCMR) task: given a natural language query, a system needs to retrieve a short moment from a large video corpus. Figure 1 shows a VCMR example. Compared to the standard text-to-video retrieval task (Xu et al., 2016), VCMR allows more fine-grained, moment-level retrieval: it requires the system to not only retrieve the most relevant videos, but also localize the most relevant moments inside these videos. Various datasets (Krishna et al., 2017; Hendricks et al., 2017; Gao et al., 2017; Lei et al., 2020) have been proposed or adapted for the task. However, they are all created for a single language (English), though the application could be useful for users speaking other languages as well. Besides, it is also unclear
whether the progress and findings in one language generalize to another language (Bender, 2009). While there are multiple existing multilingual image datasets (Gao et al., 2015; Elliott et al., 2016; Shimizu et al., 2018; Pappas et al., 2016; Lan et al., 2017; Li et al., 2019), the availability of multilingual video datasets (Wang et al., 2019a; Chen and Dolan, 2011) is still limited. Therefore, we introduce mTVR, a large-scale, multilingual moment retrieval dataset with 218K human-annotated natural language queries in two languages, English and Chinese. mTVR extends the TVR (Lei et al., 2020) dataset by collecting paired Chinese queries and Chinese subtitle text (see Figure 1). We choose TVR over other moment retrieval datasets (Krishna et al., 2017; Hendricks et al., 2017; Gao et al., 2017) because TVR is the largest moment retrieval dataset and has the advantage of providing dialogues (in the form of subtitle text) as additional context for retrieval, in contrast to the pure video context in the other datasets. We further propose mXML, a compact, multilingual model that learns jointly from both English and Chinese data for moment retrieval. Specifically, on top of the state-of-the-art monolingual moment retrieval model XML (Lei et al., 2020), we enforce encoder parameter sharing (Sachan and Neubig, 2018; Dong et al., 2015), where the queries and subtitles from the two languages are encoded using shared encoders. We also incorporate a language neighborhood constraint (Wang et al., 2018) on the output query and subtitle embeddings, which encourages sentences with the same meaning in different languages to lie close to each other in the embedding space. Compared to separately trained monolingual models, mXML substantially reduces the total model size while improving retrieval performance, as we show in Section 4. Detailed dataset analyses and model ablations are also provided.

Dataset
The TVR (Lei et al., 2020) dataset contains 108,965 high-quality English queries from 21,793 videos from 6 long-running TV shows (provided by TVQA (Lei et al., 2018)). The videos are associated with English dialogues in the form of subtitle text. mTVR extends this dataset with translated dialogues and queries in Chinese to support multilingual multimodal research.

Data Collection
Dialogue Subtitles. We crawl fan-translated Chinese subtitles from subtitle sites. All subtitles are manually checked by the authors to ensure that they are of good quality and aligned with the videos. The original English subtitles come with speaker names from transcripts, which we map to the Chinese subtitles to ensure that the Chinese subtitles carry the same amount of information as the English version.
Query. To obtain Chinese queries, we hire human translators from Amazon Mechanical Turk (AMT). Each AMT worker is asked to write a Chinese translation of a given English query. Language is often ambiguous, hence we also present the original videos to the workers at translation time to help clarify query meaning via spatiotemporal visual grounding. The Chinese translations are required to have exactly the same meaning as the original English queries, and the translation should be made based on the aligned video content.
To facilitate the translation process, we provide machine-translated Chinese queries from Google Cloud Translation as references, similar to Wang et al. (2019b). To find qualified bilingual workers on AMT, we created a qualification test with 5 multiple-choice questions designed to evaluate workers' Chinese language proficiency and their ability to perform our translation task. We only allow workers who correctly answer all 5 questions to participate in our annotation task. In total, 99 workers finished the test and 44 passed, earning our qualification. To further ensure data quality, we also manually inspect the submitted results during the annotation process and disqualify workers with poor annotations. We pay workers $0.24 per three sentences, which results in an average hourly pay of $8.70. The whole annotation process took about 3 months and cost approximately $12,000.

Data Analysis
In Table 2, we compare the average sentence lengths and the numbers of unique words under different part-of-speech (POS) tags, between the two languages (English and Chinese) and between query and subtitle text. For both languages, dialogue subtitles are linguistically more diverse than queries, i.e., they have more unique words in all categories. This is potentially because the language used in subtitles is unconstrained human dialogue, while the queries are collected as declarative sentences referring to specific moments in videos (Lei et al., 2020). Comparing the two languages, the Chinese data is typically more diverse than the English data.
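The per-POS diversity statistics above boil down to counting distinct surface forms per tag. A minimal stdlib sketch (the toy tagged input and tag names are illustrative; the paper's analysis would rely on real English and Chinese POS taggers):

```python
from collections import defaultdict

def unique_words_by_pos(tagged_tokens):
    """Count unique words per POS tag.
    `tagged_tokens` is a list of (word, pos) pairs from any tagger."""
    vocab = defaultdict(set)
    for word, pos in tagged_tokens:
        vocab[pos].add(word)
    return {pos: len(words) for pos, words in vocab.items()}

# Toy example echoing the footnoted case: Chinese 长发 ('long hair')
# is a single noun, while English splits it into an adjective and a noun.
toy = [("长发", "NOUN"), ("long", "ADJ"), ("hair", "NOUN"), ("hair", "NOUN")]
counts = unique_words_by_pos(toy)  # → {'NOUN': 2, 'ADJ': 1}
```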

Method
Our multilingual moment retrieval model mXML is built on top of the Cross-modal Moment Localization (XML) (Lei et al., 2020) model, which performs efficient video-level retrieval at its shallow layers and accurate moment-level retrieval at its deep layers. To adapt the monolingual XML model to the multilingual setting in mTVR and improve its efficiency and effectiveness, we apply encoder parameter sharing and neighborhood constraints (Wang et al., 2018), which encourage the model to better utilize multilingual data to improve monolingual task performance while maintaining a smaller model size.
Query and Context Representations. We represent videos using ResNet-152 (He et al., 2016) and I3D (Carreira and Zisserman, 2017) features extracted every 1.5 seconds. We extract language features using pre-trained, then finetuned (on our queries and subtitles) RoBERTa-base models, for English (Liu et al., 2019) and Chinese (Cui et al., 2020), respectively. For queries, we use token-level features. For subtitles, we max-pool the token-level features every 1.5 seconds to align with the video features. We then project the extracted features into a low-dimensional space via a linear layer, and add learned positional encoding (Devlin et al., 2018) after the projection. We denote the resulting video features as E^v ∈ R^{l×d}, subtitle features as E^s_en, E^s_zh ∈ R^{l×d}, and query features as E^q_en, E^q_zh ∈ R^{l_q×d}, where l is the video length, l_q is the query length, and d is the hidden size. The subscripts en and zh denote English and Chinese text features, respectively.
Footnote 3: The differences might be due to the different morphemes of the two languages. E.g., the Chinese word 长发 ('long hair') is labeled as a single noun, but as an adjective ('long') and a noun ('hair') in English (Wang et al., 2019b).

Figure 2: Illustration of mXML's encoding process. Compared to monolingual models, mXML learns from the two languages simultaneously, and allows them to benefit each other via encoder parameter sharing and neighborhood constraints. We show the detailed encoding process of the model in the appendix (Figure 3).
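As a concrete sketch of this feature preparation step (the uniform token-to-clip binning and all shapes are simplifying assumptions; the actual pipeline aligns tokens to clips by subtitle timestamps, and the projection and positions are learned):

```python
import numpy as np

def maxpool_subtitle_tokens(token_feats, tokens_per_clip):
    """Max-pool token-level subtitle features into fixed clip bins
    (standing in for the 1.5 s alignment with video features)."""
    n_clips = -(-len(token_feats) // tokens_per_clip)  # ceiling division
    return np.stack([
        token_feats[i * tokens_per_clip:(i + 1) * tokens_per_clip].max(axis=0)
        for i in range(n_clips)
    ])

def project_with_positions(feats, W, pos_emb):
    """Linear projection to hidden size d, plus positional encodings."""
    return feats @ W + pos_emb[: len(feats)]

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 768))        # 12 RoBERTa-base token vectors
pooled = maxpool_subtitle_tokens(tokens, 4)  # (3, 768): one row per clip
W = rng.normal(size=(768, 256))            # projection to d = 256
pos = rng.normal(size=(64, 256))           # positions (random stand-in)
E_s = project_with_positions(pooled, W, pos)  # (3, 256), i.e., R^{l×d}
```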
Encoders and Parameter Sharing. We follow Lei et al. (2020) and use the Self-Encoder as the main component for query and context encoding. A Self-Encoder consists of a self-attention (Vaswani et al., 2017) layer, a linear layer, and a residual (He et al., 2016) connection followed by layer normalization (Ba et al., 2016). We use a Self-Encoder followed by a modular attention (Lei et al., 2020) to encode each query into two modularized query vectors q^v_lang, q^s_lang ∈ R^d (lang ∈ {en, zh}), for video retrieval and subtitle retrieval, respectively. For videos, we apply two Self-Encoders instead of a Self-Encoder and a Cross-Encoder as in XML, because we found this modification simplifies the implementation while maintaining performance. We use the outputs of the first and the second Self-Encoder, H^v_vr,lang, H^v_mr,lang ∈ R^{l×d}, for video retrieval and moment retrieval, respectively. Similarly, we have H^s_vr,lang, H^s_mr,lang ∈ R^{l×d} for subtitles. All Self-Encoders are shared across languages, e.g., we use the same Self-Encoder to encode both English and Chinese queries, as illustrated in Figure 2. This parameter sharing strategy greatly reduces the model size while maintaining or even improving model performance, as we show in Section 4.
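A minimal numpy sketch of a shared Self-Encoder applied to both languages (single-head attention and the weight initialization are illustrative simplifications of the described architecture, not the released implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SelfEncoder:
    """Self-attention + linear layer + residual + layer norm (a sketch)."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        self.Wq, self.Wk, self.Wv, self.Wo = (
            rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4)
        )

    def __call__(self, X):
        Q, K, V = X @ self.Wq, X @ self.Wk, X @ self.Wv
        attn = softmax(Q @ K.T / np.sqrt(X.shape[-1])) @ V
        H = X + attn @ self.Wo                       # residual connection
        mu = H.mean(axis=-1, keepdims=True)
        sd = H.std(axis=-1, keepdims=True)
        return (H - mu) / (sd + 1e-6)                # layer normalization

# One encoder, shared across languages: same weights encode both inputs.
enc = SelfEncoder(d=32)
H_en = enc(np.random.default_rng(1).normal(size=(5, 32)))  # English query
H_zh = enc(np.random.default_rng(2).normal(size=(7, 32)))  # Chinese query
```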
Language Neighborhood Constraint. To facilitate stronger multilingual learning, we add neighborhood constraints (Wang et al., 2018; Burns et al., 2020) to the model. This encourages sentences that express the same or similar meanings to lie close to each other in the embedding space, via a triplet loss. Given paired sentence embeddings e^i_en ∈ R^d and e^i_zh ∈ R^d, we sample negative sentence embeddings e^j_en ∈ R^d and e^k_zh ∈ R^d from the same mini-batch, where i ≠ j and i ≠ k. We use the cosine similarity function S to measure the similarity between embeddings. Our language neighborhood constraint can be formulated as:

L_nc = max(0, Δ + S(e^i_en, e^k_zh) - S(e^i_en, e^i_zh)) + max(0, Δ + S(e^j_en, e^i_zh) - S(e^i_en, e^i_zh)),    (1)

where Δ = 0.2 is the margin. We apply this constraint on both query and subtitle embeddings, across the two languages, as illustrated in Figure 2. For queries, we directly apply it on the query vectors q^v_lang, q^s_lang. For the subtitle embeddings, we apply it on H^s_vr,lang, H^s_mr,lang after max-pooling them in the temporal dimension.
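A toy numpy version of the constraint with in-batch negatives (a sketch in the spirit of the triplet formulation; the batching, sampling, and normalization over negatives are assumptions, not the released loss):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity S between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def neighborhood_loss(e_en, e_zh, margin=0.2):
    """Triplet-style neighborhood constraint over a mini-batch of paired
    English/Chinese sentence embeddings (row i of e_en and e_zh are
    translations of each other); negatives are in-batch rows j != i."""
    n, loss = len(e_en), 0.0
    for i in range(n):
        pos = cos(e_en[i], e_zh[i])            # paired (positive) similarity
        for j in range(n):
            if j == i:
                continue
            loss += max(0.0, margin + cos(e_en[i], e_zh[j]) - pos)  # zh negative
            loss += max(0.0, margin + cos(e_en[j], e_zh[i]) - pos)  # en negative
    return loss / (2 * n * (n - 1))            # mean over sampled triplets

# Perfectly aligned, mutually orthogonal pairs incur zero loss.
zero = neighborhood_loss(np.eye(3), np.eye(3))  # → 0.0
```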
Training and Inference. During training, we optimize the video retrieval scores with a triplet loss, and the moment scores with a cross-entropy loss. At inference, the two scores are aggregated into a final score for video corpus moment retrieval. See the appendix for details.
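The inference-time aggregation can be sketched as follows, assuming an additive combination in which α upweights the video-level score (the exact aggregation function follows Lei et al. (2020) and may differ in form; the probabilities here are toy values):

```python
import numpy as np

def moment_score(p_st, p_ed, t_st, t_ed):
    """Moment score from start/end probabilities (ConvSE-style outputs)."""
    return p_st[t_st] * p_ed[t_ed]

def combined_score(s_vr, p_st, p_ed, t_st, t_ed, alpha=20.0):
    """Aggregate video-retrieval and moment scores; the additive form
    with alpha weighting the video score is an assumption here."""
    return moment_score(p_st, p_ed, t_st, t_ed) + alpha * s_vr

p_st = np.array([0.1, 0.7, 0.2])   # toy start probabilities over 3 clips
p_ed = np.array([0.1, 0.2, 0.7])   # toy end probabilities
score = combined_score(0.5, p_st, p_ed, t_st=1, t_ed=2)  # 0.49 + 20 * 0.5
```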

Experiments and Results
We evaluate our proposed mXML model on the newly collected mTVR dataset, and compare it with several existing monolingual baselines. We also provide ablation studies evaluating our model design and the importance of each input modality (videos and subtitles).
Data Splits and Evaluation Metrics. We follow TVR (Lei et al., 2020) and split the data into 80% train, 10% val, 5% test-public, and 5% test-private. We report average recall (R@1) on the Video Corpus Moment Retrieval (VCMR) task. A predicted moment is correct if it has a high Intersection-over-Union (IoU) with the ground truth.
Baseline Comparison. Table 3 compares mXML with existing baselines. Given a natural language query, the goal of video corpus moment retrieval is to retrieve relevant moments from a large video corpus. The methods for this task can be grouped into two categories: (i) proposal-based approaches (MCN (Hendricks et al., 2017) and CAL (Escorcia et al., 2019)), which perform video retrieval on pre-segmented moments from the videos; (ii) retrieval+re-ranking methods (MEE (Miech et al., 2018)+MCN, MEE+CAL, MEE+ExCL (Ghosh et al., 2019), and XML (Lei et al., 2020)), where one approach is first used to retrieve a set of videos, and another approach is then used to re-rank the moments inside these retrieved videos to obtain the final moments. Our method mXML also belongs to the retrieval+re-ranking category. Across all metrics and both languages, we notice that retrieval+re-ranking approaches achieve better performance than proposal-based approaches, indicating that retrieval+re-ranking is potentially better suited for the VCMR task. Meanwhile, our mXML outperforms the strong baseline XML significantly while using fewer parameters. XML is a monolingual model, where a separate model is trained for each language. In contrast, mXML is multilingual, trained on both languages simultaneously, with parameter sharing and language neighborhood constraints to encourage multilingual learning. mXML prediction examples are provided in the appendix.
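The IoU criterion used in the evaluation above can be computed with a standard temporal-IoU sketch (the specific IoU thresholds used for scoring follow the TVR evaluation protocol):

```python
def temporal_iou(pred, gt):
    """Intersection-over-Union between two [start, end] moments (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction overlapping half of a 10 s ground-truth moment:
iou = temporal_iou((0.0, 10.0), (5.0, 15.0))  # → 1/3
```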
Ablations on Model Design. In Table 4, we present our ablation study on mXML. We use 'Baseline' to denote the mXML model without parameter sharing and the neighborhood constraint. Sharing encoder parameters across languages greatly reduces the number of parameters while maintaining (Chinese) or even improving (English) model performance.
Adding the neighborhood constraint does not introduce any new parameters, but brings a notable (p<0.06) performance gain for both languages. We hypothesize that this is because the learned information in the embeddings of the two languages is complementary: though the sentences in the two languages express the same meaning, their language encoders (Liu et al., 2019; Cui et al., 2020) are pre-trained differently, which may lead to different representations at the embedding level. In Table 5, we show a detailed comparison between mXML and its Baseline version, by query type. Overall, we notice that mXML performs similarly to the Baseline on 'video' queries, but shows a significant performance gain on 'subtitle' queries, suggesting that parameter sharing and the neighborhood constraint are more useful for queries that require more language understanding.
Ablations on Input Modalities. In Table 6, we compare mXML variants with different context inputs, i.e., video, subtitle, or both. We report their performance under the three annotated query types: video, sub, and video+sub. Overall, the model with both video and subtitle inputs performs best. The video-only model performs much better on the video queries than on the sub queries, while the subtitle-only model achieves higher scores on the sub queries than on the video queries.
In the appendix, we also present results on the 'generalization to unseen TV shows' setup.

Conclusion
In this work, we collect mTVR, a new large-scale, multilingual moment retrieval dataset. It contains 218K queries in English and Chinese, from 21.8K video clips from 6 TV shows. We also propose a multilingual moment retrieval model, mXML, as a strong baseline for the mTVR dataset. We show in experiments that mXML outperforms monolingual models while using fewer parameters.


Appendix

Model Details. The subscript lang ∈ {en, zh} is omitted for simplicity. The video retrieval score is optimized using a triplet loss similar to main text Equation (1). For moment retrieval, we first compute the query-clip similarity scores S_{q,c} ∈ R^l by matching the modularized query vectors with the contextualized video and subtitle features. Next, we apply the Convolutional Start-End Detector (ConvSE module) (Lei et al., 2020) to obtain start and end probabilities P_st, P_ed ∈ R^l. These scores are optimized using a cross-entropy loss. The single-video moment retrieval score for moment [t_st, t_ed] is computed from the start and end probabilities as S_mr(t_st, t_ed) = P_st(t_st) · P_ed(t_ed). Given a query q_i, the retrieval score for moment [t_st : t_ed] in video v_j is computed following the aggregation function in Lei et al. (2020), where α = 20 is used to assign a higher weight to the video retrieval scores. The overall loss is a simple summation of the video and moment retrieval losses across the two languages, and the language neighborhood constraint loss.

Figure 3: mXML overview. For brevity, we only show the modeling process for a single language (Chinese). The cross-language modifications, i.e., parameter sharing and neighborhood constraint, are illustrated in Figure 2. This figure is edited from Figure 4 in Lei et al. (2020).

Table 1: Comparison of moment retrieval datasets (#Queries/#Videos).
Dataset | Domain | #Queries/#Videos
DiDeMo (Hendricks et al., 2017) | Flickr | 41.2K/10.6K
ActivityNet Captions (Krishna et al., 2017) | Activity | 72K/15K
CharadesSTA (Gao et al., 2017) | Activity | 16.1K/6.7K
How2R (Li et al., 2020) | Instructional | 51K/24K
TVR (Lei et al., 2020) | TV show | 109K/21.8K
mTVR | TV show | 218K/21.8K

Generalization to Unseen TV Shows. We hold out one TV show (Friends) for testing, and train the model on all the other 5 TV shows. For comparison, we also include a model trained in a 'seen' setting, where we use all 6 TV shows, including Friends, for training.
To ensure that the models in these two settings are trained on the same number of examples, we downsample the examples in the seen setting to match the unseen setting. The results are shown in Table 7. We notice that mXML achieves reasonable performance even though it does not see a single example from the TV show Friends. Meanwhile, the gap between the unseen and seen settings is still large; we encourage future work to further explore this direction.