Movie101: A New Movie Understanding Benchmark

To help the visually impaired enjoy movies, automatic movie narrating systems are expected to generate accurate, coherent, and role-aware plot narrations when no actors are speaking. Existing works benchmark this challenge as a normal video captioning task via simplifications such as removing role names and evaluating narrations with ngram-based metrics, which makes it difficult for automatic systems to meet the needs of real application scenarios. To narrow this gap, we construct a large-scale Chinese movie benchmark named Movie101. Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking. External knowledge, such as role information and movie genres, is also provided for better movie understanding. Besides, we propose a new metric called Movie Narration Score (MNScore) for movie narrating evaluation, which achieves the best correlation with human evaluation. Our benchmark also supports the Temporal Narration Grounding (TNG) task to investigate clip localization given text descriptions. For both tasks, our proposed methods leverage external knowledge well and outperform carefully designed baselines. The dataset and code are released at https://github.com/yuezih/Movie101.


Introduction
According to recent reports, an estimated 285 million people worldwide were visually impaired as of 2020 (He et al., 2020). While regulations are in place to ensure increased access for these audiences to the culturally dominant movies and TV shows on popular media platforms, technologies that provide them with a genuine viewing experience are becoming increasingly important. Audio description (AD, also known as video description) is one such technology: it enables visually impaired audiences to experience a movie or TV show by hearing what is happening on-screen. However, producing movie narration scripts is not trivial, often requiring a professional writer to work through the original movie. The high cost of narration production (Lakritz and Salway, 2006) greatly hinders the creation of movies with AD and thus limits the opportunities for visually impaired users to experience movies.
To address this issue, attempts have been made to automate AD production. Datasets of movies with ADs have been constructed to support research on automatic AD generation, including the MPII-MD dataset (Rohrbach et al., 2015) and the M-VAD dataset (Torabi et al., 2015), with shot-level ADs or scripts aligned to the visual content of the movies. Consequently, different solutions for automatic movie narrating have been proposed based on these datasets (Rohrbach et al., 2017).
However, existing benchmarks suffer from several limitations. Firstly, there is a gap between the designed tasks and the actual movie narration scenario. These tasks mainly focus on generating single-sentence narrations for shots of a few seconds. They cannot support the generation of coherent narrations for longer plots, which is critical for the visually impaired to better understand the movie, and the timestamps of these shots are carefully annotated, which is difficult to replicate for new movies in real applications. Meanwhile, these tasks treat the distinctive movie narrating task as a normal video captioning task through simplifications such as replacing role names with SOMEONE, making it impossible to connect roles to plots. Secondly, these benchmarks evaluate the generated narrations with ngram-based metrics, which can over-penalize a semantically correct but textually inconsistent narration, especially when only one reference is available. In addition, the existing datasets are all in English, yet about one-fifth of the world's population speaks Chinese as their mother tongue, of whom more than 17 million are visually impaired (Yu and Bu, 2021). Building a Chinese movie narration benchmark is therefore necessary.
To address the limitations of existing narrating benchmarks, in this work we propose a new benchmark with 101 Chinese movies for movie understanding, named Movie101. We collect the movies from the barrier-free channel of the Xigua Video platform (https://www.ixigua.com/channel/barrier_free), where normal movies are remastered with ADs. Through an automatic process followed by manual correction, we obtain the ADs and actor lines from the raw videos. We also crawl rich meta information relevant to the movies. In total, Movie101 contains 30,174 narration clips spanning 92 hours, with data samples as shown in Fig. 1. As our investigation shows that narrations mostly occur when no actors are speaking (see Appendix A), to achieve realistic movie narrating, we propose the Movie Clip Narrating (MCN) task, which requires a model to narrate wherever there are no lines. This brings a potential benefit for identifying where to narrate in an unlabeled new movie, since the timestamps of the actor lines are easily accessible, e.g., from the movie script or via automatic methods such as OCR and ASR. Meanwhile, for the audience to accurately comprehend role-related plots, concrete role names should appear in the generated narration. For the MCN task, we reorganize the Movie101 dataset, merging the narration clips between two actor dialogues into a longer clip, to simulate real-scenario movie narrating. We thus obtain 14,109 long clips of variable length for narration generation. Moreover, to better evaluate the quality of model-generated narrations, we conduct human evaluations and design a new metric specific to movie narrating, namely the Movie Narration Score (MNScore), which aligns well with human evaluation. In addition to the MCN task, our dataset also supports the Temporal Narration Grounding (TNG) task, which asks a model to locate target clips in the movie according to text descriptions.
For both tasks, we benchmark the performance of existing methods, and further propose improved models that incorporate auxiliary external knowledge. Beyond MCN and TNG, Movie101 can also potentially support other movie understanding tasks such as visual question answering and action recognition.
The main contributions of this paper are as follows: 1) We propose a new benchmark for movie understanding, Movie101, with a large number of video-aligned text descriptions in Chinese. 2) We propose two primary tasks, MCN and TNG, and a new narrating evaluation metric, MNScore, where MCN is more in line with the needs of actual movie narrating and MNScore is more consistent with human evaluation. 3) We benchmark state-of-the-art models and propose improved models enhanced by external knowledge for MCN and TNG, respectively. We hope the proposed Movie101 benchmark can inspire more exploration into narrating and understanding whole movies.

Related Works
Datasets. Existing datasets to support the automatic narration generation task include M-VAD (Torabi et al., 2015) and MPII-MD (Rohrbach et al., 2015), which are merged into LSMDC (Rohrbach et al., 2017). M-VAD, which is collected based on an automatic AD segmentation and alignment method, contains 47K videos from 92 DVDs, with an average length of 6.2s, each with an aligned narration. MPII-MD contains 68K videos from 94 movies with an average duration of 3.9s, about half of which come with paired scripts and the other half with paired ADs. In addition to movies, TV shows are also good data sources for automatic narration generation. Lei et al. (2020) propose TV Show Caption (TVC), a variant of TV Show Retrieval (TVR). It contains 11K short videos averaging 9.1s in length, and 26K captions describing the visual content, dialogues, and subtitles. All the existing datasets are in English.
Video Captioning. As a classic vision-and-language task, video captioning requires a model to generate natural language descriptions for given videos. Solutions for normal video captioning have evolved from pre-designed templates (Kojima et al., 2002; Guadarrama et al., 2013) to sequence-to-sequence generation with deep neural networks (Pasunuru and Bansal, 2017). A challenging variant is dense video captioning (Krishna et al., 2017), which requires generating multi-sentence descriptions for long multi-event videos. The two-stage approach, which first performs proposal detection on the video and then generates descriptions for each proposal separately, has been dominant (Krishna et al., 2017; Park et al., 2019; Rohrbach et al., 2014). Recently, some works skip event detection and generate paragraph descriptions directly from the video, such as the one-stage video paragraphing model (OVP), obtaining performance competitive with previous works; our knowledge-enhanced movie narrating model is inspired by this line. Identity-aware video description that distinguishes different persons is more practical in real applications. Park et al. (2020) attempt role-aware movie narrating by distinguishing different people with labels such as PERSON1 and PERSON2. However, this fails to generate concrete role names and falls short in practicality.
Temporal Sentence Grounding. The temporal sentence grounding (TSG) task aims to localize a moment in a video based on a natural language query (Gao et al., 2017). A two-step pipeline has been the mainstream approach: first produce a large number of moment candidates via sliding windows, then rank them by their similarity to the query sentence. Follow-up works try to improve grounding performance by enhancing the interaction between the video and query modalities or by introducing novel detection heads (Lei et al., 2021; Zhang et al., 2020a). Among interaction methods, one line of work adopts an Iterative Alignment Network (IA-Net) to iteratively interact inter- and intra-modal features over multiple steps; another explicitly decomposes the video and query into multiple structured hierarchies and learns fine-grained semantic alignment among them. In this work, we propose to incorporate external knowledge on top of the IA-Net model structure.

Data Collection
Movie Acquisition. To the best of our knowledge, only a handful of platforms provide accessible movies in Chinese. The barrier-free channel of Xigua Video is one such platform, providing over 100 accessible movies online, and new movies are still being released, which can support further expansion of our dataset. From Xigua Video, we collect all 101 movies available to date and crawl as much meta information as possible for each movie, including title, introduction, genres, directors, actors, etc. We pay particular attention to actors, collecting actor names, role names, actor portraits, role rankings, and other information about important roles. We expect this information to benefit the movie narrating task and general movie understanding tasks. Narration and Line Extraction. As the movie lines and narrations are only available from the platform as subtitles and audio, respectively, we leverage OCR and automatic speech recognition (ASR) tools for transcription. For lines, we extract text from subtitles with the open-source OCR toolkit PaddleOCR at 2.4 FPS, and manually remove irrelevant subtitles from the beginning and end of each movie. For narrations, we extract the audio track from the movie and use the ASR service provided by iFlyTek, which detects speech in the audio and transcribes it into text. The service also supports speaker identification, which helps discriminate the narrator from the actors. However, the ASR service is not perfect: its outputs contain errors such as wrong characters, unreasonable sentence breaks, and misidentification of narrations as movie dialogues. Therefore, we recruit human annotators to correct the ASR transcription errors and manually remove non-narration text to improve data quality. We also delete irrelevant fragments at the beginning (e.g., movie synopsis, cast introductions) and the summary narration at the end.
For coherency, we further organize the narration fragments at the clip level. We merge two fragments if their temporal gap is less than 1 second, and apply a paragraph-length threshold of 100 characters to avoid excessively long clips. We also take punctuation into account; for example, a Chinese period is likely to mark the end of a narrative paragraph. Further details on data quality can be found in Appendix B. Movie101-N and Movie101-G. For real-life movie narrating, models are expected to narrate in the breaks between different actor dialogues, so we reorganize Movie101 to fit this task format. Concretely, we first merge the independent lines in Movie101 into dialogues, where two lines with a temporal gap shorter than 5 seconds are considered to belong to one dialogue. Then, we merge all the narration clips between two adjacent dialogues into a long paragraph. In this way, we obtain Movie101-N with narration paragraphs separated by dialogues, which simulates the practical narrating challenge well. Meanwhile, with the rich video-text pairs in Movie101, we create another variant, Movie101-G, to support temporal grounding, where narrations are taken as queries and the aligned videos serve as targets. For validation and testing, we carefully select 10 movies of different genres for each.
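As an illustration, the fragment-to-clip merging heuristics described above can be sketched as follows (a simplified sketch; the field names and the exact interplay of the gap, length, and punctuation rules are our own assumptions):

```python
def merge_fragments(fragments, max_gap=1.0, max_len=100):
    """Merge ASR narration fragments into clip-level paragraphs.

    Each fragment is a dict with 'start' and 'end' (seconds) and 'text'.
    Two adjacent fragments are merged when their temporal gap is under
    `max_gap` seconds and the merged text stays within `max_len`
    characters; a Chinese period ends a paragraph regardless of the gap.
    """
    clips = []
    for frag in fragments:
        if clips:
            prev = clips[-1]
            gap = frag["start"] - prev["end"]
            within_len = len(prev["text"]) + len(frag["text"]) <= max_len
            ends_paragraph = prev["text"].endswith("\u3002")  # Chinese period
            if gap < max_gap and within_len and not ends_paragraph:
                prev["end"] = frag["end"]
                prev["text"] += frag["text"]
                continue
        clips.append(dict(frag))
    return clips
```

A fragment pair separated by half a second is merged into one clip, while a 6-second gap starts a new clip.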

Dataset Statistics
Movie Properties.
Movie101 contains 101 movies, involving 41 genres (a movie can belong to up to 4 genres) and 645 roles in total. Fig. 2 shows the number of movies in the top 10 most popular genres, with comedy, romance, and action as the top 3. Clip Properties. Movie101 contains a total of 30,174 narration clips. Table 1 shows that Movie101-N contains much longer video clips and text descriptions than existing movie narrating datasets, while the length distribution in Fig. 4 indicates that clip length varies widely. Movie101-G contains 30,174 clips to be located across 101 movies. The average video length of 6,144 seconds also greatly exceeds that of existing TSG datasets.

Task Description
To help the visually impaired keep up with the plot of a movie, we first propose the Movie Clip Narrating (MCN) task, which aims to generate a plot-related paragraph description for a given clip in Movie101-N. Since narration styles may vary across movie genres, and role portraits are important external knowledge for a model to accurately describe the subjects of actions, we also provide this information in Movie101-N to support the MCN task.

Proposed Method
For the MCN task, with multimodal inputs including video, movie genres, role names, and actor portraits, we propose a Transformer-based (Vaswani et al., 2017) model with an encoder-decoder framework, namely the Role-pointed Movie Narrator (RMN), where the encoder mainly encodes video clips and the decoder generates narrations, as shown in Fig. 5 (a).
On the encoder side, to capture frame-level visual information, the video clip is embedded into a sequence of frame-level features. To emphasize the roles, we extract face features from each frame and concatenate them to the corresponding frame feature, ordered by the confidence scores of face detection. With learnable genre embeddings, genres are also represented as a sequence of genre features. After video and genre representation, we apply a Transformer encoder to perform cross-encoding. Then, we follow the one-stage video paragraphing model (OVP) in using a dynamic memory bank to refine the video-part representations, which is updated at each decoding step.
On the decoder side, in addition to the Transformer decoder, we enable the model to directly choose a complete role name from the movie cast according to context during token-by-token generation via a pointer network (Gu et al., 2016). At decoding step t, with the decoder hidden state h_t, we first calculate the token scores y_t^voc over the normal vocabulary. We then design a Role Selector module to obtain the name scores y_t^role over the external role names, computed with the context-filtered video feature as the query and the portrait features as the keys. Finally, the prediction distribution at step t is calculated as y_t = f([y_t^voc ; λ · y_t^role]), where [;] denotes concatenation, λ is a gate computed from h_t, and f(·) is the softmax function.
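A minimal sketch of this gated combination, assuming precomputed unnormalized scores (in the paper, the role scores come from the Role Selector attending over portrait features):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_step(y_voc, y_role, gate):
    """Combine vocabulary and role-name scores at one decoding step.

    y_voc:  unnormalized scores over the normal vocabulary.
    y_role: unnormalized scores over the movie's role names (illustrative
            stand-ins for the Role Selector outputs).
    gate:   scalar gate in (0, 1) computed from the decoder hidden state.
    Returns one distribution over vocabulary tokens plus role names.
    """
    scores = np.concatenate([y_voc, gate * y_role])
    return softmax(scores)
```

The final distribution therefore lets the decoder either emit an ordinary token or point directly at a complete role name.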

Evaluation
Existing movie narration benchmarks directly adopt ngram-based metrics including CIDEr, BLEU, and METEOR, as in normal video captioning. However, these metrics have widely reported pitfalls, such as underestimating semantically correct but textually inconsistent phrases (Zhang et al., 2020b; Shi et al., 2022). For movie narrating, a clip can be narrated in many different ways, while only one reference is available, so text matching alone is inadequate to measure the quality of a narration paragraph.
To better evaluate the generated narrations in the MCN task, we conduct a manual evaluation to investigate how humans assess different narrations. We randomly select 30 movie clips, each with 5 candidate narrations, of which 3 are derived from the predictions of different models and 2 are obtained by perturbing the ground-truth narrations. We then recruit 10 annotators to individually rank the candidates for each video in terms of accuracy, informativeness, and textual quality. Accuracy measures how accurately the narration describes the video, especially roles, actions, and objects; informativeness measures how richly the narration reveals the video content; textual quality is determined by the narration's fluency and grammatical correctness.
With the human evaluation results, we investigate a wide range of objective metrics: (1) state-of-the-art video captioning metrics based on deep neural networks, including CLIPScore (Hessel et al., 2021), BERTScore (Zhang et al., 2020b), and EMScore (Shi et al., 2022), which are reported to outperform ngram-based metrics in video captioning evaluation; (2) textual quality metrics, including n-gram diversity (Shetty et al., 2017) (DIV) and causal language model perplexity (PPL); (3) the F1 score of role name generation (RoleF1). For every two candidate narrations of a video, we use the human ranking as a reference to determine whether a metric correctly judges which of the two candidates is better, and this accuracy is used to evaluate each metric's correlation with human judgment. Finally, we settle on a new metric, the Movie Narration Score (MNScore), computed as a weighted combination of EMScore (ems), BERTScore (berts), and RoleF1 (rf1). As shown in Table 2, BERTScore outperforms ngram-based metrics in narration evaluation accuracy, while our newly proposed MNScore achieves the best alignment with human evaluation. More details on the construction of the candidate narrations and the above metrics are presented in Appendix C.
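As a minimal sketch of such a combination (the weights below are illustrative placeholders, not the paper's actual coefficients, which are chosen to maximize agreement with human ranking):

```python
def mn_score(ems, berts, rf1, w_em=0.4, w_bert=0.3, w_role=0.3):
    """Combine EMScore, BERTScore, and RoleF1 into one narration score.

    The weights here are placeholders for illustration only; in the
    paper they are tuned against human preference rankings.
    """
    return w_em * ems + w_bert * berts + w_role * rf1
```

With all component scores in [0, 1] and weights summing to 1, the combined score also stays in [0, 1].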

Experiments
Implementation Details. Our models are trained with next-token language modeling under the maximum likelihood estimation (MLE) objective. For videos, we use CLIP (Radford et al., 2021), pre-trained on large-scale image-text pairs, and MIL-NCE (Miech et al., 2020), pre-trained on HowTo100M videos (Miech et al., 2019), to extract frame-level CLIP and S3D features with dimensions of 512 and 1024, respectively, at 1 FPS, and concatenate them. For faces in video frames and portraits, we use the Arcface model (Deng et al., 2019), pre-trained on MS1M (Guo et al., 2016), to extract face features. Results & Analysis. As shown in Table 3, RMN outperforms the baselines by a large margin, especially on RoleF1. This indicates that our model learns to generate role names from external knowledge with the help of the pointer network. To verify the contribution of the genre and face representations in our RMN model, we also perform an ablation study by progressively adding these representations as input. The results show that face features extracted from video frames bring significant gains in role awareness, demonstrating that using face features to bridge the video content and the external actor portraits is beneficial for generating role-related narrations. Qualitative results can be found in Appendix D.

Task Description
To help people locate clips of interest during movie entertainment, an AI agent should be able to understand users' intentions and locate the target clips. To this end, we propose the Temporal Narration Grounding (TNG) task: given a clip narration as the query, TNG aims to predict the starting and ending times of the clip in the whole movie.

Proposed method
Existing temporal sentence grounding models can hardly handle an entire movie as input with limited computational resources. Thus, we propose a two-stage framework for the TNG task: global shot retrieval coarsely locates the target clip in the first stage, and local temporal grounding finalizes the precise timestamps of the target clip in the second stage, as shown in Fig. 5 (b). Global Shot Retrieval. To find the approximate location of the target, we treat this stage as a text-video retrieval subtask. We divide a movie into 20s-long shots, and the shot with the highest similarity to the text query is used as the anchor for further grounding in the second stage. To train such a retrieval system, we construct a temporary dataset, Movie101-GSR(temp). Concretely, after cutting the movie into shots, each shot and each annotated narration in Movie101 are judged by their temporal overlap as to whether they can be considered an aligned video-text pair. We build the retrieval model by transferring a Chinese vision-language pre-training (VLP) model, ChineseCLIP (CNCLIP), from image-text to video-text. Specifically, the shot frames are separately encoded as image features by the visual encoder of CNCLIP, and the final video feature is obtained by mean pooling the CLS tokens of all frames. We then perform contrastive learning between the video and text features on Movie101-GSR(temp) to fine-tune the modified CNCLIP. Local Temporal Grounding. After obtaining the anchor shot in the first stage, we further localize the target clip within a 200-second window around the anchor shot. This requires temporal sentence grounding in a 200s-long movie clip, where comprehending the actions of different roles is critical. Therefore, based on the state-of-the-art TSG model IA-Net, we propose the Role-aware Narration Locator (RNL).
With a bidirectional GRU (Chung et al., 2014) visual encoder, we encode the input frame features into temporal context-aware frame representations V. We additionally extract face features from the frames and encode them with a fully connected (FC) layer to filter key face information F. We then finalize the visual representation by summing V and F. For text encoding, to relate role names in the text query to roles in the video, we extract face features from the portraits corresponding to the role names and encode them as visual token representations with an FC layer, which are then concatenated to the query's textual token representation sequence. During training, for each target, we randomly select a 200s-long clip window covering the target in each training epoch. We also construct a temporary dataset, Movie101-LTG(temp), with fixed windows to separately evaluate the second-stage model performance.
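As a rough sketch, the role-aware fusion described above could look like the following (the projection matrices stand in for the FC layers; all shapes and names are our own illustration):

```python
import numpy as np

def role_aware_encode(frame_feats, face_feats, query_tokens, portrait_feats,
                      w_face, w_portrait):
    """Fuse face/portrait information into video and text representations.

    frame_feats:    (T, d) temporal context-aware frame features V.
    face_feats:     (T, d_f) per-frame face features.
    query_tokens:   (L, d) textual token features of the query.
    portrait_feats: (R, d_f) portrait features for role names in the query.
    w_face, w_portrait: (d_f, d) projections standing in for FC layers.
    """
    video = frame_feats + face_feats @ w_face   # V + F (summed fusion)
    roles = portrait_feats @ w_portrait         # portraits as visual tokens
    text = np.concatenate([query_tokens, roles], axis=0)  # append to query
    return video, text
```

The video side keeps its original length T, while the text side grows by one token per role name mentioned in the query.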

Experiments
Implementation Details. For Global Shot Retrieval, we use average Recall@n (n ∈ {1, 5, 10}) to evaluate retrieval performance on all movies. For Local Temporal Grounding, following previous works (Zhang et al., 2020a), we use "R@n, IoU@m" as metrics, defined as the percentage of queries for which at least one of the top-n proposals has a temporal IoU with the ground truth larger than m. We fine-tune CNCLIP-huge on our Movie101-GSR(temp) for Global Shot Retrieval, and benchmark two code-released state-of-the-art temporal grounding models, 2D-TAN (Zhang et al., 2020a) and IA-Net, on Movie101-LTG(temp) for Local Temporal Grounding. In our RNL model, the video frame, face, and text feature extractors are pre-trained MIL-NCE, Arcface (the same as in the MCN task), and BERT-base-Chinese (Devlin et al., 2019), respectively.
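The "R@n, IoU@m" metric can be sketched as follows (a straightforward illustration; interval and proposal representations are our own assumptions):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_n_iou(predictions, ground_truths, n=1, m=0.5):
    """Fraction of queries where any top-n proposal has IoU > m with GT.

    predictions:   per-query lists of (start, end) proposals, best first.
    ground_truths: per-query (start, end) ground-truth spans.
    """
    hits = 0
    for props, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) > m for p in props[:n]):
            hits += 1
    return hits / len(ground_truths)
```

For example, a proposal (0, 10) against a ground truth of (5, 15) has an IoU of 1/3, so it counts as a hit at m = 0.3 but not at m = 0.5.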
Results & Analysis. Table 4 and Table 5 show the performance of models on Global Shot Retrieval and Local Temporal Grounding, respectively. Our RNL outperforms the baselines by introducing role-aware video and text encoding, indicating that distinguishing the actions of different roles is critical for grounding movie narration. Furthermore, we perform an ablation study to verify the effectiveness of role-aware encoding. As shown in Table 5, adding face features to either the video or the text representations outperforms our base method, IA-Net, and RNL with both role-aware video and text encoding achieves the best performance. Table 6 shows the performance of combined inference with Global Shot Retrieval and Local Temporal Grounding. We additionally report the performance of k-way re-ranking, where each of the top-k shots retrieved in the first stage is used as an anchor in the second stage, and all resulting predictions are re-ranked by their confidence scores. The experimental results show that k-way re-ranking improves Rank@5 performance but harms Rank@1 performance. Qualitative results can be found in Appendix D.
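The k-way re-ranking procedure can be sketched as follows (a simplified illustration; `ground_fn` is a hypothetical stand-in for the second-stage RNL model):

```python
def k_way_rerank(anchors, ground_fn, k=5):
    """Ground from each of the top-k retrieved shots, then re-rank.

    anchors:   anchor shots from Global Shot Retrieval, best first.
    ground_fn: maps an anchor shot to a list of (span, confidence) pairs
               produced by Local Temporal Grounding.
    Returns all predicted spans sorted by descending confidence.
    """
    preds = []
    for anchor in anchors[:k]:
        preds.extend(ground_fn(anchor))
    preds.sort(key=lambda p: p[1], reverse=True)
    return [span for span, _ in preds]
```

Because predictions from lower-ranked anchors can overtake the top-1 anchor's prediction, this widens the candidate pool (helping Rank@5) at the cost of occasionally demoting a correct top-1 result.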

Conclusion
In this work, we propose Movie101, a large-scale Chinese video benchmark for movie understanding. To assist visually impaired people in enjoying movies, we propose the more realistic Movie Clip Narrating task to address automatic movie description and design a human-preference-compatible metric, MNScore, for narrating evaluation. Movie101 also supports the Temporal Narration Grounding task, which is more challenging than previous TSG benchmarks. Furthermore, our experiments validate the importance of external knowledge, including genres and roles, for movie understanding. However, there is still a significant gap between our models and expert annotations, which shows that further research is needed to help visually impaired people enjoy movies with the aid of AI.

Limitations
Keeping narration coherent within a movie is crucial for visually impaired people to enjoy the movie.
In this work, we take a step toward this target by setting the ground-truth texts in the Movie Clip Narrating task as narration paragraphs and providing longer video clips as inputs. However, how to ensure description coherence across different clips within a movie has not been studied in this work; this requires a higher-level ability to comprehend the whole movie and connect different plots. We leave this to future investigation.

Ethics Statement
We propose Movie101, a new benchmark to support the exploration of technologies that benefit the accessibility of the visually impaired. There are two potential ethical considerations in our work, regarding the data source and crowdsourcing services. Data Source. The collected movies are publicly available from Xigua Video and are allowed to be crawled according to the service contract of the website. Considering copyright, we will only release the URL list of the movies. Besides, our data source does not contain any information that names or uniquely identifies individuals, nor any offensive content. Crowdsourcing Services. We recruited 20 Chinese college students (12 females and 8 males) via social media. For cleaning the ASR outputs, workers were required to correct errors in the narration text while watching the movie; for each movie, this took about 2 hours with a payment of 50 RMB ($7.40 USD). Reviewing the corrections for each movie took about 30 minutes with a payment of 25 RMB ($3.70 USD). Our payment is fair and reasonable in China, especially since the work is easy and fun. Before the annotation work began, we described the future use of the data in the task document to ensure that everyone was informed.

A Narration Distribution
Clips where "no actors are speaking" refer to any scene in which no verbal dialogue is employed by the actors, regardless of whether they are visually present; this definition encompasses, for example, a scene depicting only the sky. We detail the dialogues and narrations in the 101 collected movies. By merging the actor lines, we obtain a total of 15,307 dialogues, constituting 15,206 dialogue gaps with a total duration of 99.4 hours. The 30,174 narration clips we collect fill 95.3% of the dialogue gaps by count and cover 92.9% by duration. It is therefore reasonable to assume that where there are no lines, there is a need for narration.

B Dataset Quality Description
We adopt a two-stage annotation process to ensure the quality of the narrations. In the first stage, a group of workers is recruited to clean the data according to our guidelines. In the second stage, another group of workers further checks and corrects the annotated data. The heuristics used to divide the paragraphs are designed based on our observations. We further conduct a manual evaluation of narration quality. Of 300 randomly sampled paragraphs, (1) in terms of narration recognition, 96.7% are textually consistent with the original ADs; (2) as for paragraph coherence, 90% maintain complete and coherent semantics, 7.7% should be merged with their context, and 2.3% should be divided into multiple paragraphs. The narrations are thus of sufficient quality to support the downstream tasks.

C Implementation Details
Candidate Narrations. In Section 4.3, we provide 5 different candidate narrations for each sampled movie clip for human evaluators to rank. These candidates are created as follows: Metrics Implementation. For CLIP-based metrics, including CLIPScore and EMScore, we fine-tune ChineseCLIP-huge on our dataset in the same way as in Section 5.2. For each movie clip and generated narration, CLIPScore is calculated with the mean-pooled feature of 10 uniformly selected frames and the overall text feature, while EMScore is calculated with all selected frame features and the textual token features. For BERTScore, we use the BERT-base-Chinese (Devlin et al., 2019) checkpoint and rescale the raw BERTScore with the official baseline (https://github.com/Tiiiger/bert_score/blob/master/journal/rescale_baseline.md). For DIV, we calculate 1-gram and 2-gram diversity following Shetty et al. (2017) and average them. For PPL, we obtain the perplexity of each narration with the causal Ernie 3.0 model (Sun et al., 2021) following the HuggingFace implementation (https://huggingface.co/spaces/evaluate-metric/perplexity). For RoleF1, we extract role names from the ground truth and from the generated narration. We measure how well the generated narration covers the roles appearing in the movie clip by Recall; since generated role names may also come from the model's hallucination, for example from a wrong movie, we also take Precision into account. Finally, we compute the F1 score from Precision and Recall. Hyperparameters and Computation. We detail the key hyperparameters and computational cost of model training in Table 7. For each model, the results are derived from a single run.
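The RoleF1 computation described above can be sketched as follows (a simplified sketch; it assumes role names have already been extracted from the texts, and treats role sets rather than repeated mentions):

```python
def role_f1(pred_roles, gt_roles):
    """F1 over role names mentioned in a generated narration.

    pred_roles: role names extracted from the generated narration.
    gt_roles:   role names extracted from the ground-truth narration.
    Recall rewards covering the roles in the clip; Precision penalizes
    hallucinated roles, e.g. names from a different movie.
    """
    pred, gt = set(pred_roles), set(gt_roles)
    if not pred or not gt:
        return 0.0
    correct = len(pred & gt)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gt)
    return 2 * precision * recall / (precision + recall)
```

A narration mentioning one correct and one wrong role against a two-role reference scores 0.5 (Precision 0.5, Recall 0.5).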

D Qualitative Result
D.1 Movie Clip Narrating. Fig. 6 shows qualitative results for the MCN task, including the generations of the baselines and our proposed RMN model, together with the scores from previous metrics and our proposed MNScore. The vanilla Transformer and OVP can correctly mention some actions but fail to generate correct role names, because these roles never appear during training. With the help of the Role Selector module, our RMN relates the roles in video clips to their role names well. In addition, these cases demonstrate that our newly proposed MNScore evaluates more consistently with humans. Fig. 7 shows qualitative results of our proposed two-stage method. Through Global Shot Retrieval, we obtain an anchor shot near the target clip from the whole movie, which further helps Local Temporal Grounding to locate the final target.

[Example narrations from Fig. 6 (VT, OVP, RMN, and ground truth, in Chinese with English translations) omitted.] In the narration texts, green and red characters denote the correctly and wrongly generated role names, respectively. In the tables, metrics in green indicate that the ranking of candidates by the metric is consistent with the human ranking, while red indicates inconsistency.