BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation

We present a large-scale video subtitle translation dataset, BigVideo, to facilitate the study of multimodal machine translation. Compared with the widely used How2 and VaTeX datasets, BigVideo is more than 10 times larger, consisting of 4.5 million sentence pairs and 9,981 hours of videos. We also introduce two deliberately designed test sets to verify the necessity of visual information: Ambiguous, with the presence of ambiguous words, and Unambiguous, in which the text context is self-contained for translation. To better model the common semantics shared across texts and videos, we introduce a contrastive learning method in the cross-modal encoder. Extensive experiments on BigVideo show that: a) visual information consistently improves the NMT model in terms of BLEU, BLEURT, and COMET on both the Ambiguous and Unambiguous test sets; b) visual information helps disambiguation, compared to the strong text baseline, on terminology-targeted scores and in human evaluation. The dataset and our implementations are available at https://github.com/DeepLearnXMU/BigVideo-VMT.


Introduction
Humans are able to integrate both language and visual context to understand the world. From the perspective of NMT, making use of such information is also much needed to approach human-level translation abilities. To facilitate Multimodal Machine Translation (MMT) research, a number of datasets have been proposed, including image-guided translation datasets (Elliott et al., 2016; Gella et al., 2019; Wang et al., 2022) and video-guided translation datasets (Sanabria et al., 2018; Wang et al., 2019; Li et al., 2022b). However, the conclusion about the effects of visual information is still unclear for MMT research (Caglayan et al., 2019). Previous work has suggested that visual information is only marginally beneficial for machine translation (Li et al., 2021; Caglayan et al., 2021), especially when the text context is not complete. The most likely reason is that existing datasets focus on captions describing images or videos, which are not large and diverse enough. The text inputs are often simple and sufficient for the translation task (Wu et al., 2021). Take the widely used Multi30K as an example: it consists of only 30K image captions, while typical text translation systems are often trained on several million sentence pairs. We argue that studying the effects of visual contexts in machine translation requires a large-scale, diverse dataset for training and a real-world, complex benchmark for testing. To this end, we propose BIGVIDEO, a large-scale video subtitle translation dataset. We collect human-written subtitles from two popular online video platforms, Xigua and YouTube. BIGVIDEO consists of 155 thousand videos and 4.5 million high-quality parallel sentences in English and Chinese. We highlight the key features of BIGVIDEO as follows: a) The size of BIGVIDEO surpasses the largest available video machine translation datasets, HOW2 and VATEX, by one order of magnitude. b) To investigate the need for visual information, two test sets are annotated by language experts, referred to as AMBIGUOUS and UNAMBIGUOUS. In AMBIGUOUS, the source input is insufficient on its own and requires the video to disambiguate the translation. The experts also labelled the ambiguous words to help evaluate whether improvements come from visual contexts. In UNAMBIGUOUS, actions or visual scenes in the videos are mentioned in the subtitles, but the source sentences are self-contained for translation.
To make the most of visual information for MMT, we propose a unified encoder-decoder framework. The model has a cross-modal encoder that takes both videos and texts as inputs. Motivated by recent work on cross-modal learning (Li et al., 2020; Qi et al., 2020; Xia et al., 2021), we also introduce a contrastive learning objective to further bridge the representation gap between text and video and project them into a shared space. As such, the visual information can potentially contribute more to the translation model.
We conduct extensive experiments on the proposed benchmark BIGVIDEO and report results on BLEU (Papineni et al., 2002), BLEURT (Sellam et al., 2020), COMET (Rei et al., 2020), terminology-targeted metrics, and human evaluation. We also introduce the large-scale WMT19 training data, which contains 20.4M parallel sentences, to build a strong baseline model. The experiments show that visual contexts consistently improve performance on both the AMBIGUOUS and UNAMBIGUOUS test sets over the strong text-only model. This finding differs slightly from previous studies and underscores the importance of large-scale, high-quality video translation data. Further, the contrastive learning method can further boost translation performance over other visually guided models, which shows the benefit of closing the representation gap between texts and videos.

Related Work
Video-guided Machine Translation. The VATEX dataset has been introduced for the video-guided machine translation task (Wang et al., 2019). It contains 129K bilingual captions paired with video clips. However, as pointed out by Yang et al. (2022), captions in VATEX have sufficient information for translation, and models trained on VATEX tend to ignore video information. Beyond captions, Sanabria et al. (2018) consider video subtitles to construct the HOW2 dataset. HOW2 collects instructional videos from YouTube and obtains 186K bilingual subtitle sentences. To construct a challenging VMT dataset, Li et al. (2022b) collect 39K ambiguous subtitles from movies or TV episodes to build VISA. However, both HOW2 and VISA are limited in scale and diversity, given the training needs of large models. In contrast, we release a larger video subtitle translation dataset, with millions of bilingual ambiguous subtitles, covering all categories on the YouTube and Xigua platforms.
To leverage video inputs in machine translation models, Hirasawa et al. (2020) use pretrained models such as ResNet (He et al., 2016), Faster R-CNN (Ren et al., 2015), and I3D (Carreira and Zisserman, 2017). An additional attention module is designed in the RNN decoder to fuse visual features. To better learn temporal information in videos, Gu et al. (2021) propose a hierarchical attention network to model video-level features. Different from previous work, we use a unified encoder to learn both video and text features. Specifically, a contrastive learning objective is adopted to learn cross-modal interaction.
Image-guided Machine Translation. Images as additional inputs have long been used for machine translation (Hitschler et al., 2016). For neural models, several attempts have focused on enhancing the sequence-to-sequence model with strong image features (Elliott and Kádár, 2017; Yao and Wan, 2020; Lin et al., 2020; Yin et al., 2020; Su et al., 2021; Li et al., 2022a; Zhu et al., 2022; Lan et al., 2023). However, Li et al. (2021) and Wu et al. (2021) point out that images in Multi30K provide little information for translation. In this work, we focus on videos as additional visual inputs for subtitle translation. Videos illustrate objects, actions, and scenes, and thus contain more information than images. Subtitles are often in spoken language, which carries inherent ambiguities due to multiple potential interpretations (Mehrabi et al., 2022). Hence, our dataset can be a complement to existing MMT datasets. In Table 1, the text length (Len.) reports the average length of source (English) sentences, and the video length (Sec.) reports the average duration of videos in seconds, for fair comparison.

Dataset
We present BIGVIDEO, consisting of 150 thousand unique videos (9,981 hours in total) with both English and Chinese subtitles. The videos are collected from two popular online video platforms, YouTube and Xigua. All subtitles are human-written. Table 1 lists statistics of our dataset and existing video-guided translation datasets. Among existing datasets, ours is significantly larger, with more videos and parallel sentences.

BIGVIDEO Dataset
To obtain high-quality video-subtitle pairs, we collect videos with both English and Chinese subtitles from YouTube and Xigua. Both platforms provide three types of subtitles: 1) creator, uploaded by the video creator; 2) auto-generated, produced by an automatic speech recognition model; and 3) auto-translated, produced by a machine translation model. We only consider videos whose English and Chinese subtitles are both uploaded by creators, in order to obtain high-quality parallel subtitles. These videos and subtitles are often created by native or fluent speakers of English and Chinese. In total, we collect 13K videos (6K hours in total) from YouTube and 2K videos (3.9K hours in total) from Xigua. Preprocessing. We first re-segment English subtitles into full sentences. To ensure the quality of the parallel subtitles, we use quality estimation scores (e.g., the COMET score) to filter out low-quality pairs. More details are provided in Appendix B.1. Ultimately, 3.3M sentences paired with video clips are kept for YouTube, and 1.2M for Xigua. The average lengths of English and Chinese sentences are reported in Table 1.

Dataset Analysis
Quality Evaluation. To assess the quality of the text pairs, we randomly select 200 videos from each source and recruit seven annotators to rate the quality of the subtitle pairs. For each video, we randomly select at most 20 clips for evaluation. All annotators are fluent in English and Chinese. After watching the video clips and subtitles, the annotators are asked to rate each subtitle pair from 1 (worst) to 5 (best) on fluency (whether the source sentence, in English, is fluent and grammatically correct) and translation quality (whether the Chinese subtitle is semantically equivalent to the English subtitle). Detailed guidelines are provided in Appendix F. From Table 2, English sentences from YouTube and Xigua have average fluency scores of 4.8 and 4.6, respectively, which shows that the English subtitles are fluent and rarely contain errors. In terms of translation quality, we find that more than 96 percent of the pairs are equivalent or mostly equivalent, with only minor differences (e.g., style). Diversity Evaluation. In addition to size and quality, diversity is also critical for modeling alignments between parallel texts (Tiedemann, 2012).
Prior work calculates unique n-grams and part-of-speech (POS) tags to evaluate linguistic complexity (Wang et al., 2019). Besides word-level metrics, we use the video category distribution to assess video-level diversity.
Since the source texts of our dataset, VATEX, and HOW2 are all in English, we compare unique n-grams and POS tags on the source texts. For unique POS tags, we compare the four most common types: verb, noun, adjective, and adverb. As shown in Figure 2, our data from both XIGUA and YOUTUBE have substantially more unique n-grams and POS tags than VATEX and HOW2. Evidently, our dataset covers a wider range of actions, objects, and visual scenes.
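As a concrete illustration of the word-level diversity metric, counting unique n-grams can be sketched as follows (a minimal sketch; tokenization here is plain whitespace splitting, whereas the paper may use a proper tokenizer):

```python
from collections import Counter

def unique_ngrams(sentences, n):
    """Count distinct word n-grams across a corpus, a simple
    lexical-diversity measure used to compare datasets."""
    grams = set()
    for sent in sentences:
        tokens = sent.lower().split()
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return len(grams)

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
print(unique_ngrams(corpus, 1))  # distinct unigrams: 7
print(unique_ngrams(corpus, 2))  # distinct bigrams: 8
```

A larger unique-n-gram count at a fixed corpus size indicates broader lexical coverage, which is the comparison drawn in Figure 2.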
To evaluate video-level diversity, we compare category distributions among the three datasets. The YouTube platform classifies videos into 15 categories. Since videos collected from the Xigua platform do not have category labels, we train a classifier on the YouTube data to label them. Details of the classifier are in Appendix B.2. Figure 3 depicts the distributions of the three datasets. While both VATEX and HOW2 have long-tail distributions on several categories (e.g., "Nonprofits & Activism" and "News & Politics"), BIGVIDEO has at least 1,000 videos in each category, which forms a more diverse training set.

Test Set Annotation Procedure
Subtitles often contain semantic ambiguities (Gu et al., 2021), which can potentially be resolved by watching the videos. In order to study how visual contexts benefit machine translation, we create two test sets: AMBIGUOUS contains ambiguous subtitles for which videos provide strong disambiguation signals, while UNAMBIGUOUS consists of self-contained subtitles whose videos are related but whose text alone contains enough context for translation. Statistics of the two test sets are listed in Table 3.
We randomly sample 200 videos from each of Xigua and YouTube and hire four professional annotators fluent in both English and Chinese to annotate the test set. Annotators are first asked to remove sentences that are not related to the videos; in this step, we filter out about twenty percent of the sentences. Annotators are then asked to rewrite the Chinese subtitle if it is not perfectly equivalent to the English subtitle. Next, we ask the annotators to determine whether the source sentence contains semantic ambiguity. Specifically, annotators are instructed to identify ambiguous words or phrases in both the English and Chinese sentences, as illustrated in Figure 4. We finally obtain 2,394 samples in our test set: 36.6% of the sentences are in AMBIGUOUS and 63.4% are in UNAMBIGUOUS. In AMBIGUOUS, we annotate 745 ambiguous terms. These statistics indicate that videos play an important role in our dataset. Annotation instructions and detailed procedures are provided in Appendix F.

Method

Model
To better leverage videos to help translation, we present our video-guided machine translation model, as displayed in Figure 5. Our model can be seamlessly plugged into a pretrained NMT model, which lets it benefit from large-scale parallel training data. Importantly, we design a contrastive learning objective to further drive the translation model to learn the shared semantics between videos and text. Cross-modal Encoder. Our model takes both videos and text as inputs. Text inputs are first represented as a sequence of tokens x and then converted to word embeddings through the embedding layer. Video inputs are represented as a sequence of continuous frames v. We use a pretrained encoder to extract frame-level features, which is frozen in all experiments. We then apply a linear projection to obtain video features with the same dimension as the text embeddings. To further model temporal information, we add positional embeddings to the video features, followed by layer normalization. The video features v_emb and text embeddings x_emb are then concatenated and fed into the Transformer encoder. Text Decoder. Our decoder is the original Transformer decoder, which generates tokens autoregressively conditioned on the encoder outputs. We use the cross-entropy loss as the training objective:

$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_i, v_i),$

where y_i denotes the target-language text sequence for the i-th sample in a batch of N samples.
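The cross-modal encoder input described above can be sketched as follows (a simplified NumPy illustration of our reading of the architecture, not the released implementation; dimensions, initialization, and the layer-norm placement are assumptions):

```python
import numpy as np

def build_encoder_input(text_emb, video_feat, W_proj, pos_emb):
    """Sketch of the cross-modal encoder input: project frozen
    frame features to the text embedding size, add temporal
    positional embeddings and layer-normalize, then concatenate
    with the token embeddings along the sequence axis."""
    v = video_feat @ W_proj                    # (T_v, d_model)
    v = v + pos_emb[: v.shape[0]]              # temporal positions
    mu = v.mean(-1, keepdims=True)
    sd = v.std(-1, keepdims=True)
    v = (v - mu) / (sd + 1e-6)                 # layer normalization
    return np.concatenate([v, text_emb], axis=0)

d_model, d_video = 512, 768
text_emb = np.random.randn(20, d_model)    # 20 subword tokens
video_feat = np.random.randn(12, d_video)  # 12 sampled frames (ViT, 768-d)
W_proj = np.random.randn(d_video, d_model) * 0.02
pos_emb = np.random.randn(12, d_model) * 0.02
x = build_encoder_input(text_emb, video_feat, W_proj, pos_emb)
print(x.shape)  # (32, 512): 12 video slots followed by 20 text tokens
```

The concatenated sequence is then processed by a standard Transformer encoder, so video and text tokens can attend to each other freely.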

Contrastive Learning Objective
In order to learn shared semantics between videos and text, we introduce a cross-modal contrastive learning (CTR) objective. The idea of the CTR objective is to pull the representations of paired video-text samples closer and push irrelevant ones further apart.
Formally, given a positive text-video pair (x_i, v_i), we use the remaining N − 1 irrelevant text-video pairs (x_i, v_j), j ≠ i, in the batch as negative samples.

[Figure 5: Model overview. A frozen feature extractor feeds a video embedder, and a text embedder encodes the subtitle (example: "That's a good chip."); positive and negative text-video pairs are formed within the batch for contrastive learning.]
The contrastive learning objective (Sohn, 2016) is:

$L_{CTR} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(x^p_i, v^p_i)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(x^p_i, v^p_j)/\tau)},$

where x^p_i and v^p_i are the representations of the i-th text and video, sim(·) is the cosine similarity function, and the temperature τ controls the strength of penalties on hard negative samples (Wang and Liu, 2021). Text and Video Representations. Importantly, since videos and subtitles are only weakly aligned on the temporal dimension (Miech et al., 2019), we first average the video embeddings and text embeddings over the time dimension. We then apply two projection heads ("MLP" in Figure 5) to map the pooled representations into the same semantic space (Chen et al., 2020).
In the end, we sum the two losses to obtain the final training objective:

$L = L_{CE} + \alpha L_{CTR},$

where α is a hyper-parameter balancing the two loss terms.
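The in-batch contrastive objective can be sketched as follows (a NumPy illustration under the assumption that positive pairs lie on the diagonal of the batch similarity matrix; the released code may differ in details such as masking or pooling):

```python
import numpy as np

def ctr_loss(text_repr, video_repr, tau=0.002):
    """In-batch contrastive (InfoNCE-style) loss: the i-th text is
    pulled toward its own video and pushed away from the other N-1
    videos in the batch. Inputs are (N, d) pooled representations."""
    t = text_repr / np.linalg.norm(text_repr, axis=1, keepdims=True)
    v = video_repr / np.linalg.norm(video_repr, axis=1, keepdims=True)
    logits = (t @ v.T) / tau                       # cosine similarity / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # positives on the diagonal

# Toy check: perfectly aligned pairs give near-zero loss,
# mismatched pairs give a large loss.
t = np.array([[1., 0.], [0., 1.], [1., 1.], [1., -1.]])
loss_good = ctr_loss(t, t.copy())
loss_bad = ctr_loss(t, np.roll(t, 1, axis=0))
print(loss_good, loss_bad)
```

With the very small temperature used in the paper (τ = 0.002), the softmax is sharply peaked, so hard negatives are penalized strongly.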

Experimental Setup
Implementation Details. We evaluate our method on three video translation datasets: VATEX, HOW2, and our proposed BIGVIDEO. More dataset details can be found in Appendix C.1.
Our code is based on the fairseq toolkit (Ott et al., 2019). The Transformer-base model follows Vaswani et al. (2017): both the encoder and decoder have 6 layers, 8 attention heads, hidden size 512, and FFN size 2048. We use post-layer normalization for all models. On VATEX, we follow the Transformer-small setting from Wu et al. (2021) for better performance: 6 layers for the encoder/decoder, hidden size 512, FFN size 1024, and 4 attention heads.
All experiments are run on 8 NVIDIA V100 GPUs with mixed-precision training (Das et al., 2018), where the batch assigned to each GPU contains 4,096 tokens. More training details can be found in Appendix C.2. We stop training if the performance on the validation set does not improve for ten consecutive epochs. The running time is about 64 GPU hours for our system. During inference, the beam size and the length penalty are set to 4 and 1.0. We apply byte pair encoding (BPE) with 32K merge operations to preprocess the sentences of our dataset. During training and testing, we uniformly sample a maximum of 12 frames as the video input. The text length is limited to 256 tokens. For the contrastive learning loss, we set α to 1.0 and τ to 0.002. The choices of hyper-parameters are discussed in Appendix D.
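The uniform frame sampling mentioned above can be sketched as follows (the exact index-rounding scheme is an assumption; the released code may differ):

```python
def uniform_sample_frames(num_frames, max_frames=12):
    """Uniformly pick at most `max_frames` frame indices from a clip,
    mirroring the cap of 12 frames used for video inputs during
    training and testing."""
    if num_frames <= max_frames:
        return list(range(num_frames))
    step = num_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

print(uniform_sample_frames(5))   # short clip: keep all 5 frames
print(uniform_sample_frames(48))  # long clip: 12 evenly spaced indices
```

Short clips keep every frame, while longer clips are subsampled at a constant stride so that coverage spans the whole clip.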
For video features, we extract both 2D and 3D features to compare their effects. Concretely, we experiment with two pretrained models: a) the Vision Transformer (VIT) (Dosovitskiy et al., 2021), which extracts frame-level features, and b) the SlowFast model (SLOWFAST) (Feichtenhofer et al., 2019), which extracts video-level features. For 2D features, we first extract images at a fixed frame rate (3 frames per second). Then we use the pretrained Vision Transformer (ViT) to encode each frame into a 768-dimensional vector, taking the representation of the [CLS] token as the global information of the frame. For 3D features, we extract 2304-dimensional SlowFast features at 2/3 frames per second. Baselines and Comparisons. For baselines, we consider the base version of the Transformer (TEXT-ONLY), which only takes texts as inputs. For comparison, since most recent MMT studies focus on image-guided machine translation, we implement two recent image-based MMT models: a) the gated fusion model (GATED FUSION), which fuses visual and text representations with a gate mechanism (Wu et al., 2021), and b) the selective attention model (SELECTIVE ATTN), which uses single-head attention to connect text and image representations (Li et al., 2022a). We extract image features using ViT and obtain the visual feature by averaging image features over the temporal dimension. The visual feature is then fused with the text representations, the same as in the original GATED FUSION and SELECTIVE ATTN. For HOW2 and VATEX, we additionally include the baseline models provided by the original papers. Evaluation Metrics. We evaluate our results with three metrics: detokenized sacreBLEU, COMET (Rei et al., 2020), and BLEURT (Sellam et al., 2020). In order to evaluate whether videos are leveraged for disambiguation, we further consider three terminology-targeted metrics (Alam et al., 2021):
• Exact Match: the accuracy over the annotated ambiguous words. If the correct ambiguous word or phrase appears in the output, we count it as correct.
• Window Overlap: indicates whether the ambiguous terms are placed in the correct context. For each target ambiguous term, a window is set to contain its left and right words, ignoring stopwords. We calculate the percentage of words in the window that are correct. In practice, we set the window size to 2 (Window Overlap-2) and 3 (Window Overlap-3).
• Terminology-biased Translation Edit Rate (1-TERm): a modified translation edit rate (Snover et al., 2006) in which words in ambiguous terms are assigned an edit cost of 2 and all other words a cost of 1.
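A minimal sketch of the Exact Match metric (simplified to substring matching on the system output; the official terminology evaluation of Alam et al. (2021) is more elaborate, e.g., it lemmatizes and aligns terms):

```python
def exact_match(hypotheses, term_lists):
    """Terminology exact-match accuracy: the fraction of annotated
    ambiguous terms that appear verbatim in the system output."""
    hit = total = 0
    for hyp, terms in zip(hypotheses, term_lists):
        for term in terms:
            total += 1
            hit += term in hyp
    return hit / total if total else 0.0

hyps = ["他打了一记漂亮的正手击球", "这块芯片很好"]
terms = [["击球"], ["芯片", "薯片"]]
print(exact_match(hyps, terms))  # 2 of 3 annotated terms matched
```

Window Overlap would extend this by also checking the non-stopword neighbors of each matched term, and 1-TERm by reweighting the edit costs in TER.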

Main Results
Videos Consistently Improve the NMT Model.
As displayed in Table 4, on BIGVIDEO, our models equipped with videos obtain higher automatic scores, which indicates the benefit of using videos as additional inputs. Notably, our model trained with the additional contrastive learning objective yields better scores than the variant trained only with the cross-entropy loss, which signifies that the contrastive learning objective guides better use of the video inputs. Noticeably, compared to the text-only baseline, our model trained with the CTR objective achieves a larger gain on AMBIGUOUS than on UNAMBIGUOUS. This demonstrates that sentences in AMBIGUOUS are more difficult to translate correctly, and that taking videos as additional inputs helps the model generate better translations.
To better study the role of videos in translation, we introduce additional training data to build a stronger NMT baseline: the WMT19 Zh-En dataset with 20.4M parallel sentences for pretraining. We aim to answer: how will the model perform if more text data is included?
As displayed in Table 4, the model with video inputs outperforms the strong NMT baseline. Pretraining on a large corpus benefits models on BIGVIDEO. However, we find the improvements mainly come from UNAMBIGUOUS. This shows that videos play a more crucial role in AMBIGUOUS, which suggests that BIGVIDEO can serve as a valuable benchmark for studying the role of videos in MMT research. Videos Help Disambiguation. We further evaluate the model's ability to disambiguate, presenting results on the terminology-targeted metrics in Table 5. First, our systems with video features achieve consistent improvements on both the exact match and window overlap metrics compared to the text-only variant, indicating that models augmented with video inputs correctly translate more ambiguous words and place them in the proper contexts. It is also worth noticing that our system with pretraining achieves better scores than the strong text baseline, which further highlights the importance of video inputs. Moreover, ambiguous words remain hard to translate correctly: the best exact match score is 25.02%, which suggests that our AMBIGUOUS set is challenging.

Video-augmented Model Improves Translation Quality.
We further conduct a human evaluation to analyze translation quality. We randomly pick 100 sentences each from AMBIGUOUS and UNAMBIGUOUS and recruit three human judges. For each sentence, the judges read the source sentence and two candidate translations, one from TEXT-ONLY and one from our model + VIT + CTR. The judges are required to rate each candidate on a scale of 1 to 5 and pick the better one. Detailed guidelines are in Appendix G.
From Table 6, we can see that our system with video inputs is more frequently rated as the better translation than the text-only model on both the AMBIGUOUS and UNAMBIGUOUS test sets. This echoes the automatic evaluations and implies that taking videos as inputs improves translation quality. Moreover, the overall scores on UNAMBIGUOUS are better than those on AMBIGUOUS, which demonstrates that AMBIGUOUS is more challenging.

Incongruent Decoding
In this section, we explore whether visual inputs truly contribute to the translation model. Following Caglayan et al. (2019) and Li et al. (2022a), we perform incongruent decoding, replacing the paired video with a mismatched one for each sentence. As shown in Figure 6, on AMBIGUOUS and UNAMBIGUOUS, all automatic metrics of our system drop significantly with incongruent decoding, suggesting the effectiveness of leveraging videos as inputs. Interestingly, we also find that the drop in the BLEU and COMET scores is larger on AMBIGUOUS than on UNAMBIGUOUS, which further supports our point that videos are more crucial for disambiguation.

Results on Public Datasets
Next, we conduct experiments on the public datasets VATEX and HOW2. Results are displayed in Table 7. On HOW2, our best system achieves a higher BLEU score than the text-only model. However, the text-only model achieves the best COMET and BLEURT scores compared to all systems that take videos as inputs. On VATEX, our model with SLOWFAST features achieves the highest scores on all three evaluation metrics, compared to the text-only model and the comparison systems. Notably, the model with SLOWFAST features is significantly better than the models with VIT features, probably because VATEX focuses on human actions and the SlowFast model is trained on an action recognition dataset. However, the performance gap between TEXT-ONLY and our model + SLOWFAST + CTR is marginal. After introducing the 20M external MT data, we observe that TEXT-ONLY and our best system are comparable on the automatic metrics. Since the cross-modal encoder often requires large-scale paired videos and text to learn robust representations, our model does not achieve a large performance gain on VATEX and HOW2. We hope our BIGVIDEO dataset can serve as a complement to existing video-guided machine translation datasets.

Conclusion
In this paper, we present BIGVIDEO, a large-scale video subtitle translation dataset for multimodal machine translation. We collect 155 thousand videos accompanied by over 4.5 million bilingual subtitles. Specially, we annotate two test subsets: AMBIGUOUS, where videos are required for disambiguation, and UNAMBIGUOUS, where the text content is self-contained for translation. We also propose a cross-modal encoder enhanced with a contrastive learning objective to build cross-modal interaction for machine translation. Experimental results show that videos consistently improve the NMT model in terms of both translation evaluation metrics and terminology-targeted metrics. Moreover, human annotators prefer our system outputs over the strong text-only baseline. We hope our BIGVIDEO dataset can facilitate research on multimodal machine translation.

Limitations
BIGVIDEO is collected from two video platforms, Xigua and YouTube. All videos are publicly available. However, some videos may contain user information (e.g., portraits) or other sensitive information. Similar to VATEX and HOW2, we will release our test set annotations and the code to reproduce our dataset. For videos without copyright or sensitivity issues, we will make them public but limit their use to research and non-commercial purposes (dataset users will be required to apply for access). For videos with copyright or sensitivity risks, we will provide IDs that can be used to download the videos. This step will be done under the guidance of professional lawyers.
Though we show that our model with video inputs helps disambiguation, we find that it can still yield incorrect translations due to a lack of world knowledge. For example, the model cannot recognize the famous table tennis player Fan Zhengdong and thus fails to translate correctly. We find this is because video pretrained models are often trained on action datasets (e.g., Kinetics-600 (Long et al., 2020)) and hardly learn such world knowledge. In this work, we do not further study methods that leverage world knowledge.

Ethical Considerations
Collection of BIGVIDEO. We comply with the terms of use and copyright policies of all data sources during collection from the YouTube and Xigua platforms. User information and other sensitive information are not collected, to protect the privacy of video creators. The data sources are publicly available videos, and our preprocessing procedure does not involve privacy issues. For all annotation and human evaluation mentioned in the paper, we hire seven full-time professional translators in total and pay them market wages. All of our annotators are graduates. Potential Risks of BIGVIDEO and our model. While BIGVIDEO consists of high-quality parallel subtitles, we recognize that our data may still contain incorrect samples. Our model may likewise generate degraded or even improper content. As our dataset is based on YouTube and Xigua videos, models trained on it might be biased towards US or Chinese user perspectives, which could yield outputs that are harmful to certain populations.

A Complete Results
The complete results with standard deviations can be found in Table 8, Table 9, and Table 10.

B.1 Preprocessing
Subtitles are organized as a list of text chunks. Each chunk contains both English and Chinese lines and a corresponding timestamp. To obtain complete sentences, we process subtitles by merging chunks. Since English subtitles often carry strong punctuation marks, we greedily merge continuous segments (where the start time of the second segment is within 0.5 seconds of the end time of the first) until an end mark appears at the end of the segment. To preserve context, we then keep merging continuous sentences until a maximum time limit of 15 seconds is reached. Finally, we pair each merged segment with the video clip from the corresponding time interval.
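The merging procedure can be sketched as follows (simplified: this implements only the punctuation-based pass with the 0.5-second gap and 15-second cap, not the second context-preserving pass; the chunk layout and names are illustrative):

```python
def merge_chunks(chunks, gap=0.5, max_span=15.0, end_marks=".!?"):
    """Greedily merge subtitle chunks into full sentences: join
    consecutive chunks whose time gap is under `gap` seconds until a
    sentence-final mark is seen, capping each segment at `max_span`
    seconds. Each chunk is a (start, end, english_text) tuple."""
    segments, cur = [], None
    for start, end, text in chunks:
        if cur is None:
            cur = [start, end, text]
            continue
        close = start - cur[1] <= gap
        done = cur[2].rstrip().endswith(tuple(end_marks))
        if close and not done and end - cur[0] <= max_span:
            cur[1], cur[2] = end, cur[2] + " " + text
        else:
            segments.append(tuple(cur))
            cur = [start, end, text]
    if cur:
        segments.append(tuple(cur))
    return segments

chunks = [(0.0, 2.0, "That's a"), (2.2, 4.0, "good chip."),
          (4.1, 6.0, "Let's try another one.")]
print(merge_chunks(chunks))
```

Each output segment's time interval is then used to cut the paired video clip.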
English sentences from both YouTube and Xigua have an average fluency score of 4.6, which shows that the English subtitles are fluent and rarely contain errors. In terms of translation quality, subtitles collected from Xigua have an average translation quality score of 4.2, which indicates that most of the subtitle pairs are equivalent or near-equivalent. In the YouTube data, we find that about 20 percent of sentence pairs are not equivalent or contain major errors such as mistranslation or omission.
To remove low-quality pairs, we try three commonly used quality estimation scores: 1) the COMET score, 2) the Euclidean distance based on multilingual sentence embeddings (Artetxe and Schwenk, 2019), and 3) the round-trip translation BLEU score (Moon et al., 2020). We filter out a pair if more than one of its scores is lower than the corresponding threshold (set to 0.1, 4, and 20, respectively). On the annotated samples, the average translation quality reaches 4.1 after cleaning.
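The filtering rule can be sketched as follows (illustrative only: for simplicity we treat all three scores as higher-is-better, though the embedding-based score is a distance in the paper and would use a reversed comparison in practice):

```python
def keep_pair(scores, thresholds=(0.1, 4.0, 20.0)):
    """Filtering rule sketch: drop a sentence pair if more than one
    of its quality-estimation scores (COMET, embedding-based score,
    round-trip BLEU) falls below the corresponding threshold."""
    failures = sum(s < t for s, t in zip(scores, thresholds))
    return failures <= 1

print(keep_pair((0.45, 5.2, 31.0)))  # all three scores pass -> keep
print(keep_pair((0.05, 3.1, 25.0)))  # two scores fail -> drop
```

Requiring at least two failing scores before dropping a pair makes the filter robust to any single noisy quality estimator.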

B.2 Video Category Classifier Details
To construct a large-scale video-guided dataset, we collect videos from a variety of domains and categorize them into 15 classes based on their YouTube video categories. We use the official youtube-dl toolkit to retrieve video categories and other metadata from YouTube. To ensure consistency

C.1 Dataset
We additionally conduct experiments on two public video-guided translation datasets, VATEX (Wang et al., 2019) and HOW2 (Sanabria et al., 2018). The HOW2 dataset is a collection of instructional videos from YouTube. The corpus contains 184,948 English-Portuguese pairs for training, each associated with a video clip. We use val (2,022 pairs) as the validation set and dev5 (2,305 pairs) as the test set. The VATEX dataset is a video-and-language dataset containing over 41,250 unique videos. The released version of the bilingual collection includes 129,955 sentence pairs for training, 15,000 for validation, and 30,000 for testing. Since the test set is not publicly available, we split the original validation set into two halves for validation and testing. Some video clips of VATEX are no longer available on YouTube. After their removal, the corpus we use contains 115,480 sentence pairs.

C.2 Training and Implementation Details
More training details can be found in Table 12. For pretraining on the WMT19 Zh-En dataset, we use the same training parameters as on BIGVIDEO and train the model for 300k steps.

D The Choice of Hyper-parameters
Temperature for the contrastive learning objective. Performances with different temperatures are presented in Figure 7. Here we fix the weight of the contrastive learning objective to 1. On the validation set, there is no significant difference in BLEU scores among the temperature choices. For better translation performance, a small temperature is more suitable. Weight for the contrastive learning objective. We fix τ = 0.002 and adjust the weight from 0.5 to 1.5. We observe that the contrastive learning objective with varying weights benefits the model to different degrees; 1.0 is the most suitable weight for our system. Length of Video Frames. To investigate how the number of video frames affects translation, we vary the number of sampled video frames in {1, 12, 36}. Figure 9 depicts the results. Here the video features are 2D features extracted by ViT. We observe that when only one video frame is sampled, the video degrades into a single image and its positive impact on the system is reduced. A maximum of 12 video frames achieves the best performance.

E Case Study
We additionally present two cases in the appendix. In Figure 10, the phrase "drive shot" is better translated by our system through understanding the meaning of "shot". In Figure 11, both the text-only baseline and our system fail to correctly translate the source title. The objects in the video are cards from Duel Monsters, which require world knowledge to understand, so the source title is challenging for both the text-only baseline and our system.

F Annotation Guidelines
We hire seven full-time annotators who are fluent in both Chinese and English. They are recruited to annotate translation data or conduct human evaluations. The annotators are shown one English subtitle and the corresponding Chinese subtitle of a given video clip. After watching the video and reading the subtitles, they decide whether the video is related to the subtitles; if not, the sample is discarded. The annotators then rate three aspects:
• Fluency Score (1-5, where 1 is the worst and 5 is the best): If the audio is in English, the annotators check whether the English subtitle is the transcript of the audio. If the audio is not in English, they rate whether the sentence is grammatically correct.
• Translation Quality (1-5, where 1 is the worst and 5 is the best): Whether the Chinese subtitle is equivalent in meaning to the English subtitle.
• Ambiguous (0/1): The annotators decide whether the video information is required to complete the translation. "1" means the video information is required, and "0" otherwise.

G Human Evaluation Guidelines
We hire three annotators to conduct the human evaluation. Each annotator is required to rate 100 samples from AMBIGUOUS and 100 samples from UNAMBIGUOUS on translation quality and rank

Figure 1: An example with semantic ambiguity in BIGVIDEO. The phrases with semantic ambiguity are highlighted in red. The wrong translations are in blue and the correct translations are in yellow.

Figure 3: Category distribution on BIGVIDEO, VATEX and HOW2. BIGVIDEO covers a wide range of domains.

Figure 4: An example from our AMBIGUOUS test set. The ambiguous term "chip" is in red.

Figure 5: An illustration of our machine translation system. An example of our contrastive learning is presented in the blue box.

Figure 7: BLEU scores on BIGVIDEO validation, test and AMBIGUOUS sets. The x-axis is the choice of temperature for the contrastive learning objective.
Figure 8:

Figure 9: BLEU scores on BIGVIDEO validation, test and AMBIGUOUS sets. The x-axis is the number of video frames used by our system.

Figure 11: A case. The phrases with semantic ambiguity are highlighted in red. The wrong translations are in blue and the correct translations are in yellow.

Table 1: Statistics of BIGVIDEO and existing video-guided machine translation datasets. The size of BIGVIDEO surpasses that of the largest available datasets by one order of magnitude.

Table 3: Statistics of our test sets. We report the number of samples, the average length, and the number of ambiguous terms in the two test sets.

Table 4: sacreBLEU (%), COMET (%) and BLEURT (%) scores on the BIGVIDEO test set. We report results on AMBIGUOUS (Amb.), UNAMBIGUOUS (Unamb.) and the whole test set (All). "+ CTR" denotes our cross-modal framework with the contrastive learning objective. All results are mean values over five different random seeds. Complete results with standard deviations can be seen in Appendix A. The best result in each group is in bold. The best result in each column is in red.
We use incongruent decoding to probe the need for the visual modality on BIGVIDEO. During inference, we replace the original video with a mismatched video.
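The video-swapping step can be sketched as a derangement of the batch's video features, so that no sample keeps its own video. This is a hypothetical helper, not our actual evaluation code; it assumes features are swapped at the batch level and that the batch size is at least 2:

```python
import numpy as np

def incongruent_features(video_feats, seed=0):
    """Return the batch's video features permuted so that no sample is
    paired with its own video (requires len(video_feats) >= 2)."""
    rng = np.random.default_rng(seed)
    n = len(video_feats)
    perm = np.arange(n)
    while np.any(perm == np.arange(n)):  # resample until a derangement
        perm = rng.permutation(n)
    return [video_feats[i] for i in perm]
```

Decoding with these mismatched features and comparing the scores against congruent decoding indicates how much the model actually relies on the visual input.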

Table 7: Experimental results on HOW2 and VATEX. Complete results with standard deviations can be seen in Appendix A.

Table 11: Video category tags of our test set. For most of the YouTube videos, we obtain the category tags directly from YouTube. For the remaining YouTube videos and all of the Xigua videos, we train a classifier and predict the category tags from the subtitles.