VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training methods are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g. by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g. unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.


Introduction
We study the challenge of achieving task-agnostic pre-training for multimodal video understanding, building on recent unimodal approaches such as pre-trained language models for text (Peters et al., 2018; Devlin et al., 2019). Although certain language models are nearly task-agnostic on NLP tasks (Devlin et al., 2019; Lewis et al., 2020), being task-agnostic on multi-modal tasks is more challenging due to cross-modal tasks such as text-video retrieval. Existing video-and-language pre-training approaches are task-specific: they adopt either (1) a cross-modal single encoder (Sun et al., 2019b,a; Zhu and Yang, 2020) favoring tasks that require cross-modal reasoning (e.g. video captioning), or (2) multiple unimodal encoders/decoders (Miech et al., 2019, 2020; Li et al., 2020b; Luo et al., 2020; Korbar et al., 2020) combining specific tasks that require separately embedding each modality (e.g. video retrieval). We instead show that it is possible to pre-train a task-agnostic model, the video-language model (VLM), that can accept text, video, or both as input.

Figure 1: Existing models (upper figure) adopt complex architectures and multiple task-specific training stages to merge two streams of data and cover a wide range of downstream tasks (such as retrieval or text generation). Our video-language model (VLM) (lower figure) uses a single BERT encoder for task-agnostic pre-training (e.g. only masking tokens, no matching or alignment for specific end tasks) in a joint feature space, while still covering a wide range of tasks (see Figure 3).
As shown in Figure 1, this task-agnostic single-encoder approach has several advantages: (1) it reduces the complexity of pre-training with multiple losses and models (e.g. Luo et al. (2020)); (2) it makes fewer assumptions about being close to end tasks than retrieval-based pre-training (Miech et al., 2020) and is as general as classic LMs; (3) it encourages feature sharing among modalities when both are present, without sacrificing separability; and (4) it is more parameter-efficient (see Section 5; we achieve strong performance with BERT BASE-sized models). Table 1 summarizes the design choices of recent models.
Our encoder is a transformer block that combines the existing masked frame model and masked language model (MFM-MLM) (Sun et al., 2019a; Li et al., 2020b; Luo et al., 2020) with two new methods to improve the learning of multi-modal fusion. First, we introduce a masking scheme called the masked modality model (MMM) that randomly masks a whole modality for a portion of training examples (the remaining examples go through traditional MFM-MLM), thereby forcing the encoder to use the tokens from the other modality to produce tokens for the masked modality. Second, we introduce a single masked token loss to replace the two separate losses on video and text in MFM-MLM. The masked token loss uses the embeddings of both video and text tokens to learn joint hidden states for the encoder.
We also show it is possible to fine-tune a single encoder for a wide range of tasks by using task-specific attention masks. Experiments demonstrate that it performs well on a wider range of tasks than previous models, including outperforming task-specific pre-training baselines with unimodal encoders of similar hyper-parameters by more than 2% on retrieval tasks and 1% on video captioning. Note that these results are achieved with a much smaller model than previous approaches, further demonstrating the improved fusion and sharing across modalities.
In summary, the main contributions of this paper are as follows: (1) we propose to pre-train a task-agnostic encoder for video understanding; (2) we introduce the masked modality model (MMM) and the masked token loss for cross-modal fusion during pre-training without sacrificing separability; (3) experimental results show that the proposed simple baseline achieves competitive performance with significantly fewer parameters.

Related Work
Numerous multimodal task-specific pre-training models have been proposed for downstream visual-linguistic tasks. In video and text pre-training, existing research adopts different design choices regarding proxy tasks and neural architectures for end tasks (Luo et al., 2020).
Although the shared single-encoder approach is simple, it limits the types of downstream tasks to those that input both modalities simultaneously. For example, VideoBERT (Sun et al., 2019b) may not be able to perform joint retrieval tasks and adds another decoder for video captioning during fine-tuning. ActBERT (Zhu and Yang, 2020) uses the [CLS] token for pairwise metric-learning-based retrieval (an easier problem, but one that requires a quadratic number of examples and is 50 times slower, as reported in Luo et al. (2020)).
Meanwhile, many existing approaches adopt or add task-specific pre-training to accommodate retrieval and video captioning tasks (e.g. two-stream encoders for video and text separately, plus text decoders). For example, several works (Miech et al., 2019, 2020; Rouditchenko et al., 2020; Ging et al., 2020; Gabeur et al., 2020) adopt a retrieval task for pre-training. CBT (Sun et al., 2019a), HERO (Li et al., 2020b), VideoAsMT (Korbar et al., 2020) and UniVL (Luo et al., 2020) adopt multi-task learning (MTL) to learn retrieval tasks on video and text encoders. HERO (Li et al., 2020b) and UniVL (Luo et al., 2020) adopt an additional cross-encoder to further learn the fusion of the two modalities. UniVL (Luo et al., 2020) and VideoAsMT (Korbar et al., 2020) add another text decoder for video captioning. Compared with the single-stream input of the shared-encoder approach, two-stream encoders typically come with a complex architecture and proxy tasks to cover more end tasks. To the best of our knowledge, none of the existing works target task-agnostic pre-training.

Image-Text Pre-training
ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019) adopt two transformers for image and text encoding separately. VisualBERT (Li et al., 2019), Unicoder-VL (Li et al., 2020a), VL-BERT (Su et al., 2020), UNITER (Chen et al., 2020), and Unified VLP (Zhou et al., 2020) use one shared BERT model. These models employ MLM and pairwise image-text matching as pre-training tasks, which are effective for downstream multimodal tasks. Our fine-tuning for video captioning is inspired by Unified VLP (Zhou et al., 2020), which adopts attention masks and the language model heads of BERT for image captioning.

Video-Text Pre-training
VideoBERT (Sun et al., 2019b) and CBT (Sun et al., 2019a) are the first works to explore the capability of pre-training for video-text. Although VideoBERT and CBT pre-train on multimodal data, their downstream tasks mainly take the video representation for further prediction. ActBERT (Zhu and Yang, 2020) is a weakly-supervised pre-training method. It leverages global action information to catalyze mutual interactions between linguistic texts and local regional objects, and introduces a transformer block to encode global actions, local regional objects, and linguistic descriptions. HERO (Li et al., 2020b) encodes multimodal inputs in a hierarchical fashion; two new pre-training tasks, video-subtitle matching and frame order modeling, are designed to improve representation learning. VideoAsMT (Korbar et al., 2020) and UniVL (Luo et al., 2020) further adopt a BART-style (Lewis et al., 2020) text generation task for downstream tasks such as video captioning, and UniVL adopts an EnhancedV training stage that masks all text tokens for better learning of generation.

Pre-training
As a reminder, our goal is to train a task-agnostic model for various tasks in video-text understanding. This section introduces task-agnostic proxies for pre-training. We first describe two masking schemes as a baseline: the masked frame model (MFM) for video frames and the masked language model (MLM) for text tokens (Sun et al., 2019a; Li et al., 2020b; Luo et al., 2020). Then we introduce the masked modality model (MMM), which encourages learning the representations of one modality from the other. Lastly, we introduce the masked token loss, which unifies the losses on masked video and text tokens into a single loss function.

Vector Quantization and BERT
Assume we have a clip (v, t) sampled from a video, where v and t correspond to the video modality and text modality, respectively. Since videos are signals in continuous space, we first extract token embeddings from raw videos. We decode v into frames and then feed them into a (frozen) video encoder Encoder_video(·) and a trainable MLP layer to obtain video tokens:

x_v = MLP(Encoder_video(f_v)),

where we use a bolded symbol to indicate a sequence and f_v is a sequence of continuous frames from a video. We use S3D (Xie et al., 2018; Miech et al., 2020), which is pre-trained via self-supervised learning on the Howto100M dataset. The MLP layer allows the hidden size of video tokens to match BERT's hidden size d: x_v ∈ R^d. Similarly, vectors for text tokens x_t are obtained via embedding lookup as in BERT.
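As a concrete illustration, the video-token pipeline can be sketched in PyTorch as follows (a minimal sketch assuming precomputed, frozen S3D clip features of dimension 512; the class name `VideoTokenizer` is ours, not from the released code):

```python
import torch
import torch.nn as nn

class VideoTokenizer(nn.Module):
    """Project frozen S3D clip features (512-d) into BERT's hidden space (768-d)."""
    def __init__(self, s3d_dim=512, bert_dim=768):
        super().__init__()
        # The only layer that requires fresh initialization; the encoder itself
        # can be loaded from any existing pre-trained LM checkpoint.
        self.mlp = nn.Linear(s3d_dim, bert_dim)

    def forward(self, s3d_features):
        # s3d_features: (batch, num_seconds, 512), one feature per second of video
        return self.mlp(s3d_features)  # (batch, num_seconds, 768)
```

Calling the module on a `(2, 32, 512)` feature tensor yields a `(2, 32, 768)` tensor of video tokens, ready to be concatenated with BERT's text embeddings.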
To simplify multi-modal pre-training, we adopt a single BERT transformer with minimal changes. We first concatenate video tokens x_v and text tokens x_t via the [SEP] token so that video and text each belong to one segment of BERT:

x = [CLS] x_v [SEP] x_t [SEP].

We further mask x as x_masked (detailed in the next subsection) and feed the whole sequence into BERT:

h = BERT(x_masked),

where h indicates the hidden states of the last layer of BERT. To encourage learning video/text hidden states in a shared space for the masked token loss (introduced in Section 3.3), we use a shared head to predict video/text token embeddings via a linear projection layer:

e = W h + b,

where e ∈ R^d, and W and b are the weights from the prediction heads of BERT. In this way, our model learns a joint embedding space for both video and text tokens from the inputs to the outputs of BERT. This allows pre-training a single encoder directly from any existing LM, and the only layer that requires initialization is the MLP layer.

MFM-MLM
Inspired by (Sun et al., 2019a; Li et al., 2020b; Luo et al., 2020), we adopt the masked frame model (MFM) for videos and the masked language model (MLM) for text as a baseline. Note that unlike LMs, which typically come with a fixed vocabulary and a special [MASK] token, video tokens are innumerable in the continuous space; we mask a video token by setting it to all zeros and ask the encoder to recover it via noise contrastive estimation (NCE):

L_MFM = - Σ_{s ∈ V_m} log NCE(x_s | x_masked; V̂),

where V_m indexes the masked video tokens, and

NCE(x_s | x_masked; V̂) = exp(e_s · x_s) / Σ_{j ∈ V̂} exp(e_s · x_j),

where V̂ indicates all non-masked video tokens within the same batch. The final loss is the sum of MFM and MLM:

L_MFM-MLM = L_MFM + L_MLM,

where L_MLM is the same as in BERT and we omit its details for brevity. We experiment with this classic baseline in Section 5.

MMM and Masked Token Loss
Masked Modality Model We introduce the masked modality model (MMM), which masks out either all video or all text tokens for a given video-text clip. This masking scheme complements MFM-MLM (in our experiments, 50% of training examples are masked with MMM and the remaining 50% with MFM-MLM). It encourages the encoder to use tokens from one modality to recover the tokens of the other, resolving the issue that an encoder may predict a masked token from nearby tokens of the same modality simply because they are closer. As shown in the lower two sub-figures of Figure 2, we mask the whole video or text modality so that it must be "generated" from the other modality. Our experiments indicate that this is critical for pre-training a single encoder for retrieval tasks.
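The per-example choice between MMM and the MFM-MLM fallback can be sketched as follows (our own illustrative helper; `MASK_ID` stands in for BERT's [MASK] id, and the probabilities mirror the 50/25/25 split used in our experiments):

```python
import random
import torch

MASK_ID = 103  # placeholder for BERT's [MASK] token id

def mask_example(video_tokens, text_ids, p_mmm=0.5, p_token=0.15):
    """Choose a masking scheme per training example.

    50% of examples use MMM (mask one whole modality: 25% video, 25% text);
    the other 50% fall back to token-level MFM-MLM with 15% masking.
    Returns masked inputs plus boolean masks marking prediction targets.
    """
    v = video_tokens.clone()
    t = text_ids.clone()
    r = random.random()
    if r < p_mmm / 2:                       # MMM: mask the whole video modality
        v_mask = torch.ones(v.shape[0], dtype=torch.bool)
        t_mask = torch.zeros(t.shape[0], dtype=torch.bool)
    elif r < p_mmm:                         # MMM: mask the whole text modality
        v_mask = torch.zeros(v.shape[0], dtype=torch.bool)
        t_mask = torch.ones(t.shape[0], dtype=torch.bool)
    else:                                   # traditional MFM-MLM
        v_mask = torch.rand(v.shape[0]) < p_token
        t_mask = torch.rand(t.shape[0]) < p_token
    v[v_mask] = 0.0                         # masked video tokens become all-zeros
    t[t_mask] = MASK_ID                     # masked text tokens become [MASK]
    return v, t, v_mask, t_mask
```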
Masked Token Loss We further introduce the masked token loss, which unifies the loss functions of MFM and MLM. This loss encourages learning a joint token embedding space for video and text, where both types of tokens contribute to the prediction of a masked (video or text) token. It also increases the number of contrasted negative embeddings compared to the two separate losses of MFM and MLM. We define the masked token loss L_VLM as:

L_VLM = - Σ_{s ∈ M} log NCE(x_s | x_masked; V ∪ D\s),

where M indexes the masked tokens, D is the set of word embeddings over the vocabulary of BERT, and D\s excludes token s (if s is a text token). Further, NCE(x_s | x_masked; V ∪ D\s) is defined as:

NCE(x_s | x_masked; V ∪ D\s) = exp(e_s · x_s) / (exp(e_s · x_s) + Σ_{j ∈ V ∪ D\s} exp(e_s · x_j)). (9)

Note that j ∈ V ∪ D\s can be either a video or text token: the predicted embedding e_s must be close to the ground-truth token embedding (either a video token or a word embedding) and away from the other embeddings of video/text tokens. We perform an ablation study in Section 5 to show that L_VLM works better than L_MFM-MLM.
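A minimal sketch of the unified loss (dot-product scores; for brevity the positive term is added to the denominator explicitly, and any duplicate of the target inside the candidate bank, i.e. the D\s bookkeeping, is ignored):

```python
import torch

def masked_token_loss(pred, targets, video_bank, word_embeddings):
    """Unified NCE over a shared candidate set: in-batch video tokens
    plus BERT's word-embedding table.

    pred:            (M, d) predicted embeddings e_s at masked positions
    targets:         (M, d) ground-truth token embeddings x_s
    video_bank:      (Nv, d) non-masked video token embeddings in the batch
    word_embeddings: (Nw, d) word-embedding table of BERT
    """
    candidates = torch.cat([video_bank, word_embeddings], dim=0)  # V ∪ D
    neg_logits = pred @ candidates.t()                  # (M, Nv + Nw)
    pos_logit = (pred * targets).sum(-1, keepdim=True)  # score of the true token
    # -log softmax of the positive against positive + all candidates
    denom = torch.logsumexp(torch.cat([pos_logit, neg_logits], dim=1), dim=1)
    return (denom - pos_logit.squeeze(1)).mean()
```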

Fine-tuning
In this section, we describe how to use different types of attention masks to fine-tune VLM for a variety of tasks, as shown in Figure 3.

Text-Video Retrieval
One major challenge of pre-training a single encoder is how to adapt the model to joint-space retrieval without unimodal encoders and task-specific contrastive pre-training (as in Howto100M (Miech et al., 2019, 2020)). The main reason is that many existing models encode text and video tokens together via self-attention, so one cannot obtain hidden states for text/video alone.
To resolve this, we apply an isolated attention mask with two square masks placed diagonally, as shown in the lower sub-figure of the first box in Figure 3. These two squares prevent video and text tokens from attending to each other, while still allowing video and text tokens to use the same self-attention layers for learning representations in the same feature space. Further, note that the first and second [SEP] tokens of BERT are used by video and text, respectively, to learn sequence-level features (Clark et al., 2019). The [CLS] token is disabled since no features need to be learned across video and text. After the forward pass, the hidden states of video and text tokens are average-pooled separately. We then use a contrastive loss on text-video similarity to discriminate the ground-truth video clip from other video clips in the same batch for a given text clip. During evaluation, to ensure video and text are isolated (to avoid leaking the ground truth of a similar pair), we split text and video and forward them separately. We report an ablation study in Section 5 showing that the MMM introduced in the previous section is crucial for the pre-trained hidden states (for video or text) to be a good initialization for retrieval tasks.
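The isolated attention mask is simply two diagonal blocks; a sketch (boolean convention with True meaning attention is allowed; hooking the mask into a particular transformer implementation is omitted):

```python
import torch

def isolated_attention_mask(num_video, num_text):
    """Block-diagonal mask: video tokens (with their [SEP]) attend only to
    video, text tokens only to text, while both modalities share the same
    self-attention weights of the encoder."""
    n = num_video + num_text
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_video, :num_video] = True   # video-to-video block
    mask[num_video:, num_video:] = True   # text-to-text block
    return mask
```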

Action Segmentation
Action segmentation assigns each frame of a video one of a set of pre-defined labels. This is similar to the named entity recognition (NER) task in NLP, but on video frames. We feed VLM the whole video, a dummy text token, and an isolated attention mask. We then add a classification head (over the pre-defined labels) on top of the hidden state of each video token in the last layer of VLM.

Action Step Localization
In action step localization, each video belongs to a task with multiple steps, where each step is described by a short text, and each frame of the video needs to be aligned with a step in text form. The challenge of applying BERT to action step localization is similar to text-video retrieval: video frames need to be aligned with textual steps in a joint space, and pairwise video/text matching is impractical because the number of frame/text pairs is large.
Similar to the text-video retrieval model, we apply isolated attention masks to video and text. The major difference is that we pass video and text separately through BERT. This is because a video can be several minutes long (more than 100 tokens) while the number of text labels for each video is fixed (e.g. under 10). To keep the format of BERT consistent for multi-modal inputs, we add a dummy text token when forwarding video and a dummy video token when forwarding text. For a given frame (video token), we compute the distribution of that frame over textual steps via dot products and the softmax function.
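The frame-to-step alignment then reduces to dot products followed by a softmax over the steps; sketched:

```python
import torch
import torch.nn.functional as F

def step_distribution(frame_hidden, step_hidden):
    """Distribution of each video frame over textual steps.

    frame_hidden: (num_frames, d) hidden states of video tokens
    step_hidden:  (num_steps, d) pooled hidden states of step descriptions
    """
    return F.softmax(frame_hidden @ step_hidden.t(), dim=-1)  # (num_frames, num_steps)
```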

Multiple-choice VideoQA
Multiple-choice VideoQA (Yu et al., 2018) aligns each video with one of several candidate text answers. The major difference between action step localization and multiple-choice VideoQA is that the video hidden state is at the sequence level rather than the frame level. We apply isolated attention masks to BERT and forward the video and each text answer (with dummy tokens), respectively. The answer with the maximum similarity to the video is then reported. During fine-tuning, we apply a contrastive loss on video-text similarity to rank the answers.

Video Captioning
Another big challenge of using a single encoder is how to support generative tasks (such as video captioning) without pre-training an explicit decoder. We observe that a transformer decoder (Vaswani et al., 2017) has two major differences from an encoder: (1) an auto-regressive loss that does not allow a text token to see future tokens; (2) a prediction head to generate text. To resolve (1), one can fine-tune the text segment of VLM with an auto-regressive loss by passing in shifted tokens and a lower-triangular attention mask for the text segment, as shown in Figure 3. To resolve (2), inspired by Rothe et al. (2020), which uses BERT as a decoder, one can re-use the language model heads as prediction heads for generation. Note that this setting requires less architectural machinery than a standard transformer decoder (e.g. no explicit separate self-attention on text or cross-attention on video). The implicit text decoder inside BERT shares self-attention with the video encoder, saving on the total number of parameters.
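One plausible form of the captioning mask is sketched below (an illustrative assumption: video attends bidirectionally within video only, while each text token attends to all video tokens plus its own past; the exact handling of special tokens in Figure 3 is omitted):

```python
import torch

def captioning_attention_mask(num_video, num_text):
    """Video tokens attend bidirectionally to video; each text token attends
    to all video tokens and only to earlier (and current) text tokens,
    turning the text segment of the encoder into an implicit decoder."""
    n = num_video + num_text
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_video, :num_video] = True      # video <-> video, bidirectional
    mask[num_video:, :num_video] = True      # text -> video (acts as cross-attention)
    causal = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))
    mask[num_video:, num_video:] = causal    # text -> past text only
    return mask
```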

Pre-training
We adopt the Howto100M dataset (Miech et al., 2019) for pre-training, which contains instructional videos from YouTube found by searching for keywords from wikiHow (www.wikihow.com). After filtering out unavailable videos, we obtain 1.1M videos. We hold out 4,000 videos as the validation set and use the rest for pre-training. On average, each video is about 6.5 minutes long with 110 clip-text pairs. After removing repeated texts within overlapping clips from ASR, we obtain over 7.7 GB of caption text, with 2.4 tokens per second on average.

Fine-tuning
MSR-VTT (Xu et al., 2016) is a popular dataset for text-video retrieval and VideoQA. It has open-domain video clips, and each training clip has 20 captioning sentences labeled by humans. There are 200K clip-text pairs from 10K videos in 20 categories, including sports, music, etc. Following JSFusion (Yu et al., 2018; Miech et al., 2019), we randomly sample 1,000 clip-text pairs as test data. We further use the QA test data (Yu et al., 2018) as the dataset for multiple-choice VideoQA.

Youcook2 (Zhou et al., 2017) contains 2,000 cooking videos on 89 recipes with 14K video clips from YouTube. The overall duration is 176 hours (5.26 minutes on average). Each video clip is annotated with one captioning sentence. Following the split setting in (Miech et al., 2019), we evaluate both the text-based video retrieval and multimodal video captioning tasks. We filter the data to ensure there is no overlap between pre-training and evaluation data. After filtering out unavailable videos, we have 9,473 training clip-text pairs from 1,222 videos and 3,305 test clip-text pairs from 430 videos.

COIN (Tang et al., 2019) is used to evaluate action segmentation. It has 11,827 videos (476 hours); each video is labeled with 3.91 step segments on average, for 46,354 segments in total. There are 778 step labels, plus one background (Outside) label. Since one video can last for several minutes, much longer than the maximum length of the video segment of VLM, we apply a sliding window with step size 16 and window size 32. During inference, we average the logits of overlapping frames from multiple windows.

CrossTask (Zhukov et al., 2019) is a dataset for action localization that contains 83 different tasks and 4.7k videos. Each task has a set of steps with text descriptions annotated on temporal frames of the video. We use the test split produced by the official code, which contains 1,690 annotated videos. The remaining 540 annotated videos are used for weakly supervised training.
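The COIN sliding-window inference described above can be sketched as follows (`forward_fn` is a hypothetical hook returning the model's per-frame logits for one window):

```python
import torch

def sliding_window_logits(forward_fn, video_len, num_labels, step=16, window=32):
    """Slide a 32-frame window with stride 16 over a long video and
    average the logits of frames covered by multiple windows."""
    logits_sum = torch.zeros(video_len, num_labels)
    counts = torch.zeros(video_len, 1)
    for start in range(0, video_len, step):
        end = min(start + window, video_len)
        logits_sum[start:end] += forward_fn(start, end)  # (end - start, num_labels)
        counts[start:end] += 1
    # since step <= window, every frame is covered by at least one window
    return logits_sum / counts
```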

Hyper-parameters
We extract video tokens from video frames using the S3D encoder pre-trained by (Miech et al., 2020). Videos are decoded at 30 fps, and we extract one video token per second with dimension 512. We apply an MLP to transform these 512 dimensions to the hidden size (768) of BERT BASE.
Within each input sequence, a fixed budget of tokens is allocated to video and the rest to text and special tokens. Recall that text runs at 2.4 tokens per second while video runs at 1 token per second. We form a text clip with a random length between 8 and 64 text tokens and collect the corresponding video clip to form a training example. We randomly sample 32 video/text clip pairs from each video and use 8 videos to form a batch of size 256. Each training example has a 50% chance of MMM (25% for whole-video masking and 25% for whole-text masking) and a 50% chance of MFM-MLM (with a 15% probability of video and text token masking). We pre-train VLM on 8 NVIDIA Tesla V100 GPUs (each with 32 GB of memory) for 15 epochs using fp16, which takes one day. Following (Liu et al., 2019), we choose the Adam (Kingma and Ba, 2014) optimizer with an initial learning rate of 5e-5 (betas (0.9, 0.98)), 1,000 warm-up steps, and a polynomial-decay learning rate scheduler. Gradients are clipped at 2.0. All fine-tuning tasks use the same hyper-parameters as pre-training, except that the number of warm-up steps is 122.
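The optimization recipe might be sketched as follows (the total step count is illustrative, and since the decay power is not specified above, a linear polynomial decay is assumed):

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the VLM encoder
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.98))

total_steps, warmup_steps = 10_000, 1_000  # total_steps is illustrative

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps  # linear warm-up from 0 to the peak lr
    # polynomial decay (power 1, i.e. linear) down to zero at total_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def train_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # clip gradients at 2.0 before each update
    torch.nn.utils.clip_grad_norm_(model.parameters(), 2.0)
    optimizer.step()
    scheduler.step()
```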

Model Comparison
We first investigate the design choices of VLM compared to other transformer-based multimodal pre-training baselines. As shown in Table 1, we collect the training paradigms, model sizes, etc., of these models (estimated from their source code or papers). VLM is significantly smaller than the other models since it is just a BERT BASE (uncased) model, while it is still fully self-supervised and task-agnostic (e.g. no training on retrieval or auto-regressive style tasks) and supports both joint retrieval and text generation.
Text-video Retrieval We use MSR-VTT and Youcook2 to evaluate performance on text-video retrieval; the results are shown in Tables 2 and 3, respectively. VLM achieves good performance on both datasets, indicating that MMM and the isolated self-attention mask can be used together for joint retrieval. The ablation study shows that using an isolated self-attention mask alone does not yield good performance, indicating that MMM is important for learning features for alignment. Note that our pre-training is task-agnostic yet still outperforms baselines with retrieval-style pre-training.

Action Segmentation We report the results of action segmentation on the COIN dataset in Table 4.

Average recall on CrossTask action step localization:

Methods                                    Average Recall
Joint Alignment
  Alayrac (Alayrac et al., 2016)           13.3
  Zhukov                                   22.4
  Supervised                               31.6
  HowTo100M (Miech et al., 2019)           33.6
  MIL-NCE (Miech et al., 2020)             40.5
  UniVL (Luo et al., 2020)                 42.0
Pairwise Matching
  ActBERT (Zhu and Yang, 2020)             41.4
VLM (task-agnostic, zero-shot)             28.5
VLM (supervised on 540 videos)             46.5

VLM outperforms the other baselines, indicating good token-level video representations. Note that this task only tests the hidden states of the video, showing that the unimodal encoding capability of VLM is not compromised.

Action Step Localization We set up two evaluations on the CrossTask dataset. First, we evaluate the zero-shot transfer of VLM. Note that existing studies evaluate CrossTask with retrieval/alignment-style pre-training, where the aligned hidden states are directly used for action step localization. Our task-agnostic pre-training poses an even harder problem: applying hidden states learned from proxy tasks to video frame/text alignment without explicitly training on alignment. We simply take the hidden states from the last layer of VLM as the video/text representations and directly compute the similarities between video frames and text descriptions. Surprisingly, the performance is better than some baselines and close to one supervised method. This indicates that the masked token loss together with MMM can learn certain video-text alignments in a joint space. Second, using just 540 videos for weakly supervised training, we obtain a much better result.

Video Question Answering We use MSR-VTT QA to evaluate multiple-choice question answering; recall that this task essentially tests video-text similarity. Accuracy on MSR-VTT multiple-choice VideoQA:

Method                                     Accuracy
Joint Retrieval
  JSFusion (Yu et al., 2018)               83.4
Pairwise Matching
  ActBERT (Zhu and Yang, 2020)             85.7
VLM                                        91.64

The performance of VLM is better than ActBERT, which leverages pairwise matching for each video/answer pair.

Video Captioning We lastly evaluate VLM on video captioning with the autoregressive attention mask against baselines that have an explicit text decoder. Video captioning results on Youcook2 (Table 7; B-3 / B-4 / METEOR / ROUGE-L / CIDEr):

VideoBERT (Sun et al., 2019b)              6.80   4.04   11.01  27.50  0.49
CBT (Sun et al., 2019a)                    -      5.12   12.97  30.44  0.64
ActBERT (Zhu and Yang, 2020)               8.66   5.41   13.30  30.56  0.65
Coot (Ging et al., 2020)                   17.62  11.09  19.34  37.63  -
w/ Pre-trained Decoder
  VideoAsMT (Korbar et al., 2020)          -      5.3    13.4   -      -
  UniVL (Luo et al., 2020)                 16

As shown in Table 7, our "compact" decoder using BERT's LM heads is surprisingly good at video captioning compared to other fine-tuning baselines with external decoders (e.g. Coot). This indicates that it is possible to remove the explicit decoder and share weights between video and text tokens.

Ablation Study
We use Youcook2 as the base task for the ablation study on text-video retrieval and video captioning. We study (1) the effect of MMM and (2) the minimum length of text clips. The results are shown in Table 8 and Table 9.
Effects of MMM Without MMM (MMM 0%, i.e. MFM-MLM 100%), performance drops significantly on both the retrieval and captioning tasks. This indicates that a naive adoption of traditional MFM-MLM masking may not learn joint video/text representations well; we suspect a masked token is more likely to be predicted from tokens of the same modality. We further try MMM with different probabilities (30% and 70%); 50% works best.

Minimum Length of Texts The length of a clip can be important for retrieval tasks (Miech et al., 2020). We ran VLM on longer video/text pairs (at least 16 text tokens). Performance drops slightly, indicating that pre-training on longer clips may not cover fine-tuning tasks with short clips.

Error Analysis
Text-video retrieval. We use MSR-VTT as the dataset for error analysis on text-video retrieval, as shown in Table 10 of the Appendix. We pair the query text with the text of the top-1 ranked video to present 100 ranking errors, since video tokens are hard to display. We observe the following types of errors in video understanding: (1) objects are sometimes hard to recognize, such as dogs vs. cats; (2) attributes of objects may be hard to match to the text, e.g. gender, age, etc.; (3) subtle differences between actions; (4) specific videos for a general query or vice versa, e.g. people vs. basketball players. We believe the last type may not be true errors but rather cases that are hard for existing annotations or evaluations to separate.

Video Captioning. We further examine the text generated by video captioning. Note that our video captioning has no support from ASR or transcripts, so the video is the only source for generating text content, and errors in video understanding are easily reflected in the text. From Table 11 of the Appendix, we notice that one major type of error comes from objects of similar shapes and colors, e.g. onion rings vs. shrimp.

Visualization
We observe that video tokens take up the majority of the space while text tokens are clustered together. This is probably because videos of the physical world are more diverse and sparse than text from a fixed vocabulary. We also plot the self-attention of VLM layers within and between modalities, as in Figure 4 of the Appendix. We observe the following patterns across all 144 attention heads:
• Unlike LMs, there are no recurrent (shifted) position-wise patterns for video tokens;
• Self-attentions in the 1st layer are more diverse than in later layers, suggesting that existing video encoders might be too deep for transformers;
• Some attention heads show patterns of cross-modal mapping between video and text (e.g. sub-figure (a));
• Word-level cross-modal co-reference: video tokens showing pouring soy sauce refer to the text token "soy" (e.g. sub-figure (b)).

Conclusions
We presented a task-agnostic pre-training approach with new masking schemes that enable training a single masked language model that accepts either video or text input, or both. We showed that this simple VLM model can be effectively tuned for a broad range of downstream tasks, such as text-video retrieval and video captioning, via different types of attention masks. Experimental results show that the proposed methods maintain competitive performance while requiring significantly fewer parameters than competing methods.

Query → Text of Top-1 video

Objects (26%)
  cartoon show for kids → pokemon video game play
  little pet shop cat getting a bath and washed with little brush → several dogs playing dead
Attributes of Objects (6%)
  a little boy singing in front of judges and crowd → a woman singing on the voice
  a woman is mixing food in a mixing bowl → a man is stirring something in a pot
Action (6%)
  a person is connecting something to system → a man looks at the battery of a computer
  a boy plays grand theft auto 5 → a narrator explains where to find a rare vehicle in grand theft auto
  a man is giving a review on a vehicle → a person is discussing a car
  a naked child runs through a field → the girl shows the boys her medal in this cartoon
  a man is singing and standing in the road → a man in sunglasses and a blue shirt beat boxes
Specific vs General (62%)
  some cartoon characters are moving around an area → a cartoon girl and animal jumping on body of male guy girl image still shown displaying on screen
  baseball player hits ball → people are playing baseball
  the man in the video is showing a brief viewing of how the movie is starting → scrolling the the menu of movieclips with different movie trailers
  a student explains to his teacher about the sheep of another student → there is a guy talking to his father
  a video about different sports → a woman talks about horse racing

Table 10: Error analysis for text-video retrieval on MSR-VTT over 100 errors, grouped into four categories: objects, attributes of objects, actions, and specific vs. general. Specific videos for general queries (or vice versa) sometimes may not be errors but are hard to evaluate.

Hypothesis → Reference

  add the lamb to the pan → add the lamb to the pot
  add the cilantro cilantro and lime juice to the pot → cut the cilantro and lime
  add the onions to a pot of water → add flour to the pot and stir
  dip the onion rings into the batter → dip the shrimp in the batter
  add water to the bowl and mix → pour water into the flour mixture and mix
  remove the mussels from the pot → once the shrimps are defrosted drain the water
  add the sauce to the pan and stir → add the sauce to the wok and stir
  add lemon juice to the pan and stir → add rice vinegar and lemon juice to the pan and stir
  add the beef to the pan and stir → add the diced beef meat to it and roast it

Table 11: Examples of generated captions (hypothesis) versus ground-truth captions (reference) for video captioning on Youcook2.