VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.


Introduction
The popular "pre-training + fine-tuning" paradigm has revolutionized NLP (Devlin et al., 2019; Liu et al., 2019b; Yang et al., 2019; Lewis et al., 2020b) and CV (Chen et al., 2020a; He et al., 2020) over the last few years. Although models trained this way can achieve impressive performance, they still require task-specific annotated data and fine-tuning for each end task. Recent work adopts pre-training for zero-shot transfer to end tasks without fine-tuning, including GPT (Radford et al., 2018, 2019; Brown et al., 2020) for NLP tasks and CLIP for image classification.
This paper focuses on pre-training for zero-shot transfer to video-text understanding tasks. Our approach pre-trains a Transformer model (Vaswani et al., 2017; Devlin et al., 2019) with a contrastive objective (Oord et al., 2018; Chen et al., 2020a) using pairs of video-text clips. Unlike CLIP, which scales pre-training data for zero-shot transfer to image classification on an explicitly assembled dataset using a simple contrastive objective (Chen et al., 2020a), this paper uses a publicly established pre-training dataset, HowTo100M (Miech et al., 2019), for zero-shot video understanding. We show that the resulting pre-trained model can be either directly applied to, or fine-tuned on, a series of video-text tasks at both the global sequence and local clip/token level.
Figure 1: VideoCLIP aims for zero-shot video understanding via learning fine-grained association between video and text in a transformer using a contrastive objective with two key novelties: (1) for positive pairs, we use video and text clips that are loosely temporally overlapping instead of enforcing strict start/end timestamp overlap; (2) for negative pairs, we employ a retrieval based sampling technique that uses video clusters to form batches with mutually harder videos.
We find that straightforward objectives (Chen et al., 2020a) lead to poor results, and hypothesize that learning fine-grained associations between video and text is crucial for the success of zero-shot transfer to end tasks, since end tasks may require different granularities of video-text correspondence. The granularity can concern sequence length (such as long video versus short text (e.g., classification), token level or sequence level) and semantics ("apple" vs. "banana" or "apple" vs. "car"). Previous efforts sample short, temporally aligned video and text clips with contrastive learning within a random batch, falling short on learning the fine-grained association between video frames and word tokens.
We present VideoCLIP that aims to pre-train a unified video-text representation with contrastive learning using two key techniques (see Fig. 1) to compute the training objective.
First, we aim to improve the association of video and text with different sequence lengths. Although the majority of video clips and text transcriptions are not semantically aligned (Miech et al., 2019), current video-text models are trained with exact temporal alignment. As a result, multiple or longer text clips may have better alignment with a video clip (Miech et al., 2020) and many clips may not have any corresponding captions (see a detailed discussion of issues in §3.3). To address these issues, we pre-train with temporally overlapped pairs of video and text clips (of varying length), thereby greatly increasing the quality and quantity of the video-text alignment. We show in experiments that this simple and general approach significantly improves performance.
Second, we learn fine-grained video-text similarity from a contrastive loss with a new method for gathering (implicitly) harder negative pairs. Although existing works contrast intra-video clips via sampling multiple clips from the same video (Miech et al., 2019, 2020), we find that mining clips from other videos can provide much more challenging negatives. We propose a retrieval augmented pre-training approach to retrieve a cluster of videos that are similar to each other for each training batch. Retrieval-augmented pre-training alternates between retrieving video clusters and using the retrieved video clusters for pre-training (see §3.4 for details).
After pre-training, we apply our model for zero-shot transfer without any fine-tuning on target dataset labels. We directly use our pre-trained model on a diverse set of four tasks in five datasets, including text-video retrieval (for text-to-video similarity), VideoQA (for video-to-text similarity), action localization (for video frame to text label similarity) and segmentation (for video token to text label similarity with rejection) (see §4).
Our experiments reveal that VideoCLIP has strong performance, even compared to supervised approaches which use human-annotated labels on the downstream tasks. For example, in text-video retrieval on Youcook2 (Zhou et al., 2017), VideoCLIP outperforms all existing zero-shot methods and even outperforms fully supervised pre-training + fine-tuning methods, but without using any labels.
In summary, the main contributions of this paper include: (i) we propose to pre-train a unified model that is capable of zero-shot transfer to multiple end tasks for video-text understanding, even surpassing fully-supervised methods in some cases, and (ii) we introduce two novel techniques to improve the learning of fine-grained video-text association.

Related Work
Pre-training for Zero-shot Transfer. Recently, the paradigm of pre-training has made impressive progress with the scale of training data and computational power. For example, in NLP, the paradigm has shifted from learning word embeddings for task-specific architectures (Mikolov et al., 2013; Bojanowski et al., 2017; Peters et al., 2018), to pre-training + fine-tuning (Devlin et al., 2019; Liu et al., 2019b; Lewis et al., 2020b) and few-shot/zero-shot transfer (Radford et al., 2018, 2019; Brown et al., 2020) that have task-agnostic architectures. One line of pre-training for zero-shot transfer focuses on generative (auto-regressive) models (Radford et al., 2018, 2019; Brown et al., 2020), where examples and prompts of an end task are used as context for a language model to respond properly to that task (Brown et al., 2020); the other line of studies focuses on discriminative models (Miech et al., 2020), where a similarity search or ranking model learns a joint space (e.g., via contrastive learning (Chen et al., 2020a; He et al., 2020)) and later transfers to a particular task. Recently, CLIP transfers image-text similarity to many image classification tasks, where the text branch serves as supervision for learning a general image representation and subsequently serves as a hyper network for downstream vision tasks. Our effort aligns with the latter line of work, but is the first to transfer a pre-trained discriminative model to a broad range of tasks in multi-modal video understanding.
Multi-modal Video-Text Pre-training. Multi-modal models have also adopted the pre-training + fine-tuning paradigm. One line of work adopts multiple unimodal encoders for retrieval tasks. For example, (Miech et al., 2019, 2020; Ging et al., 2020; Gabeur et al., 2020) adopt contrastive learning for pre-training and show the possibility of zero-shot transfer to text-video retrieval tasks.
CBT (Sun et al., 2019a), HERO, VideoAsMT (Korbar et al., 2020) and UniVL (Luo et al., 2020) adopt multi-task learning (MTL) for pre-training on retrieval tasks.
HERO and UniVL (Luo et al., 2020) additionally adopt a cross-encoder to learn the fusion of different modalities.
The other line of work adopts a single cross-modal encoder and concatenates the vision and text sequences as inputs, including VideoBERT (Sun et al., 2019b), Unicoder-VL (Li et al., 2020a), VL-BERT (Su et al., 2020), UNITER, VLP (Zhou et al., 2018), ActBERT (Zhu and Yang, 2020) and VLM (Xu et al., 2021). Although this approach is intuitive, it limits the capability of zero-shot transfer. For example, it is non-trivial to perform retrieval tasks on a single encoder, as feeding vision and text in a pairwise manner is neither flexible nor data efficient (Luo et al., 2020).
Retrieval Augmented Training. Augmenting traditional training with a non-parametric retrieval component has recently shown impressive results in pre-training (Khandelwal et al., 2019; Guu et al., 2020; Lewis et al., 2020a) and QA (Izacard and Grave, 2020; Karpukhin et al., 2020). We find that contrastive learning and retrieval augmented training can have good synergy because the former aims to discriminate examples and the latter aims to find harder examples for discrimination. To the best of our knowledge, there is no existing work on retrieval augmented training for video, perhaps because videos exhibit unique challenges for data-efficient training (see §3.4).

VideoCLIP Pre-training
In the paradigm of multi-modal video-text pre-training for zero-shot transfer, the key challenge is to learn fine-grained association between video and text to cover the diverse needs of end tasks. We cover VideoCLIP pre-training in this section, and discuss the needs of zero-shot transfer to different end tasks in the next section. We first describe the video and text model backbones and the contrastive loss; then we propose overlapped video and text clips to improve the association of positive pairs; lastly, we describe retrieval augmented pre-training to improve the mining of negative examples.

Video and Text Encoding
VideoCLIP consumes pairs of video and text clips (v, t) as inputs. It makes no assumptions on the encoder architectures and can work with any video and text backbone. We use a Transformer (Vaswani et al., 2017) model for both video and text. The video features, extracted by a convolutional neural network (CNN), are first projected to video tokens before being fed into our video transformer, as described next.
Video and Text Transformers. Let $\mathbf{c}_v$ be a video clip of a sequence of continuous frames (we use bold symbols to indicate sequences). We feed $\mathbf{c}_v$ into a (frozen) pre-trained video encoder $f_{\theta_{\text{CNN}}}$ and then apply a trainable MLP, $f_{\theta_{\text{MLP}}}$, with weights $\theta_{\text{MLP}}$ to obtain video tokens $\mathbf{x}_v \in \mathbb{R}^d$ with the same embedding dimension, $d$, as for word embeddings in our architecture:
$$\mathbf{x}_v = f_{\theta_{\text{MLP}}}\big(\text{stopgrad}(f_{\theta_{\text{CNN}}}(\mathbf{c}_v))\big), \tag{1}$$
where stopgrad is a stop-gradient operation, to reflect that the video CNN is frozen.
Similarly, vectors for text tokens $\mathbf{x}_t$ are obtained via embedding lookup as in BERT (Devlin et al., 2019). Then $\mathbf{x}_v$ and $\mathbf{x}_t$ are fed into two separate trainable Transformers, $f_{\theta_v}$ and $f_{\theta_t}$, respectively, to obtain the hidden states for video and text tokens:
$$\mathbf{h}_v = f_{\theta_v}(\mathbf{x}_v), \quad \mathbf{h}_t = f_{\theta_t}(\mathbf{x}_t). \tag{2}$$
To obtain the hidden states (i.e., global features) of video and text clips, we apply average pooling over the sequence of tokens for video and text, respectively:
$$z_v = \text{AvgPool}(\mathbf{h}_v), \quad z_t = \text{AvgPool}(\mathbf{h}_t). \tag{3}$$
We use average pooling (instead of using the [CLS] token) to encourage $f_{\theta_v}$ and $f_{\theta_t}$ to learn token-level representations that may benefit token-level tasks, such as action localization and action segmentation (see Section 4). VideoCLIP aims at pre-training the unified video-text representation, captured by the Transformer model parameters $\theta_v$ and $\theta_t$ for video and text, and consequently using it for zero-shot downstream tasks. In the appendix, we also explore shared weights for video and text, $\theta_v \equiv \theta_t$, and our ablations show that separate video/text transformers yield slightly better performance.
Notably, using a frozen video backbone (f θ CNN ) enables us to go beyond short-term visual input (typical video CNNs (Xie et al., 2018;Feichtenhofer et al., 2019) only capture temporal windows of ∼3 seconds), and allows us to model long-term visual-textual correspondences spanning ∼32 seconds. We describe our training methodology next.

Contrastive Loss
We use a contrastive loss (InfoNCE (Oord et al., 2018) objective) to learn the correspondence between video and text.
In particular, we minimize the sum of two multimodal contrastive losses:
$$\mathcal{L} = -\sum_{(v,t) \in B} \big( \log \text{NCE}(z_v, z_t) + \log \text{NCE}(z_t, z_v) \big), \tag{4}$$
where $B$ is the batch that contains sampled video-text pairs, and $\text{NCE}(z_v, z_t)$ and $\text{NCE}(z_t, z_v)$ correspond to the contrastive loss on video-to-text similarity and text-to-video similarity. Specifically, the video-to-text contrastive loss is given by
$$\text{NCE}(z_v, z_t) = \frac{\exp(z_v \cdot z_t^{+} / \tau)}{\sum_{z \in \{z_t^{+}, z_t^{-}\}} \exp(z_v \cdot z / \tau)}, \tag{5}$$
with $\tau$ being a temperature hyper-parameter, $z_t^{+}$ the positive embedded text clips overlapping with video clip embedding $z_v$, and $\{z_t^{-}\}$ the negative embedded text clips that are implicitly formed by other text clips in the training batch. The text-to-video loss $\text{NCE}(z_t, z_v)$ is defined symmetrically. The next sections (§3.3 and §3.4) describe how we construct the positives, $z_t^{+}$, and negatives, $\{z_t^{-}\}$, in our pre-training objective (5).
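As a minimal illustration of the symmetric objective, the sketch below computes the two NCE terms for a small batch in plain Python. It assumes one positive per clip and treats all other clips in the batch as negatives; the function names (`nce`, `videoclip_loss`) are hypothetical and this is not the released implementation.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nce(query, keys, positive_index, tau=1.0):
    """Softmax probability of the positive key among all keys.

    Numerator: exp(query . positive / tau); denominator: the same
    quantity summed over the positive and all in-batch negatives.
    """
    logits = [dot(query, key) / tau for key in keys]
    top = max(logits)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return math.exp(logits[positive_index] - log_denominator)


def videoclip_loss(z_video, z_text, tau=1.0):
    """Sum of video-to-text and text-to-video terms over the batch.

    Assumption: z_video[i] and z_text[i] form the i-th positive pair.
    """
    loss = 0.0
    for i in range(len(z_video)):
        loss -= math.log(nce(z_video[i], z_text, i, tau))  # video -> text
        loss -= math.log(nce(z_text[i], z_video, i, tau))  # text -> video
    return loss
```

With well-aligned pairs the loss is lower than with the same embeddings paired incorrectly, which is the behaviour the objective is meant to reward.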

Overlapped Video-Text Clips
To build overlapping positive video/text pairs, we (i) sample a text clip (because sampling a video clip first may not have nearby corresponding text); (ii) sample a timestamp within the boundary of text clip as the center for a video clip; (iii) grow a video clip with random duration (up to ∼32 seconds) from this center timestamp.
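The three sampling steps can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes text clips are already given as (start, end, text) tuples, that the video clip grows symmetrically around the sampled center, and that its duration is drawn uniformly between 3 and 32 seconds; the function name and arguments are hypothetical.

```python
import random


def sample_overlapped_pair(text_clips, video_duration,
                           min_video_len=3.0, max_video_len=32.0, rng=random):
    """Sample one temporally overlapping (video span, text clip) pair.

    text_clips: list of (start_sec, end_sec, text) tuples for one video.
    Returns ((video_start, video_end), (text_start, text_end, text)).
    """
    # (i) sample a text clip first, so the pair always has text.
    t_start, t_end, text = rng.choice(text_clips)
    # (ii) sample a center timestamp inside the text clip boundary.
    center = rng.uniform(t_start, t_end)
    # (iii) grow a video clip of random duration around that center
    # (symmetric growth is an assumption of this sketch).
    duration = rng.uniform(min_video_len, max_video_len)
    v_start = max(0.0, center - duration / 2.0)
    v_end = min(video_duration, center + duration / 2.0)
    return (v_start, v_end), (t_start, t_end, text)
```

Because the center lies inside the text clip, the video span always overlaps the text span, while its start and end timestamps generally differ from those of the text.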
Our empirical results show this simple method works well in practice, and we discuss its benefits w.r.t. prior efforts next.
Low Relevance Temporal Alignment. Existing video-text pre-training methods, e.g., (Miech et al., 2019), consider temporally exactly aligned clips (video and text clips sharing the same start/end timestamps). Although strict alignment seems natural, it is less likely that temporally aligned video and text clips are also semantically close in short clips. For example, a video clip of "a person speaking" may have a low relevance with the exact temporally aligned transcription "I am going to show you how to cook fried rice". However, a later video clip showing "rice in wok" may have a better semantic visual alignment. One explanation for this low relevance of temporal alignment is that humans are less likely to speak and perform actions simultaneously.
Using exact temporal alignment limits the examples considered in the contrastive loss. Taking the previous NCE(z_v, z_t) term as an example, the low relevance (positive) pair could be in the numerator of the objective (5), whereas higher relevance pairs (e.g., rice in wok appearing later in a video with an introductory text clip of "I am going to show you how to cook fried rice") are possibly used as negative pairs, under exact temporal alignment for constructing positive/negative samples. Although existing work (Miech et al., 2020) aligns multiple nearby text clips with one (short) video clip of fixed 3.2 seconds duration, this only partially solves the low relevance problem and can attenuate noise, as the text clips may only partially correspond to the visuals and might have no temporal overlap with the short-duration video clip per se.
Better Video-Text Association. As such, we believe a (self-supervised) method that can curate higher relevance video-text pairs at a large scale is crucial for effective learning. Our approach of sampling video and text pairs (v, t) of different lengths while requiring temporal overlap improves video-text relevance and encourages fine-grained association. As such, a video (or text clip) has a better chance of being aligned with or supervised by nearby text, and vice versa. By contrast, video clips without any temporally aligned text never contribute as a positive video-text pair in our objective.

Retrieval Augmented Training
Our intention is to learn to model more fine-grained video-text similarity by using difficult examples in our contrastive pre-training objective (5). We construct negatives in our training batch by using hard pairs {z_t^-}, which are semantically close to the pairs in the numerator, using retrieval based sampling.
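Recall that contrastive loss (e.g., in equation (5)) uses positive pairs in a batch B, and typically negative pairs are implicitly induced from other positive pairs in the same batch.
Dense Video Cluster Retrieval. Our approach aims to find video clusters to construct a batch of training samples. We formulate this as a dense retrieval process on the latent space of a video, derived from the video/text embeddings of our transformer that is trained by the contrastive loss (5).

Algorithm 1: Retrieval Augmented Training
Input: V is video set; M is model.
1 foreach epoch do
2   infer global features z_V for all videos in V with M, averaging the embeddings of each video's video-text clips;
3   build a dense index on the global features of all videos;
4   retrieve |C| video clusters, where each cluster c ∈ C is sampled from the nearest neighbors of a random video V;
5   sample overlapped video-text pairs from c ∈ C to train M.
6 end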
Our overall training process can be described as a two-stage method that alternates between retrieval and training in each epoch, and is summarized in Algorithm 1.
For each epoch, Lines 2-4 correspond to the retrieval stage and Line 5 corresponds to the training stage. Specifics are as follows.
Line 2 computes the global features z V for each video by averaging the embeddings of all of its video-text clips. An ablation (in appendix) shows that this is better than using the starting clip of a video to infer the representative video embedding.
Line 3 constructs the dense index for all videos to be used in our retrieval-based training (we use FAISS: https://github.com/facebookresearch/faiss).
Line 4 first finds |C| (corresponds to the number of overall batches in the training set) random videos, where each video V yields a video cluster c as follows. We sample |c| videos from k neighboring videos of V. Instead of searching k nearest videos directly (see ablation in Table 7), we sample k videos from the 2k nearest videos. This is because we want videos in a cluster to be mutually closer to each other (not all close to video V). In this way, all video/text clips sampled from one video can serve as negative examples for clips sampled from another video.
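The retrieval stage can be sketched in a few lines of plain Python. This is a simplified stand-in under stated assumptions: a brute-force dot-product search replaces the FAISS index, the anchor video is allowed to appear among its own neighbors, and the names `video_global_feature` and `retrieve_clusters` are hypothetical.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def video_global_feature(clip_pairs):
    """Average the embeddings of all video and text clips of one video.

    clip_pairs: list of (z_v, z_t) embedding pairs from the same video.
    """
    vectors = [z for pair in clip_pairs for z in pair]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def retrieve_clusters(video_features, num_clusters, k, rng=random):
    """Build video clusters by sampling k videos from the 2k nearest.

    video_features: dict mapping video id -> global feature vector.
    Similarity is the dot product (an assumption of this sketch).
    """
    ids = list(video_features)
    clusters = []
    for _ in range(num_clusters):
        anchor = rng.choice(ids)  # a random video seeds each cluster
        ranked = sorted(
            ids, key=lambda v: -dot(video_features[anchor], video_features[v]))
        neighbours = ranked[:2 * k]  # brute-force stand-in for the dense index
        clusters.append(rng.sample(neighbours, min(k, len(neighbours))))
    return clusters
```

Sampling from a wider neighborhood rather than taking the top k directly keeps cluster members close to one another without tying every member to the anchor; each cluster then supplies the videos for one training batch.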

Zero-shot Transfer to End Tasks
We present methods for zero-shot transfer of VideoCLIP to a variety of end tasks (without using any labels). For each task, we specify requirements that highlight the relevant aspect of pre-training.
Text→Video Retrieval. Text→video retrieval tests the text-to-video similarity computed on the learned video-text representation. NCE(z_t, z_v) in Equation 4 contributes to this task as it discriminates different video clips in the numerator and denominator for a given text clip. It also tests the distribution of hard negative examples in the denominator, given that it reports multiple recall metrics.
Multiple-choice VideoQA. In multiple-choice VideoQA (Yu et al., 2018), the model aligns each video with one out of several text candidate answers. It tests video→text similarities with a pre-trained model. We formulate this task as ranking candidate textual answers for a given video question query. This corresponds to the NCE(z_v, z_t) term in Equation 4, where the subtle differences in texts are discriminated against each other.
Action Segmentation. Action segmentation assigns each token (or frame) of a video one of the pre-defined labels to separate meaningful segments of videos from the remaining tokens (or frames). This is similar to sequence labeling (e.g., named entity recognition (NER)) in NLP. Inspired by the setup of CLIP, the text encoder of VideoCLIP can serve as self-supervision for videos during pre-training and as a hyper network to provide hidden states of segment textual labels for a video token. As such, the hidden state of each video token can have a distribution of similarity over segment labels. This task tests video token to text similarities.
One challenge in action segmentation is that it contains an Outside label that does not exist in transcription during pre-training. This Outside label is task-dependent because it means a token does not belong to any of the pre-defined labels. This is similar to open set recognition (Scheirer et al., 2012) or out-of-domain intent detection (Lane et al., 2006), where the rejection label is not present during training but all new classes during inference (not shown in training) should be covered by the rejection label.
Let $t \in L$ be one label in the set of all labels $L$ excluding the Outside label. We apply the following conditions to each video token $u$ to curate the prediction $\hat{y}_u$ with the Outside label:
$$\hat{y}_u = \begin{cases} \arg\max_{t \in L} h_u z_t^{T} & \text{if } \max_{t \in L} h_u z_t^{T} > \gamma, \\ \text{Outside} & \text{otherwise,} \end{cases}$$
where $\gamma$ is a threshold. Note that in zero-shot transfer, there is no access to training or validation data to decide a threshold as a hyper-parameter. Thus, we estimate $\gamma$ as the maximum of dot products of intra-labels: $\gamma = \max(z_t z_{t'}^{T})$, where $t \in L$, $t' \in L$ and $t \neq t'$.
Action Step Localization. In this task, each video is associated with a "task" with multiple steps $S$, where each step $t \in S$ is described as a short text. Action step localization is to assign each video token to one or multiple steps in the associated task. This is similar to action segmentation except that the label set is not pre-defined and does not contain the Outside label. As such, we first obtain the hidden states for each video frame (or token) $h_u$ from the transformer. Then we separately forward text labels into the text backbone to obtain the hidden states of step labels $z_S$. The distribution of each video token over steps is predicted as $\text{Softmax}(h_u z_S^{T})$.
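The two token-level procedures can be illustrated with a short plain-Python sketch. It assumes that a token receives its best-scoring label when that score exceeds γ and the Outside label otherwise, with γ taken as the largest dot product between two distinct label embeddings; the function names and the `outside=-1` marker are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def outside_threshold(label_embeddings):
    """gamma: maximum dot product between two distinct label embeddings."""
    n = len(label_embeddings)
    return max(dot(label_embeddings[i], label_embeddings[j])
               for i in range(n) for j in range(n) if i != j)


def segment_tokens(token_states, label_embeddings, outside=-1):
    """Assign each video token a label index, or `outside` if rejected."""
    gamma = outside_threshold(label_embeddings)
    predictions = []
    for h_u in token_states:
        scores = [dot(h_u, z_t) for z_t in label_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        predictions.append(best if scores[best] > gamma else outside)
    return predictions


def localize_steps(token_states, step_embeddings):
    """Softmax over step labels for each video token (no Outside label)."""
    distributions = []
    for h_u in token_states:
        scores = [dot(h_u, z_s) for z_s in step_embeddings]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        distributions.append([e / total for e in exps])
    return distributions
```

Estimating γ from the labels alone keeps the procedure free of any validation data, at the cost of a threshold that depends on how similar the label texts are to each other.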

VideoCLIP Pre-training
For pre-training, we use HowTo100M (Miech et al., 2019), which contains instructional videos collected by searching YouTube with keywords from wikihow. We use 1.1M videos after filtering out videos which are not available or cannot be decoded. We randomly sample 4K videos as the validation set and use the rest for pre-training. On average, the duration of each video is ∼6.5 minutes with ∼110 clip-text pairs. After removing repeated words from ASR, we end up with ∼7.7 GB of text transcriptions, with 2.4 tokens per second on average.

End Task Setups
Text→Video Retrieval. We use Youcook2 (Zhou et al., 2017), MSR-VTT and DiDeMo to evaluate zero-shot transfer to text-video retrieval.
VideoQA. We further use the QA test data (Yu et al., 2018) for MSR-VTT to evaluate multiple-choice VideoQA. Recall that this task can be formulated as a video-text retrieval task except the candidate textual answers are associated with each video and only one answer is correct (most relevant). On average, VideoQA for MSR-VTT has 5 candidate answers per video.
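Viewed as ranking, zero-shot multiple-choice VideoQA reduces to picking the candidate answer whose text embedding is most similar to the video embedding. The sketch below assumes dot-product similarity on pooled embeddings; the function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def answer_question(z_video, candidate_answers):
    """Return the index of the candidate with maximum video-text similarity.

    z_video: pooled video embedding; candidate_answers: list of pooled
    text embeddings, one per candidate answer.
    """
    scores = [dot(z_video, z_answer) for z_answer in candidate_answers]
    return max(range(len(scores)), key=scores.__getitem__)


def videoqa_accuracy(examples):
    """Fraction of correct answers.

    examples: list of (z_video, candidate_answers, correct_index).
    """
    correct = sum(answer_question(z_v, cands) == gold
                  for z_v, cands, gold in examples)
    return correct / len(examples)
```

No answer-specific head is trained here; the same embeddings used for retrieval are reused, which is what makes the transfer zero-shot.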
Action Segmentation. We use COIN (Tang et al., 2019) to evaluate action segmentation. It has 11,827 videos (476 hours) in total and the testing set has 2797 videos, where each video is labeled with 3.91 segments on average. There are 778 segment labels and we feed these textual labels into the text backbone to obtain their latent space. As a reminder of Section 4, we do not model the Outside label explicitly and determine an Outside label only when all other 778 labels reject a video token. Because videos in COIN can last for several minutes, we apply a sliding window with a step size of 16 seconds and a window size of 32 seconds. During inference, we average the logits for overlapped tokens from multiple windows.
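The sliding-window inference can be sketched as follows, assuming one video token per second so that a 32-second window with a 16-second step corresponds to 32 tokens with a stride of 16. The handling of the final partial window and the `score_window` callback are assumptions of this sketch, not details given in the text.

```python
def sliding_windows(num_tokens, window=32, step=16):
    """Return (start, end) token ranges that cover the whole video."""
    starts = list(range(0, max(num_tokens - window, 0) + 1, step))
    # Add one more (shorter) window if the tail is not yet covered.
    if starts[-1] + window < num_tokens:
        starts.append(starts[-1] + step)
    return [(s, min(s + window, num_tokens)) for s in starts]


def average_window_logits(num_tokens, num_labels, score_window,
                          window=32, step=16):
    """Average per-token logits over all windows that contain the token.

    score_window(start, end) is a hypothetical callback returning one
    list of `num_labels` logits for each token in [start, end).
    """
    sums = [[0.0] * num_labels for _ in range(num_tokens)]
    counts = [0] * num_tokens
    for start, end in sliding_windows(num_tokens, window, step):
        for offset, row in enumerate(score_window(start, end)):
            for j, value in enumerate(row):
                sums[start + offset][j] += value
            counts[start + offset] += 1
    return [[value / counts[i] for value in row] for i, row in enumerate(sums)]
```

With a step of half the window size, interior tokens are scored twice with different context, and averaging smooths the predictions at window boundaries.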

Action Step Localization. We use CrossTask (Zhukov et al., 2019) to evaluate action localization. It contains 83 different tasks and 4.7K videos. Each task has a set of steps in the form of text descriptions and each frame of video is annotated with one or multiple steps as a distribution. We use the testing data split via the official code, which contains 1690 annotated videos. We leave details of fine-tuning data to the appendix.

Implementation Details
Video Encoder. We use an S3D (Xie et al., 2018) as the video encoder f_θCNN. It is pre-trained on HowTo100M (Miech et al., 2020) to extract video tokens of dimension 512. We use 30 fps and extract one video token per second. This can be pre-computed for efficiency.
Transformers. For the video and text Transformers, f θv and f θt , we initialize their weights with the pre-trained BERT BASE-uncased (Devlin et al., 2019).
Using the same type of transformer further allows us to perform an ablation study on sharing video and text backbones (see Table 7). We only use the first 6 Transformer layers for the video input and all 12 layers for the text input. Please note that the video/text encoders in VideoCLIP are generally applicable to other pre-trained Transformers. We use a single layer MLP f_θMLP with GELU activation (Hendrycks and Gimpel, 2016). A text clip has a random length between 8 and 61 tokens, whereas a video clip has 3 to 32 seconds. We sample 16 video/text pairs from each video and use k=32 videos to form batches of size |B|=512.
Training Details. We pre-train our model on 8 NVIDIA Tesla V100 GPUs (each with 32 GB memory) for 25 epochs using fp16 precision for ∼1 day. We use Adam (Kingma and Ba, 2014) as optimizer with betas of (0.9, 0.98), an initial learning rate of 5e-5, 1000 steps of warm-up, and a polynomial decay learning rate schedule. Gradients are clipped at 2.0. The softmax temperature in objective (5) is set to τ = 1.0.
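For orientation, the learning rate schedule described above can be sketched as a function of the training step. The text does not give the decay power, the final learning rate, or the total number of steps, so `power=1.0`, `end_lr=0.0` and `total_steps` are assumptions of this sketch, as is treating 5e-5 as the value reached at the end of warm-up.

```python
def learning_rate(step, total_steps, peak_lr=5e-5, warmup_steps=1000,
                  power=1.0, end_lr=0.0):
    """Linear warm-up followed by polynomial decay.

    power, end_lr and total_steps are illustrative assumptions; only the
    peak value (5e-5) and the warm-up length (1000) come from the text.
    """
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    # Fraction of the decay phase that is still remaining, in (0, 1].
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (peak_lr - end_lr) * remaining ** power + end_lr
```

With `power=1.0` the decay is linear; larger values decay faster early in training.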

Main Results
We evaluate VideoCLIP on various end tasks and compare it with other zero-shot and supervised methods that use labels on the target datasets.
On MSR-VTT (Table 1, bottom), VideoCLIP shows solid improvements but with a larger zero-shot to supervised gap than on Youcook2. The major reason could be domain shift from HowTo100M to MSR-VTT. The captions in MSR-VTT are more descriptive (e.g., "a basketball player is playing basketball") and are less likely to appear in the transcriptions of HowTo100M. After fine-tuning, VideoCLIP reaches state-of-the-art R@1. Note that this is achieved without using any supervised data such as ImageNet or large-scale external data (i.e., 65 million Instagram data) used by the second best method, Support Set.
On DiDeMo (Table 2), VideoCLIP has better performance than most supervised methods. Note that ClipBERT has image pre-training before video+text fine-tuning.
methods but similarly suffers from domain shift from HowTo100M to MSR-VTT. After fine-tuning, it reaches the best performance, indicating VideoCLIP also provides strong features for fine-tuning.
Action Segmentation. We report the results of action segmentation on COIN in Table 4. Zero-shot transfer of VideoCLIP to COIN outperforms all supervised methods, without using any labels on this dataset. This indicates that VideoCLIP also learns good token-level video representations. Fine-tuning VideoCLIP further yields a ∼10% accuracy gain, indicating potential room for improvement.

Action Step Localization. Lastly, we report VideoCLIP's performance on CrossTask in Table 5. It shows a small gap to supervised methods when using zero-shot action step localization. Fine-tuning leads to a ∼10% gain, outperforming all prior work on this dataset.

Discussion on Work that Fine-tunes CLIP Model
There are concurrent works (Luo et al., 2021; Portillo-Quintero et al., 2021) that use an image+text model for video+text downstream tasks. Note that Luo et al. (2021) and Portillo-Quintero et al. (2021) use image pre-training (no video pre-training) and transfer to videos, whereas our focus is on improving video pre-training using a novel pre-training objective. Besides this conceptual difference, Luo et al. (2021) and Portillo-Quintero et al. (2021) use a pre-trained image CLIP model from OpenAI, which is trained on huge, semi-curated web image+text pairs and provides exceptional zero-shot performance on many datasets (e.g., ImageNet). However, the CLIP pre-training data is sourced from web search engines (which on their own use fully supervised neural networks trained on ImageNet and other datasets); it is therefore not fair to compare it to our approach, which only trains on HowTo100M instructional videos.

Ablation Study
In Table 7, we perform an ablation study on zero-shot transfer for text→video retrieval on Youcook2 to quantify the contribution of overlapping clips and retrieval augmented pre-training.
In the first group, we study the effectiveness of the two proposed methods. VideoCLIP without retrieval augmented training significantly drops performance by over 4% in R@1, and additionally using exact alignment positives, i.e., the same start/end timestamp for a pair of video and text clips, causes another 4% drop in R@1. Therefore, both techniques combined lead to a ∼50% relative improvement in recall. Further, by using MIL-NCE clips and loss we evaluate the potential benefit of using the training objective from MIL-NCE (Miech et al., 2020) (which uses multiple temporally adjacent clips as positives) in our architecture. This ablation isolates the pre-training objective from model and data. We observe that the MIL-NCE loss can improve the direct alignment objective but performs significantly worse than our objective (16.1 vs. 22.7 R@1).

Table 6:
Query Text | Text of Top-1 video from VideoCLIP (Zero-shot) | Text of Top-1 video from VideoCLIP (Fine-tuned)
pick the ends off the verdalago | put chickpeas parsley chopped onion chili powder ground cumin in food processor | pick the ends off the verdalago
add the fried pita to the salad and mix | toss the salad | add the dressing and bread pieces the the salad
place chicken in hot oil and fry until golden brown | fry the chicken in oil | fry the chicken wings in deep oil
fry dark meats together and white meats together | add the mutton to the pan | add the diced beef meat to it and roast it
rub salt and pepper onto the chicken | season them with salt and pepper | rub salt and pepper onto the chicken
In the second group, we further study the design choices of modeling. shared video/text transformer indicates f θv is the same as f θt , which only decreases performance slightly. This suggests that using a joint backbone for video and text is effective.
retrieve k indicates directly searching the k nearest neighbors instead of sampling k videos from the 2k nearest neighbors (used by VideoCLIP) in Line 4 of Algorithm 1. Sampling from nearest neighbors yields video clusters of better quality.
use starting 32 sec for retrieval indicates using the first 32 secs of a video as representation for video retrieval, which is an inferior representation of the whole video.
Unlike average pooling, using only the [CLS] token prevents VideoCLIP from exploiting token-level information and thus yields worse performance.

Qualitative Analysis
We examine errors for text-video retrieval on Youcook2 in both the zero-shot transfer and fine-tuning settings in Table 6. We observe that in zero-shot transfer, VideoCLIP has no prior knowledge about a particular task/dataset regarding how long the text and video clips that are paired together should be for the text-retrieval task. Fine-tuning allows this type of error to be corrected. Further, we observe that VideoCLIP tends to mix objects of similar color/shape together. We leave incorporating this type of knowledge into pre-training to future work.

Conclusion
We have presented VideoCLIP, an approach to pre-train a video-text model for zero-shot transfer to end tasks that require fine-grained association between video and language. VideoCLIP uses an objective that contrasts temporally overlapping positives with hard negatives stemming from nearest neighbor retrieval. In evaluation this approach outperforms prior work on a variety of tasks, without any supervision on downstream datasets, and in some cases VideoCLIP is competitive with or better than prior work that uses full supervision; nevertheless, we still observe gains from fine-tuning our model. We hope that our code and model will foster future research in multi-modal video understanding.

A Supplementary Material for VideoCLIP
This supplementary material is organized as follows. First we provide additional experimental setups for each end task. Then we specify the hyper-parameters in our model and detail how we train VideoCLIP. Lastly, we provide extra ablations and analysis of various VideoCLIP configurations.

A.1 End Task Setup Details
Text-Video Retrieval. We use Youcook2 and MSR-VTT to evaluate text-video retrieval. We directly use our video and text Transformers to encode the videos and the text queries and measure the text-to-video similarities for retrieval. Youcook2 (Zhou et al., 2017) is a collection of 2K cooking videos with a total duration of 176 hours and 5.26 minutes on average per video. It contains 89 recipes in 14K video clips where each clip is annotated with one descriptive sentence. We follow the splits defined in Miech et al. (2019) and make sure there is no overlap between pre-training and evaluation data. After filtering out unavailable ones, we obtain 9,473 training clip-text pairs from 1222 videos and 3,305 test clip-text pairs from 430 videos.
MSR-VTT (Xu et al., 2016) is a widely compared benchmark dataset for text-video retrieval and video question answering. It contains open-domain videos where each video clip is around 10 seconds. Each training clip has 20 captioning sentences labeled by a human. In total, there are 200K clip-text pairs from 10K videos. Following JSFusion (Yu et al., 2018; Miech et al., 2019), we sampled 1K clip-text pairs as the test data and the rest is used for training.
Multiple-choice VideoQA. We use the testing split and data in (Yu et al., 2018) on MSR-VTT to evaluate multiple-choice VideoQA. On average, VideoQA for MSR-VTT has 5 candidate answers per video. Recall that this task can be formulated as a video-text retrieval task except the candidate textual answers are associated with each video and only one answer is correct (most relevant). In practice, we find the answer with the maximum similarity between a video and all candidate answers.
Action Segmentation. We use COIN (Tang et al., 2019) to evaluate action segmentation. COIN contains 11,827 videos (476 hours) in total and the testing set has 2797 videos, where each video is labeled with 3.91 segments on average.
There are 778 segment labels and we feed these textual labels into the text backbone to obtain their latent space. We do not model the Outside label explicitly and determine an Outside label only when all other 778 labels reject a video token. Because videos in COIN can last for several minutes, we apply a sliding window with a step size of 16 seconds and a window size of 32 seconds. During inference, we average the logits for overlapped tokens from multiple windows. We follow the original split of COIN for training and evaluation.
Action Step Localization. CrossTask (Zhukov et al., 2019) is used to evaluate action localization. There are 83 different tasks and 4.7K videos where each task has a set of steps in the form of text descriptions and each frame of video is annotated with one or multiple steps as a distribution. We use the testing data split and the official codebase (https://github.com/DmZhukov/CrossTask) that contains 1.7K videos. We use 540 annotated videos for supervised training. Recall that action step localization tests the video's token-level features, so we use the representations h_v of the last layer of BERT before average pooling. We compute the distribution of similarity for each token over the latent space of textual labels of steps.