Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO

By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, these datasets have limited cross-modal associations: images are not paired with other images, captions are only paired with other captions of the same image, there are no negative associations, and many positive cross-modal associations are missing. This undermines research into how inter-modality learning impacts intra-modality tasks. We address this gap with Crisscrossed Captions (CxC), an extension of the MS-COCO dataset with human semantic similarity judgments for 267,095 intra- and inter-modality pairs. We report baseline results on CxC for strong existing unimodal and multimodal models. We also evaluate a multitask dual encoder trained on both image-caption and caption-caption pairs; crucially, its results demonstrate CxC's value for measuring the influence of intra- and inter-modality learning.


Introduction
Phrases such as blue, chair, and garden path have strong visual components, yet computational word representations are usually created with text-only corpora. Encouragingly, some recent work that derives representations using visual contexts shows improvements for both word similarity ranking and image-text retrieval (Kiros et al., 2018), and query-based training of image models demonstrates language's power to improve image representations (Juan et al., 2020). Learning representations for both vision and language jointly should be even more effective; indeed, much progress has been made on such cross-modal learning using image captioning data (Karpathy and Li, 2015; Harwath and Glass, 2017; Faghri et al., 2018; Li et al., 2019). However, it is not yet clear whether learning representations in multimodal contexts improves performance within as well as across modalities, as no existing datasets are well suited for answering this question.
Image captioning datasets such as Flickr8k (Rashtchian et al., 2010), Flickr30k (Young et al., 2014), Multi30k (Elliott et al., 2016), Microsoft Common Objects in COntext (MS-COCO) (Lin et al., 2014), and Conceptual Captions (Sharma et al., 2018) only capture relationships between images and the textual captions created for them. They miss many valid relationships between unassociated images and captions, from captions to other captions, and from images to other images. We address this gap with Crisscrossed Captions (CxC, exemplified in Figure 1), a dataset with graded, denser annotations for relationships between and among captions and images in the MS-COCO evaluation splits of Karpathy and Li (2015) (25k English captions and 5k images each).
CxC extends MS-COCO's existing image-caption pairs with continuous (0-5) semantic similarity ratings for those pairs and for new pairs. The rating criteria extend those used for Semantic Textual Similarity (Agirre et al., 2012). Intramodal pairs are selected for annotation via an indirect sampling scheme biased to yield a broad distribution of similarities. In all, CxC contains human ratings for 267,095 pairs (derived from 1,335,475 independent judgments), a massive extension in scale and detail of the 50k original binary pairings.
MS-COCO incompletely supports three retrieval tasks: image-text, text-image and text-text. CxC enhances all of these with new positive pairs, and it also supports a new image-image retrieval task. With its graded similarity judgments, CxC also supports correlation measures comparing model and human rankings. Retrieval metrics focus on positive pairs, but CxC's correlation scores additionally account for low-scoring items (non-matches). Supporting these evaluations on a common set of images and captions makes them more valuable for understanding inter-modal learning, compared to disjoint sets of caption-image, caption-caption, and image-image associations. Also, multimodal representations such as CLIP (Radford et al., 2021) are useful for downstream tasks such as Visual Question Answering (Goyal et al., 2017), Vision and Language Navigation (Majumdar et al., 2020), Referring Expressions (Yu et al., 2018) and Visual Commonsense Reasoning (Zellers et al., 2019), and we hope the additional relationships and evaluations provided by CxC will help develop even better representations for tasks that span these modalities.
We furthermore demonstrate CxC's utility by evaluating a dual encoder that combines a bidirectional loss for image-text retrieval with a loss for text-text retrieval. The text encoder is composed of transformer layers over pre-trained BERT word representations, and the image encoder is a pre-trained EfficientNet (B4) (Tan and Le, 2019a). This model delivers the strongest overall performance across all four retrieval tasks and in correlation with human scores for text-text, image-image and image-text similarity. Compared to the same dual encoder trained only with image-text pairs, this model realizes small gains on image-text tasks and large gains on the text-text task, but with some degradation on image-image tasks. This indicates that the model trades capacity to encode images for better text encoding, an insight that would not be easily assessed without CxC's image-image annotations.
Our main contributions are the following:
• We describe a method for sampling items to obtain a broad distribution of similarities.
• We annotate the semantic similarity of 267,095 pairs. These enhance existing retrieval tasks and support a new image-image retrieval task. They also support correlation measures, which assess models' judgments of both positive and negative associations.
• We establish baseline scores for existing models and a multitask dual encoder on all tasks and demonstrate that CxC allows model performance to be assessed more holistically.
• With its new positive pairs, CxC improves the recall@k measures common in image-text and text-image retrieval, yielding a 1-3% increase in recall@k over several models.
• We release CxC's annotations at https://github.com/google-research-datasets/Crisscrossed-Captions, along with code to merge CxC with existing MS-COCO data.

Dataset Collection
Existing resources already support learning joint representations of images and text. However, we need better evaluation resources, so we extend the MS-COCO evaluation splits with graded similarity associations within and across modalities. MS-COCO has five captions for each image, split by Karpathy and Li (2015) into 410k training, 25k development, and 25k test captions (82k/5k/5k for images). An ideal extension would rate every pair, but this is infeasible, and most pairs are dissimilar anyway. To obtain new pairs with high expected similarity, we introduce a biased sampling scheme.
The data is collected in two phases. First, we define an indirect sampling scheme that uses model-based similarities from the co-modality items to select intramodality pairs. We use these items and their human ratings to select intermodality pairs for annotation. We also annotate all existing intermodal pairs and a large sample of co-captions (captions associated with the same image). See the appendix for details about the annotation interface and instructions, the composition of the dataset and illustrative examples.

[Figure 2: Example MS-COCO captions. Caption 1: A tennis player swinging a racket at a ball. Caption 2: A man playing tennis with a crowd watching. Caption 3: A living room with some black furniture and a colorful rug. Caption 4: A dog laying on a leather sofa in a living room.]
Intramodality Two images of a man and a dog can be described differently, while two similar sentences about a man and a dog can describe dissimilar images. In Figure 2, caption 1 gives a visual description while caption 2 gives a broader event description. Divergences also occur when caption creators perceive a scene differently: caption 3 describes the room and caption 4 focuses on the dog and sofa. This semantic gap between images and their captions creates an opportunity to sample intramodal pairs with varying similarities. Our key idea is to use model-based similarities of images for biased sampling of caption pairs, and vice versa, and use existing image-caption pairs as pivots between modalities. This selects image pairs that are different in appearance but similar in what they depict based on their descriptions, and vice versa.
Denote the known images and captions as V = (v_1 ... v_n) and C = (c_1 ... c_n), the latter representing co-caption groups of five captions each. Each item is encoded with an off-the-shelf unimodal model. Cosine similarity between items defines two symmetric matrices: S_C (pairwise caption similarities) and S_V (pairwise image similarities). The diagonals are set to zero so that identical items are never sampled.
We encode images with Graph-RISE (Juan et al., 2020) and construct S_V, the image-based similarity for pairs of co-caption groups. We encode captions with the Universal Sentence Encoder (USE) (Cer et al., 2018) and an averaged bag of words (BoW) based on GloVe embeddings (Pennington et al., 2014). Co-caption representations are averaged to create a single representation. From these, we construct S_C, the caption-based similarity for image pairs. USE and BoW embeddings produce two S_C matrices, but we gloss over this detail below.
We use S_C to select image pairs and S_V to select caption pairs. Because of the cross-modal semantic gap and the diversity and size of the underlying data, these pairs exhibit a wide range of similarity. Selecting the five most similar items (according to the model-based S_V and S_C) thus produces a good representation of varying amounts of similarity as judged by people. Because S_V covers co-caption groups, one caption is randomly chosen from each group to produce a caption pair for rating.
Caption-caption and image-image candidates are referred to as C2C and I2I, respectively. I2I pairs are selected with the other-modality method described above. For C2C pairs, we sample half the pairs using the other-modality method and half from within co-captions. The latter introduces (mostly) positive associations between caption pairs describing the same image. This gives a balanced set of caption pairs describing the same and different images.
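For concreteness, a minimal sketch of the other-modality selection step, assuming precomputed, L2-normalized embeddings; the top-5 value matches the description above, while the array and variable names are illustrative placeholders rather than the exact pipeline:

```python
import numpy as np

def top_k_pairs(embs, top_k=5):
    """Return (i, j) index pairs of the top_k most similar items for each
    item, using cosine similarity over L2-normalized embeddings."""
    sims = embs @ embs.T
    np.fill_diagonal(sims, 0.0)            # never pair an item with itself
    pairs = []
    for i in range(sims.shape[0]):
        for j in np.argsort(-sims[i])[:top_k]:
            pairs.append((i, int(j)))
    return pairs

# Other-modality selection: image similarity picks which co-caption groups
# to pair; one caption is then drawn at random from each selected group.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 64))
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
co_captions = [[f"caption {i}-{k}" for k in range(5)] for i in range(100)]

caption_pair_candidates = [
    (rng.choice(co_captions[i]), rng.choice(co_captions[j]))
    for i, j in top_k_pairs(image_embs, top_k=5)
]
```

Selecting image pairs works the same way with the roles of the modalities swapped: caption-based similarity over co-caption groups determines which images get paired.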
Pairs in C2C and I2I are scored by in-house raters using a continuous scale between 0 and 5. We adopt the widely used Semantic Textual Similarity (STS) scale (Cer et al., 2017) for text pairs and extend it to images to define Semantic Image Similarity (SIS). To reflect that this is a graded (rather than discrete) judgment, we encouraged raters to select intermediate scores such as 1.3, and we obtain the final score for a pair as the average of five individual ratings.
Intermodality We select caption-image candidates C2I based on human ratings for I2I and C2C pairs. We mainly seek new positive matches like those identified by annotators in Ilharco et al. (2019). For each I2I pair (i_j, i_k), a C2I pair (c_k, i_j) is generated, where c_k is an MS-COCO caption for i_k. We generate pairs from C2C similarly. Half of the C2I pairs are selected based on C2C ranks and the other half by I2I ranks (skipping pairs already selected from C2C). Finally, all MS-COCO pairs (25k in validation and 25k in test) are selected to obtain caption-image similarity ratings for the known items. We extend STS to define Semantic Image-Text Similarity (SITS). Raters provide a continuous score from 0 to 5 using an interface similar to that for STS and SIS. Each C2I pair receives five ratings; the average is used as the final SITS score.

Crisscrossed Captions Dataset
Using our selection and annotation methodology, we obtained ratings for 267,095 caption-caption, image-image, and caption-image pairs (1,335,475 total judgments). Figure 3 shows rating distributions for each task (validation split). It also shows the distributions of ratings for STS and SIS pairs included from other-modality selection and from original MS-COCO pairs. The test set distributions are similar. Figure 4 gives the distribution of counts of positive examples in each task (validation split), where a score ≥ 3 (for STS, SITS) and a score ≥ 2.5 (for SIS) is considered positive.
These positive examples are used for intermodal and intramodal retrieval evaluation.
STS. The majority of caption pairs selected using image similarity are negative (ratings in [0, 3)), which is expected given the divergences noted in Figure 2. Nevertheless, the approach produces 20,587 positive pairs. Table 1 shows pairs with their STS annotation scores and cosine similarity with BoW and USE embeddings. There is broad agreement, but the annotated similarity is not fully captured by either BoW or USE. USE provides a broader range, but scores the third pair lower than the fourth. BoW scores are bunched within a high similarity band that aligns well with these five examples. Overall, there is a weak positive correlation between BoW and STS scores, as shown in Figure 5, which plots average BoW cosine similarity versus STS for 1000 randomly sampled pairs. Figure 6 shows a pair of captions (and corresponding images) selected by the other-modality strategy with higher STS compared to their respective co-captions. For co-caption pairs, STS scores are more positive, but many are still negative (Figure 3, left). Thus, combining both approaches leads to a more representative distribution overall. The large number of negative pairs from co-captions underscores the problem with assuming captions of the same image are paraphrases.
SIS. All image pairs I2I are selected using the other-modality strategy. This, plus the stringent criteria for an SIS rating of 5, means very few examples are rated above 4. Nevertheless, there are many pairs with SIS ≥ 3, indicating there are many images depicting similar scenes and events.
SITS. As shown in Figure 4, there are many more pairs with 4-5 SITS ratings, compared to STS and SIS. This is by design, as the C2I pairs are selected based on decreasing STS/SIS scores. This captures more positive intermodality associations and augments the existing validation and test splits. Since these pairs are missing from the existing data, they are precisely the examples for which a model is inappropriately penalized when it correctly identifies them in image-caption retrieval. The SITS ratings collected for known pairs also support new correlation-based evaluations, as discussed in the next section.

Evaluation Tasks and Metrics
CxC supports intermodal retrieval like MS-COCO but with denser annotations between image-caption pairs. It also enables intramodal retrieval and semantic similarity correlation evaluations, which were not possible before. Karpathy and Li (2015) first used MS-COCO for image-to-caption and caption-to-image retrieval. We extend the existing associations with positive CxC pairs, and also add new caption-to-caption and image-to-image retrieval tasks using positive STS and SIS pairs (a total of four retrieval tasks). To the best of our knowledge, CxC is the first dataset to support image-to-image retrieval over captioned images. Following Karpathy and Li (2015), we evaluate using Recall@K (R@K), computed as the fraction of times a correct item was found among the top K results, and the median rank (med. r) of the closest ground truth result in the list.
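As a hedged sketch of these metrics, the following assumes a score matrix over queries and candidates and per-query sets of positive candidates derived from the CxC ratings (score ≥ 3 for STS/SITS, ≥ 2.5 for SIS, as described earlier); names and data layout are illustrative:

```python
import numpy as np

def recall_at_k(scores, positives, k):
    """scores: [n_queries, n_candidates] similarity matrix.
    positives: list of sets of gold candidate indices, one set per query.
    Returns the fraction of queries with a gold item in the top-k results."""
    hits = 0
    for q, gold in enumerate(positives):
        top_k = np.argsort(-scores[q])[:k]
        hits += bool(gold.intersection(top_k.tolist()))
    return hits / len(positives)

def median_rank(scores, positives):
    """Median (1-indexed) rank of the highest-ranked gold item per query."""
    ranks = []
    for q, gold in enumerate(positives):
        order = np.argsort(-scores[q])
        best = min(np.where(np.isin(order, list(gold)))[0]) + 1
        ranks.append(best)
    return float(np.median(ranks))
```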
Semantic similarity tasks such as Semantic Textual Similarity (Cer et al., 2017) and Visual Semantic Textual Similarity (vSTS) (de Lacalle et al., 2020) require a model to produce a continuous similarity score given two inputs. Typically, models are evaluated by the Pearson correlation of their scores with the human judgments over a set of input pairs. This is valid when training data is available to calibrate model scores to the human ratings. With CxC, we do not have such training data, so we instead use Spearman's r to assess whether a model ranks pairs similarly to human raters.
It would be tempting to simply measure Spearman's r over all pairs, but this would be flawed because CxC's dense annotation means that the scores between many pairs are themselves correlated. To mitigate this, we use a sampled bootstrap correlation instead. For each correlation estimate, we sample half of the queries (to increase diversity across samples) and for each selected query, we choose one of the items for which CxC supplies a paired rating. We compute Spearman's r between the CxC scores and the model scores for the selected pairs. The final correlation is the average over 1000 of these bootstrap samples.
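A minimal sketch of this bootstrap procedure, assuming a dict mapping each query to its CxC-rated partners and callables returning the human and model scores for a pair; all names here are placeholders rather than the released evaluation code:

```python
import random
import numpy as np
from scipy.stats import spearmanr

def bootstrap_spearman(pairs_by_query, human_score, model_score,
                       n_boot=1000, seed=0):
    """pairs_by_query: {query_id: [item_id, ...]} of CxC-rated partners.
    human_score / model_score: callables mapping (query, item) -> float."""
    rng = random.Random(seed)
    queries = sorted(pairs_by_query)
    estimates = []
    for _ in range(n_boot):
        # Sample half of the queries, then one rated partner per query,
        # so that no query contributes more than one correlated pair.
        sample = rng.sample(queries, len(queries) // 2)
        chosen = [(q, rng.choice(pairs_by_query[q])) for q in sample]
        human = [human_score(q, i) for q, i in chosen]
        model = [model_score(q, i) for q, i in chosen]
        estimates.append(spearmanr(human, model).correlation)
    return float(np.mean(estimates))
```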
vSTS (de Lacalle et al., 2020) contains 2677 pairs of MS-COCO captions and corresponding images. As noted above, vSTS is a related dataset for multimodal semantic similarity. We considered mixing CxC and vSTS; however, this was infeasible because CxC uses the widely adopted Karpathy splits, while items in vSTS's training, dev and test splits are spread among the Karpathy splits. We could not simply make a separate cut of CxC because vSTS pairs can cross splits, e.g. one item of a pair may fall in Karpathy training and the other in Karpathy test. Given the small size of vSTS, we focused our efforts on CxC evaluations.

Evaluated Models
In order to establish baselines for CxC, we benchmark pretrained models for images and text. Note that these are off-the-shelf models that have not been trained on MS-COCO. We also evaluate cross-modal retrieval models that are trained on MS-COCO. Here, we focus on models that support efficient retrieval (e.g. dual encoders). We expect models with extensive cross-modal interactions, such as ViLBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019), will show strong performance on CxC tasks, either as standalone models that (inefficiently) score all possible item pairs or as rerankers for the outputs of retrieval models.
To the best of our knowledge, there is no prior work that explores joint learning or evaluation on intra- and inter-modality retrieval tasks. Ngiam et al. (2011) and Collell Talleda and Moens (2016) show evidence that inter-modality learning helps improve intra-modality performance, but do not explore multitask learning. Lu et al. (2020) explore multitask learning but focus only on intermodal representation learning for intermodal downstream tasks. To illustrate how CxC allows us to measure how intermodal representation learning can improve both intra- and inter-modal performance, we train a dual encoder model on bidirectional image-text and text-text in-batch retrieval losses.

Pretrained Model Baselines
Text-only Models First, we use a bag-of-words (BoW) approach: the averaged GloVe embeddings (Pennington et al., 2014) of a caption's tokens serve as the caption representation. Second, the Universal Sentence Encoder (USE) (Cer et al., 2018) is a sentence-level representation model that has shown strong performance on the related STS benchmark. We use the multilingual transformer version from TensorFlow Hub (Yang et al., 2020) (universal-sentence-encoder-multilingual-large/1).

Image-only Models InceptionV3, ResNet-152, and SimCLRv2 are deep convolutional models (Szegedy et al., 2016; He et al., 2016; Chen et al., 2020a,b) trained on the ImageNet dataset. We extract 2048-dimensional image-level representations from a central crop containing 87.5% of the original image area. We access the models via TensorFlow Hub (imagenet/inception_v3/feature_vector/4, imagenet/resnet_v1_152/feature_vector/4, and gs://simclr-checkpoints/simclrv2/finetuned_100pct/r50_1x_sk0/hub/, respectively).

Intermodal Models VSE++ (Faghri et al., 2018) is a dual encoder (see Sec. 5.2) trained to learn a joint space of aligned images and captions. The state-of-the-art VSRN model (Li et al., 2019) is another dual encoder; it uses additional training annotations to predict and use bounding boxes for more fine-grained and coherent image analysis, while using only a simple text encoder trained from scratch.
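As an illustration of how such off-the-shelf baselines can be scored, the sketch below embeds two captions with a TF Hub Universal Sentence Encoder module and compares them with cosine similarity. The module handle is an assumption (it points at the TF2-format release, version 3, whereas the footnoted module above is version 1), and this is not the exact evaluation code used for the reported numbers:

```python
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers ops required by the multilingual USE

# Assumed module handle; the paper's footnote names
# universal-sentence-encoder-multilingual-large/1.
embed = hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")

captions = ["A tennis player swinging a racket at a ball.",
            "A man playing tennis with a crowd watching."]
vecs = embed(captions).numpy()
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
print("cosine similarity:", float(vecs[0] @ vecs[1]))
```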

Dual Encoder Baselines
We also consider several neural baseline models, all of which are dual encoders (Gillick et al., 2018; Yang et al., 2019) that encode the two inputs separately. Dual encoder models have proven to be an effective approach to learning strong semantic representations (Cer et al., 2018; Chen et al., 2020a,b). They are often trained using an in-batch sampled softmax loss, as this has been observed to converge quickly and perform well on retrieval tasks (Gillick et al., 2018; Yang et al., 2019). We employ the bidirectional in-batch sampled softmax loss (Eq. 1):

$\mathcal{L} = -\frac{1}{K}\sum_{i=1}^{K}\left[\log\frac{e^{S(l_i, r_i)}}{\sum_{j=1}^{K} e^{S(l_i, r_j)}} + \log\frac{e^{S(l_i, r_i)}}{\sum_{j=1}^{K} e^{S(l_j, r_i)}}\right]$   (1)

where S(x, y) is the dot product of the embeddings of examples x and y and K is the batch size. This loss encourages the score of a correct pair S(l_i, r_i) to be higher than the scores of non-matching input pairs from the batch S(l_i, r_j). Unlike full cross-attention models, this architecture enables large-scale retrieval through approximate nearest neighbor search.

We train dual encoders for caption-image and caption-caption tasks, as well as a multitask model that combines both tasks. We use EfficientNet-B4 (Tan and Le, 2019b) (pre-trained on ImageNet) as our image encoder; it yields a 1792-dimensional representation. The text encoder employs a frozen BERT-Base model (Devlin et al., 2019) followed by three transformer layers. The additional transformer layers have 8 attention heads, a hidden dimension of 3072, and, like BERT-Base, output 768-dimensional token-level features. We use the features at the 0th token position of the final layer as the caption representation. BERT parameters are initialized from the public BERT checkpoint; the additional, trainable transformer layers are randomly initialized.

We construct three dual encoder models from these base encoders. (1) A Text-Text model (DE_T2T) uses a shared text encoder for both sides. (2) An Image-Text model (DE_I2T) uses the aforementioned text and image encoders, and includes a layer above the text encoder to project its 768 dimensions to 1792 (to match the image encoder output). (3) A Multitask model (DE_T2T+I2T) is trained on a combination of tasks (Chidambaram et al., 2019). It shares DE_I2T's architecture and is trained in the same way; however, its loss is a weighted sum of the image-text (i2t, t2i) and text-text (t2t) losses:

$\mathcal{L}_{multi} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i} + c\,\mathcal{L}_{t2t}$

Here c is a scalar controlling the relative weight of the losses from each task. This model has one text encoder, shared between all retrieval tasks. For hyperparameter tuning and training setup, see the appendix.
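For concreteness, a minimal numpy sketch of the bidirectional in-batch loss in Eq. 1 and one plausible form of the multitask combination; the function and argument names are illustrative, and the exact weighting used in training may differ from this assumed form:

```python
import numpy as np
from scipy.special import log_softmax

def bidirectional_in_batch_loss(left, right):
    """left, right: [K, d] embeddings of the two sides of K aligned pairs.
    Matching pairs sit on the diagonal of the score matrix."""
    scores = left @ right.T                      # S(l_i, r_j)
    l2r = np.diag(log_softmax(scores, axis=1))   # log P(r_i | l_i)
    r2l = np.diag(log_softmax(scores, axis=0))   # log P(l_i | r_i)
    return -np.mean(l2r + r2l)

def multitask_loss(cap_emb, img_emb, cap_emb_a, cap_emb_b, c=1.0):
    """Weighted sum of the image-text and text-text in-batch losses,
    with a single scalar c as described above (assumed combination)."""
    l_img_text = bidirectional_in_batch_loss(img_emb, cap_emb)   # i2t + t2i
    l_text_text = bidirectional_in_batch_loss(cap_emb_a, cap_emb_b)
    return l_img_text + c * l_text_text
```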

Results
Intermodal Retrieval Table 2 reports image→text and text→image retrieval results. Figure 7 shows three examples of images retrieved for caption queries. The CxC annotations capture missing examples in the first two cases, and the last two show there are still more positive pairs that remain unassociated in CxC. Additional examples are shown in Figure 8.

Intramodal Retrieval Tables 3 and 4 report Text→Text and Image→Image retrieval results, using positive pairs derived from the STS and SIS ratings respectively. USE mling-large is a strong baseline for Text→Text, but all the cross-modal models beat USE mling-large by a wide margin, likely due to learning on in-domain captions. InceptionV3 and ResNet-152 prove surprisingly weak for Image→Image, but SimCLRv2 proves to be a strong unimodal baseline for this task. The cross-modal models nevertheless beat SimCLRv2 by a wide margin, even though none were trained on image-image retrieval directly. In terms of joint intra- and inter-modal learning, the multitask DE_T2T+I2T model provides strong, balanced performance: it is close to DE_T2T for Text→Text, close to DE_I2T for Image→Image, and far outperforms DE_I2T for Text→Text. The strong performance is especially notable considering that DE_I2T and DE_T2T+I2T have the same model capacity.

Table 5 shows Spearman's r bootstrapped correlation for all models with respect to CxC's STS, SIS and SITS scores. Overall, VSE++, VSRN-github and DE_T2T+I2T perform better than unimodal baselines, but interesting further patterns emerge. Despite being much worse for retrieval, VSE++ actually beats VSRN-github on STS and SIS; however, its low SITS score indicates it fails to bridge the two modalities as well. The correlation scores also show that DE_I2T is too focused on images: it has the highest SIS (81.3), but worse STS (50.9) than even BoW (55.1). Adding the text-text loss to DE_I2T training, i.e. DE_T2T+I2T, produces much more balanced overall performance. On SIS, SimCLRv2 is stronger than all cross-modal models except DE_I2T. SITS scores appear to rank all models similarly to retrieval (Table 2). The fact that DE_T2T+I2T is better than both DE_T2T and the unimodal baselines for STS and Text→Text retrieval is encouraging, and it demonstrates the value of having a single set of annotations covering the relatedness of a common set of images and captions. We expect that a multitask model which also uses image-image training pairs could demonstrate gains across all tasks; such measurements are made possible by the CxC annotations (especially the new image-image associations).

Conclusion
The CxC dataset provides a much more complete set of relationships between and among images and captions than the raw MS-COCO image-caption pairs. We demonstrate that a dual encoder that learns from both image-caption pairs and caption-caption pairs (DE_T2T+I2T) exhibits strong, balanced performance across four retrieval tasks and three correlation measures. There is much remaining headroom for future models on all these tasks.
CxC's annotations themselves validate the strong semantic alignment between images and their original captions-these have an average similarity of 4.85. However, we also find that co-captions (captions for the same image) have an average score of just 3.0. This calls into question the use of such pairs in training and evaluating paraphrase generation models (Gupta et al., 2018) and reinforces the need for images as context for human evaluation in paraphrasing (Wang et al., 2019).

CxC Annotations - Collection and Analysis
We present additional details of the human annotation process. To annotate the CxC dataset, in-house annotators were employed: 170 (74 men, 96 women) for STS, 61 (28 men, 33 women) for SIS and 113 (46 men, 67 women) for SITS. All were aged between 20 and 35 years. The annotators were paid hourly wages that are competitive for their locale, and they have standard rights as contractors. All were fluent English speakers. We define separate annotation interfaces for each of the Semantic Textual Similarity (STS), Semantic Image Similarity (SIS) and Semantic Image-Text Similarity (SITS) tasks. We define a similarity scale ranging from 0 to 5 for all three tasks, following Cer et al. (2017).
We conducted a few pilot annotation rounds with the annotators to evaluate the effectiveness of the annotation instructions and the annotation interface. We learned that allowing the annotators to rate on a continuous 0-5 scale, instead of a discrete one as in STS, resulted in higher correlation between the individual ratings. As a result, we decided to use continuous ratings in the task but still keep the similarity definition for each discrete value in the annotation instructions. The final annotation interfaces are illustrated in Figure 4 (STS), Figure 5 (SIS) and Figure 6 (SITS). Task-specific high-level instructions are displayed at the top in an expandable text box, followed by a pair of examples. At the bottom there is a sliding bar with the 0-5 score instructions along the scale. Since the SITS instructions are longer, they are shown when the annotator hovers over the corresponding score, to improve readability for this task.
Each annotator is required to evaluate the displayed example based on the instructions and score it. The sliding scale makes it intuitive for annotators to rate an example 2.87 if they feel the semantic similarity of the pair lies between the score descriptions for 2 and 3, leaning towards 3. Finally, the annotator response is recorded when they click the submit button at the bottom of the page. The absolute score is deliberately not displayed, so that annotators are not distracted into aiming for a clean integer value like 3.0 instead of 2.94 or 2.97.
The annotators were able to get a better grasp of the task through the pilot annotations and got quicker at scoring the pairs. In the final round of annotations, they took an average of 37, 17 and 17 seconds per example for the STS, SIS and SITS tasks respectively. Table 2 describes the instructions shared with the annotation workers for each task; the side-by-side comparison shows how each rating on the SIS and SITS scales compares to the STS benchmark. Figure 1 shows a set of SIS and SITS examples for the 0-5 rating scale shared along with the instructions. Table 1 contains the breakdown of the number of annotations per task per split. Figure 2 shows the distribution of the standard deviation of raw annotations for each item per task. For STS, there is larger overall deviation compared to the other two tasks; pairs of short captions seem to leave more ambiguity and are open to broader interpretation than when at least one image is involved. Note also that SITS is expected to have lower deviation because of the sampling based on STS and SIS annotations.

Figure 3 shows the basic architecture of the dual encoder models from Section 5.2, which establish strong baselines on all the retrieval and correlation tasks. The image encoder and text encoder are both pre-trained in all experiments. Following Ilharco et al. (2019), we pretrain our dual encoders on the Conceptual Captions dataset (Sharma et al., 2018) with image-to-caption and caption-to-image losses. Conceptual Captions contains 3.3 million pairs of images and captions, far larger than MS-COCO. Pre-training uses the Adam optimizer (β1 = 0.9, β2 = 0.999) and a learning rate that starts at 1e-4 and decays by 0.1% every 1000 steps. We stop pre-training after ≈30k steps and select the checkpoint that maximizes R@10 on a held-out set. We then fine-tune this checkpoint on MS-COCO using the same hyperparameters, except for a smaller learning rate of 5e-6.
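As a rough illustration of these optimizer settings, the Keras-style sketch below reads "decays by 0.1% every 1000 steps" as a staircase exponential decay with rate 0.999; that reading, and the exact API used, are assumptions rather than the authors' training code:

```python
import tensorflow as tf

# Assumed interpretation: multiply the learning rate by 0.999 at every
# 1000-step boundary during Conceptual Captions pre-training.
pretrain_lr = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=1000,
    decay_rate=0.999,
    staircase=True)

pretrain_opt = tf.keras.optimizers.Adam(
    learning_rate=pretrain_lr, beta_1=0.9, beta_2=0.999)

# Fine-tuning on MS-COCO keeps the same hyperparameters except for the
# smaller constant learning rate of 5e-6.
finetune_opt = tf.keras.optimizers.Adam(
    learning_rate=5e-6, beta_1=0.9, beta_2=0.999)
```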

Training Setup
Our models are trained on 32-core slices of Cloud TPU V3 pods, with a per-replica batch size of K = 64 during both pre-training and fine-tuning. Because the in-batch sampled softmax loss is known to perform best when computed over a large number of negative samples (Gillick et al., 2018), our training setup pools image and caption encodings from all replicas before computing the loss. That is, each replica computes l and r for its local minibatch and broadcasts them to all others to be used as negative samples. Training with N cores thus allows the loss to be computed over the global batch of N × K examples.
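A hedged sketch of this cross-replica pooling, written with tf.distribute's replica context; the handling of gradients through the gathered tensors is elided, and the function and variable names are illustrative rather than the actual training code:

```python
import tensorflow as tf

def replica_loss(local_l, local_r):
    """Per-replica bidirectional loss with negatives pooled from all replicas.
    local_l, local_r: [K, d] encodings computed on this replica."""
    ctx = tf.distribute.get_replica_context()
    k = tf.shape(local_l)[0]
    # Broadcast encodings so every replica sees the global batch [N*K, d].
    global_l = ctx.all_gather(local_l, axis=0)
    global_r = ctx.all_gather(local_r, axis=0)
    # This replica's positives sit at an offset inside the global batch.
    offset = ctx.replica_id_in_sync_group * k
    labels = tf.range(k) + offset
    logits_l2r = tf.matmul(local_l, global_r, transpose_b=True)  # [K, N*K]
    logits_r2l = tf.matmul(local_r, global_l, transpose_b=True)
    # Note: a full implementation must also route gradients back through
    # the gathered negatives; that plumbing is omitted in this sketch.
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels, logits_l2r) +
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels, logits_r2l))
```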

Ablation Experiments
Our model architecture and training setup differ from prior work in key ways. In particular, the best known results for VSE++ and VSRN are from models that were trained with much smaller batch sizes, did not undergo Conceptual Captions pre-training, and had different image encoder architectures. To evaluate the effect of these factors, we trained variants of our DE_I2T model (here, the baseline training recipe) with the following one-off ablations:
• The small batch size ablation reduces the training batch size to 128 examples, to match that of VSE++ and VSRN in Faghri et al. (2018) and Li et al. (2019), respectively.
• The no pre-training ablation skips Conceptual Captions pre-training and trains directly on MS-COCO.
• The ResNet-152 ablation uses the same recipe as the baseline, but replaces the EfficientNet-B4 image encoder with ResNet-152, which was used in VSE++. Notably, EfficientNet-B4 has fewer parameters than ResNet-152, but achieves higher classification accuracy on ImageNet.

Table 3 summarizes the performance of the ablated models. Reducing the batch size causes a small but consistent reduction in recall across all tasks. Removing Conceptual Captions pre-training leads to larger regressions on all tasks, except on Text-Text retrieval, where results are curiously better than the baseline. Likewise, models using the ResNet-152 image encoder perform worst overall, but also perform (slightly) better than the baseline on Text-Text retrieval.

Table 2: Annotation instructions for each score level, compared across the STS, SIS and SITS tasks.

Score 5. STS: The texts are completely equivalent as they mean the same thing. SIS: The scenes are near duplicates, possibly being viewed from a different perspective. SITS: The image and sentence are perfectly matched; the sentence is an almost perfect description for the image.
Score 4. STS: The texts are mostly equivalent but some unimportant details differ. SIS: The two scenes are mostly equivalent, but some unimportant details differ, such as involving different but the same or highly similar types of participants, actions, objects and background. SITS: The image and sentence are mostly matched, but some unimportant details differ, such as involving different but the same or highly similar types of participants, actions, objects and background; the text can partially describe the image.
Score 3. STS: The texts are roughly equivalent but some important information differs or is missing. SIS: The two scenes are roughly equivalent, but some important details are different or missing, such as involving a notable difference in the types of participants, actions, objects or background. SITS: The image and sentence are roughly matched, but some important details are different or missing, such as involving a notable difference in the types of participants, actions, objects or background; the image cannot be described using the text.
Score 2. STS: The texts are not equivalent but share some details. SIS: The two scenes are not equivalent, but share some details in terms of the types of participants, actions, objects or background. SITS: The image and sentence are not matched, but share some details in one or more of the types of participants, actions, objects or background.
Score 1. STS: The texts are not equivalent but are on the same topic. SIS: The two scenes are not equivalent, but are loosely thematically related. SITS: The image and sentence are not matched, but are loosely thematically related.
Score 0. STS: The texts are on different topics. SIS: The two scenes are completely dissimilar. SITS: The image and sentence are completely unmatched.
Overall, we conclude that pre-training and the choice of image encoder architecture have large effects on model performance; large-batch training is beneficial, but has a smaller impact. Finally, the asymmetric shifts in task performance suggest models make implicit trade-offs based on the relative difficulty of each task; here, that difficulty appears to be a function of encoder strength and quantity of training data. Understanding these dynamics, and building models that perform well across all tasks, requires future study. Crisscrossed Captions enables such work by giving a more complete picture of model quality on both intra- and inter-modal tasks.