CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.


Introduction
For most text generation tasks, reference-based n-gram overlap methods are still the dominant means of automatic evaluation. For image caption generation, recent reference-based metrics have sought to transcend overlap by considering richer models of reference-candidate similarity: e.g., approximate scene graphs, or allowing reference-based methods to incorporate the image (Jiang et al., 2019; Lee et al., 2020). But references can be expensive to collect, and comparing against even multiple human-authored captions for each image is often insufficient (see Figure 1). As a result, for many corpora, a significant gap remains between reference-based scoring and human quality judgments. Should we need references for the evaluation of image captions? After all, when humans assess the appropriateness of an image caption, we do so just by looking at the image and reading the candidate's text.
Figure 1: Top: CLIPScore uses CLIP to assess image-caption compatibility without using references, just like humans. Bottom: This frees CLIPScore from the well-known shortcomings of n-gram matching metrics, which disfavor good captions with new words (top) and favor any captions with familiar words (bottom). Reference captions shown in the figure:
- Two dogs are running towards each other across the sand.
- Two dogs are running towards each other on a beach.
- Two dogs run toward each other.
Attribution: Paperclip, robot icons by Hasanudin, Adiyogi (resp.) from the Noun Project.
A recent trend in machine translation serves as inspiration: there, a key hurdle for reference-free evaluation (sometimes called quality estimation) has been estimating cross-lingual similarity between source+candidate pairs (Blatz et al., 2004; Specia et al., 2010; Mehdad et al., 2012; Specia and Shah, 2018). But recent work (Lo, 2019; Yankovskaya et al., 2019; Zhao et al., 2020) has improved correlation with human judgment not by gathering more monolingual references, but instead by utilizing cross-lingual representations learned by large-scale, pre-trained, multilingual models, e.g., LASER (Artetxe and Schwenk, 2019) or M-BERT (Devlin et al., 2019). We hypothesize that the relationships learned by pretrained vision+language models (e.g., ALIGN (Jia et al., 2021) and CLIP (Radford et al., 2021)) could similarly support reference-free evaluation in the image captioning case. Indeed, they can: we show that a relatively direct application of CLIP to (image, generated caption) pairs results in surprisingly high correlation with human judgments on a suite of standard image description benchmarks (e.g., MSCOCO (Lin et al., 2014)). We call this process CLIPScore (abbreviated to CLIP-S). Beyond direct correlation with human judgments, an information gain analysis reveals that CLIP-S is complementary both to commonly reported metrics (like SPICE and CIDEr) and to newly proposed reference-based metrics (e.g., ViLBERTScore-F (Lee et al., 2020)).
We additionally (1) propose a reference-augmented version of CLIPScore, RefCLIPScore, that achieves even higher human correlation; (2) verify that CLIP-S is sensitive to adversarially constructed image captions, where one noun-phrase has been swapped for a plausible (but incorrect) distractor; and (3) construct a corpus of images that have never been posted publicly online to verify that CLIP-S is able to reconstruct human judgments on never-before-seen images.
Finally, we assess CLIP-S in the context of four case studies that diverge from context-free, literal photograph description. In two cases, CLIP-S works well: it achieves high correlation with alt-text quality rating on Twitter, and demonstrates surprising capacity to reason about clip-art images+captions. For news caption generation, reference-based methods correlate best with human judgments. And, for emotive captions inspired by language use on social media, even reference-based metrics fall short.

Related Work
Reference-only image caption evaluation In general, image caption generation models are evaluated by a suite of 5 reference-based metrics: BLEU-4 (Papineni et al., 2002) (which measures a version of precision between a candidate and the references), ROUGE-L (Lin, 2004) (which measures a version of recall), METEOR (Banerjee and Lavie, 2005) (which computes a word-level alignment), CIDEr (Vedantam et al., 2015) (which combines n-gram tf-idf weighting and stemming) and SPICE (which applies a semantic parser to a set of references, and computes similarity using the predicted scene graph). Yi et al. (2020) give a method for re-weighting BERTScore (Zhang et al., 2020) specifically tuned to the image caption generation domain (we refer to their method as BERT-S++).
Reference+image caption evaluation Recent metrics incorporate image-text grounding features in addition to references: TIGEr (Jiang et al., 2019) uses a pretrained SCAN model (Lee et al., 2018), and ViLBERTScore-F (Lee et al., 2020) uses a pretrained ViLBERT model (Lu et al., 2019) that is also fine-tuned on 12 downstream vision and language tasks (Lu et al., 2020). Our work provides perspective on the next logical extension: instead of incorporating visual-textual interactions in addition to references, can we ignore references entirely?
Self-retrieval for image captioning Prior works have proposed incorporating a self-retrieval loss into caption generation, with the intuition that good captions should be able to uniquely identify their images with high accuracy (Dai and Lin, 2017;Luo et al., 2018;Liu et al., 2018); monitoring this type of loss can provide insight into how distinctive the captions are according to the model itself. CLIP-S is similar in spirit, but distinct for its utility as an extrinsic evaluation metric like BLEU-4 or CIDEr.
Reference-free evaluation In addition to the machine translation cases highlighted in the introduction, reference-free evaluations have been proposed for other generation tasks, including summarization (Louis and Nenkova, 2013;Peyrard and Gurevych, 2018;Sun and Nenkova, 2019) and dialogue (Tao et al., 2018;Mehri and Eskenazi, 2020). These metrics can be supervised, relying on human judgments for quality estimation, or less-supervised, relying on pre-trained model representations. For image captioning, a version of VIFIDEL (Madhyastha et al., 2019) was proposed for reference-free evaluation; however, VIFIDEL, computed based on a list of detected objects in the image from a fixed object vocabulary, generally produces less correlation with human ratings vs. reference-based metrics.

CLIPScore
Model Details. CLIP (Radford et al., 2021) is a cross-modal retrieval model trained on 400M (image, caption) pairs gathered from the web. 500K search queries, consisting of common unigram/bigrams, named entities, etc., were executed on a search engine. For each query, up to 20K (image, caption) pairs were collected.
The model we use is the ViT-B/32 version. It represents images via a Vision Transformer (Vaswani et al., 2017; Dosovitskiy et al., 2021), which forgoes convolutional filters in favor of self-attention maps computed between a 7 by 7 grid of image patches, which evenly divides a 224 by 224 pixel input image. This model has 12 transformer layers and 86M parameters. The text is similarly represented by a 12-layer transformer trained over a vocab of 49K BPE token types (Sennrich et al., 2016) (and is more fully described in Radford et al. (2019)). Both the text and image networks output a single vector; these vectors aim to represent the content of an input caption or an image, respectively. In the case of ViT-B/32, these vectors are 512-D. The model's weights are trained to maximize the scaled cosine similarity of truly corresponding image/caption pairs while simultaneously minimizing the similarity of mismatched image/caption pairs using InfoNCE (Sohn, 2016; Oord et al., 2018). We hold fixed this set of weights for our experiments.
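To make the training objective concrete, the following is a minimal sketch of an InfoNCE-style contrastive loss over a small batch of (image, caption) embedding pairs. It is an illustration only, not CLIP's training code: the function names, the toy 2-D vectors, the fixed `scale` value, and the choice to average the image-to-text and text-to-image directions are all assumptions made here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_matrix(image_vecs, text_vecs):
    """Entry [i][j] is the cosine similarity of image i and caption j."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(t) for t in text_vecs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def symmetric_info_nce(image_vecs, text_vecs, scale=10.0):
    """Contrastive loss: matched pairs sit on the diagonal of the batch.

    `scale` is a hypothetical fixed temperature used for illustration.
    """
    sims = cosine_matrix(image_vecs, text_vecs)
    logits = [[scale * s for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Toy batch: caption k is aligned with image k, so the loss is small.
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[2.0, 0.0], [0.0, 3.0]]
loss = symmetric_info_nce(images, captions)
```

Swapping the two captions in the toy batch puts the high similarities off the diagonal, and the loss grows accordingly; this is the sense in which the objective pulls matched pairs together and pushes mismatched pairs apart.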

Evaluating Caption Generations with CLIP.
To assess the quality of a candidate generation, we pass both the image and the candidate caption through their respective feature extractors. Then, we compute the cosine similarity of the resultant embeddings. We found that prefixing candidates with the prompt: "A photo depicts" improved correlations slightly (and is our recommended/standard configuration), though "A photo of", the recommended prompt from Radford et al. (2021), worked well too. Following Zhang et al. (2020), we perform a re-scaling operation. For an image with visual CLIP embedding v and a candidate caption with textual CLIP embedding c, we set w = 2.5 and compute CLIP-S as:
$$\text{CLIP-S}(c, v) = w \cdot \max(\cos(c, v), 0)$$
To compute corpus-level CLIP-S, we simply average over (candidate, image) pairs. Note that this evaluation does not depend on underlying references. The runtime of CLIP-S with the ViT-B/32 backbone is fast: on our single consumer GPU and hard drive, roughly 4K image-candidate pairings can be processed per minute.
RefCLIPScore CLIP-S can additionally be extended to incorporate references, if they are available. We extract vector representations of each available reference by passing them through CLIP's text transformer; the result is the set of vector representations of all references, R. Then, RefCLIPScore is computed as a harmonic mean of CLIP-S and the maximal reference cosine similarity, i.e.,
$$\text{RefCLIP-S}(c, R, v) = \text{H-Mean}\left(\text{CLIP-S}(c, v),\ \max\left(\max_{r \in R} \cos(c, r),\ 0\right)\right)$$
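The two scores can be sketched in a few lines once embeddings are available. The sketch below assumes that the re-scaling multiplies the cosine similarity by w after clipping negative values to zero, and that the reference term is likewise clipped at zero. The function names and the toy 2-D vectors are hypothetical; in practice the vectors would be the 512-D outputs of CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_s(image_vec, caption_vec, w=2.5):
    """Reference-free score: w * max(cos(c, v), 0)."""
    return w * max(cosine(image_vec, caption_vec), 0.0)


def ref_clip_s(image_vec, caption_vec, reference_vecs, w=2.5):
    """Harmonic mean of CLIP-S and the best (clipped) reference similarity."""
    image_term = clip_s(image_vec, caption_vec, w)
    ref_term = max(max(cosine(caption_vec, r) for r in reference_vecs), 0.0)
    if image_term == 0.0 or ref_term == 0.0:
        return 0.0  # the harmonic mean is zero if either term is zero
    return 2.0 * image_term * ref_term / (image_term + ref_term)


def corpus_clip_s(pairs, w=2.5):
    """Corpus-level score: plain average over (image, candidate) pairs."""
    scores = [clip_s(image_vec, caption_vec, w) for image_vec, caption_vec in pairs]
    return sum(scores) / len(scores)


# Toy embeddings standing in for CLIP outputs.
image = [1.0, 0.0]
candidate = [1.0, 0.0]
references = [[1.0, 0.0], [0.0, 1.0]]
score = clip_s(image, candidate)                       # 2.5 (cosine is 1)
ref_score = ref_clip_s(image, candidate, references)   # H-Mean(2.5, 1.0)
```

Because the harmonic mean is dominated by the smaller term, a candidate must be compatible with both the image and at least one reference to receive a high RefCLIP-S.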

Benchmark Captioning Evaluations
We first evaluate on a set of literal description corpora. Broadly, the captions in these corpora aim to identify and highlight the literal, salient objects/actions in a photographic image, presented without additional context.

Caption-level Likert judgments
We first explore three corpora consisting of human Likert-scale judgments at the level of individual image/caption pairs. Flickr8K-Expert (Hodosh et al., 2013) contains 17K "expert" human judgments between 5664 images: humans graded captions on a scale of 1 to 4 (4="caption describes the image without any errors"; 1="caption is unrelated to the image"). Flickr8K-CF is a set of 145K binary quality judgments gathered from CrowdFlower over 48K (image, caption) pairs (1K unique images). Each pair has at least 3 binary judgments, and we take the mean proportion of "yes" annotations as a score for each pair to compute correlations. Composite (Aditya et al., 2015) contains 12K human judgments between images from MSCOCO (2007 images), Flickr8k (997 images), and Flickr30k (Young et al., 2014) (991 images). Each image originally has five references, but one of the references was selected to be rated by humans in the set (and so we remove it from the reference set when computing metrics; this differs from some prior work, see Appendix A for why we consider the more difficult setting). For Composite and Flickr8K judgments, we compute correlation between each metric and the human ratings using Kendall τ.
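Because Likert ratings are discrete, ties are common, and the Kendall variants differ in how they normalize for them. The following pure-Python sketch implements the standard τ_b and τ_c definitions with a simple O(n²) pair count; it is meant as an illustration of the two variants rather than the exact evaluation code, and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def concordance(x, y):
    """Count concordant and discordant pairs; tied pairs count as neither."""
    concordant = discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return concordant, discordant


def tied_pairs(values):
    """Number of pairs that share the same value."""
    return sum(c * (c - 1) // 2 for c in Counter(values).values())


def kendall_tau_b(x, y):
    """tau_b: normalizes by the pairs that are not tied in x and in y."""
    c, d = concordance(x, y)
    n0 = len(x) * (len(x) - 1) // 2
    return (c - d) / math.sqrt((n0 - tied_pairs(x)) * (n0 - tied_pairs(y)))


def kendall_tau_c(x, y):
    """tau_c: normalizes by n^2 (m - 1) / m, with m the smaller number of distinct values."""
    c, d = concordance(x, y)
    n = len(x)
    m = min(len(set(x)), len(set(y)))
    return 2.0 * (c - d) / (n * n * (m - 1) / m)


# Toy example: continuous metric scores against discretized human ratings.
metric_scores = [1, 2, 3, 4]
human_ratings = [1, 1, 2, 2]
tau_b = kendall_tau_b(metric_scores, human_ratings)
tau_c = kendall_tau_c(metric_scores, human_ratings)
```

On this toy input the two variants give different values even though no pair is discordant, which is why mixing them across papers makes reported numbers hard to compare.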

Results
The results for Flickr8K-Expert are given in Table 1, for Flickr8K-CF are given in Table 2 (in τ_b, following Cui et al. (2018)), and for Composite are given in Table 3. CLIP-S, which uses no references, achieves higher correlation with human judgment compared to previously proposed metrics that rely on references. Additionally, in all cases, RefCLIP-S improves correlation even further. This provides strong evidence that, in terms of correlating with human judgment at the caption-level for these literal photographic image description tasks, a relatively direct application of CLIP can serve as a strong automatic evaluation metric.

Pairwise ranking on Pascal-50S
In Pascal-50S (Vedantam et al., 2015), raters made pairwise preference judgments between pairs of sentences. There are 4K sentence pairs total, split evenly across four categories, e.g., two human captions, two machine captions, etc. For each pair, 48 human pairwise judgments were gathered. Following prior work, instead of computing correlation coefficients, we compute accuracy, i.e., we consider the caption preferred by a majority of annotators to be correct, and measure how often the evaluation metric assigns a higher score to that member of the pair. Ties are broken randomly. Due to random selection of 5 references among the 48 candidates to serve as ground-truth for the reference-based metrics, the results may differ slightly from prior work (we average over 5 random draws of references). The results are given in Table 4. Evaluation is split across four categories of caption pairs (detailed in the table caption). CLIP-S and RefCLIP-S generally achieve high performance in all categories.
Table 3: Composite correlations with human judgment. All metrics use between 4 and 5 ground truth references, except for CLIP-S (which uses none). In contrast to some prior work, we consider a harder setting, and remove the candidate from the reference set (see Appendix A for details; for comparison purposes, RefCLIP-S achieves τ_c = 60.0 in the easier setting). * indicates a result reported in prior work.
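The pairwise accuracy protocol can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, a human vote split exactly in half is treated here as not preferring the first caption, and metric ties are broken with a seeded random generator.

```python
import random


def majority_prefers_first(votes_for_first, total_votes):
    """True if a strict majority of annotators preferred the first caption."""
    return votes_for_first * 2 > total_votes


def pairwise_accuracy(score_pairs, human_prefers_first, rng):
    """Fraction of pairs where the metric ranks the human-preferred caption higher.

    score_pairs: list of (score_first, score_second) from the metric.
    human_prefers_first: list of booleans from the majority vote.
    rng: random.Random instance used only to break metric ties.
    """
    correct = 0
    for (score_a, score_b), prefers_a in zip(score_pairs, human_prefers_first):
        if score_a == score_b:
            metric_prefers_a = rng.random() < 0.5  # ties are broken randomly
        else:
            metric_prefers_a = score_a > score_b
        correct += metric_prefers_a == prefers_a
    return correct / len(score_pairs)


# Toy example with three caption pairs and 48 votes per pair.
scores = [(0.9, 0.1), (0.2, 0.8), (0.6, 0.3)]
votes_for_first = [40, 10, 20]
labels = [majority_prefers_first(v, 48) for v in votes_for_first]
accuracy = pairwise_accuracy(scores, labels, random.Random(0))
```

For reference-based metrics, the same accuracy would be recomputed for each random draw of references and then averaged, as described above.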

System-level correlation for MSCOCO
CLIP-S achieves high correlation with human judgments at the system-level as well: we evaluate the outputs of systems submitted to the 2015 MSCOCO Image Captioning Challenge (Vinyals et al., 2016). We have some concerns with the standard evaluation setup on this corpus, mostly related to the fact that it consists of only 12 datapoints (see supplementary for more discussion). Nonetheless, following the standard procedure, we correlate CLIP-S and RefCLIP-S with two metrics: "the percentage of captions that are evaluated as better or equal to a human caption (M1)" and percentage of captions that pass the "Turing Test" (M2), respectively. CLIP-S achieves Spearman ρ_M1/ρ_M2 = .59/.63 and RefCLIP-S achieves ρ_M1/ρ_M2 = .69/.74 (all p < .05) with these system-level metrics.

Sensitivity of CLIP-S to hallucination
Prior work has demonstrated that, for many literal description tasks, humans often prefer correctness in captions over specificity (Rohrbach et al., 2018, 2017). Thus, understanding if and how evaluation metrics handle image captions that contain incorrect "hallucinations," e.g., references to objects that are not depicted, is important. We use a sample of image captions from the FOIL dataset, constructed by Shekhar et al. (2017), to test how sensitive CLIP-S is to detecting potentially subtle inaccurate details in descriptions. This corpus consists of modified reference captions from MSCOCO that have a single noun-phrase adversarially swapped out to make the FOIL caption incorrect, e.g., switching "motorcycle" for "bicycle". To adapt the corpus to our setting, for each of the 32K test images, we sample a (FOIL, true) pair, and compute the accuracy of each evaluation metric in its capacity to assign a higher score to the true candidate versus the FOIL. To compute reference-based metrics, we give access to the MSCOCO reference captions for the image (excluding the true candidate being assessed against the FOIL). While the paired setting we consider isn't identical, Shekhar et al. (2017) estimate roughly 92% human agreement on the unpaired version of the task, relative to a 50/50 random guessing baseline. Table 5 contains the results. In this setting, having access to more annotation is quite helpful for reference-based metrics, e.g., the accuracies of SPICE and BLEU-4 increase by over ten points when shifting from one to four references. But in the reference-limited setting, CLIP-S, without any references, [...] Overall, we corroborate Rohrbach et al. (2018)'s finding that "object hallucination can not be always predicted based on the traditional sentence metrics" using a corpus derived from Shekhar et al. (2017), particularly in the case where there are few references available. However, CLIP-S and RefCLIP-S offer a performance improvement in the pairwise setting.

Sensitivity of CLIP-S to memorization
One concern with model-based scoring methods is memorization, i.e., if a model's weights are pretrained using a large corpus, there's a risk that data used at evaluation time have already been seen at pretraining time. While Radford et al. (2021) conduct a train-test overlap analysis and find that CLIP is unlikely to succeed because of memorization, we nonetheless conduct an experiment with images CLIP has never seen before.
The authors of this work created a set of 250 images that have never been posted to the Internet by aggregating personal photographs. The set contains a variety of Flickr-like situations, e.g., nature scenes, animals, city streets, objects, etc. For each image, we collect two automatically generated captions: one from a commercial API, Microsoft Azure Cognitive Services (v 3.1), and one from a Luo et al. (2018) baseline. Then, for each image, three authors of this work independently selected which caption described the image content more accurately. Relative to a 50% random baseline (and a 72% length baseline of selecting the shorter caption), CLIP-S correctly recovers majority human preference in 86% of cases. Human agreement for this corpus is 93%. While this setup cannot definitively refute the notion that CLIP works well because it has memorized images, we hope the results here contribute to the evolving discussion about the nature of generalization for web-scale pretrained models.

Which metrics should I report?
Most caption generation works report multiple metrics, each of which (presumably) correlates with human judgment to different degrees. But it's not always clear if individual metrics capture distinct or redundant dimensions of human judgment. For example, while CLIP-S and ViLBERTScore-F both produce high correlations, are they redundant or complementary?
We seek a (minimal) set of metrics that explains the most variance in human judgment. To find this set, we undertake a forward selection on a set of ten candidate metrics comprising six widely reported metrics and four newer metrics: BERT-S (RoBERTa-F), TIGEr, ViLBERTScore-F, and CLIP-S (we also include experiments starting with RefCLIP-S instead of CLIP-S). Starting from an empty set, we perform an iterative greedy selection by picking the most informative additional metric to add (see footnote 14). To estimate variance, we repeat the forward-selection process 10 times with bootstrap re-sampled versions of the corpus. Figure 2 shows the information gain that results from running this experiment on the Composite and Flickr8K-Expert corpora; we also show which metric is most commonly selected at each iteration (earlier = more information gain). For Composite, CLIP-S (or RefCLIP-S) is always selected first, followed by ViLBERTScore-F, and then (most commonly) BERT-S (RoBERTa-F). For Flickr8k-Expert, the top three choices are always CLIP-S (or RefCLIP-S), ViLBERTScore-F, and SPICE. CLIP-S and ViLBERTScore-F tend to be the most informative metrics, but (1) while they are correlated, they are not purely redundant; and (2) image-unaware, reference-based metrics like SPICE can still be useful.
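The greedy procedure can be sketched in pure Python as below. This is a simplified illustration, not the procedure used for the reported results: it scores each candidate by in-sample R² from an ordinary least-squares fit with an intercept, omits the cross-validation and bootstrap re-sampling, and assumes the candidate metrics are not perfectly collinear. All function names, metric names, and data values are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def r_squared(columns, target):
    """In-sample R^2 of a least-squares fit (with intercept) of target on columns."""
    n = len(target)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * t for row, t in zip(design, target)) for a in range(p)]
    beta = solve_linear_system(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean = sum(target) / n
    ss_res = sum((t - q) ** 2 for t, q in zip(target, pred))
    ss_tot = sum((t - mean) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


def forward_select(metrics, human, k):
    """Greedily add, k times, the metric that yields the largest R^2."""
    chosen, history = [], []
    remaining = set(metrics)
    for _ in range(k):
        best = max(
            sorted(remaining),
            key=lambda name: r_squared([metrics[m] for m in chosen + [name]], human),
        )
        chosen.append(best)
        remaining.remove(best)
        history.append((best, r_squared([metrics[m] for m in chosen], human)))
    return history


# Hypothetical per-caption scores for three metrics and the human ratings.
human = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
metrics = {
    "image_aware": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
    "overlap": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "noise": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}
history = forward_select(metrics, human, k=2)
```

Each entry of `history` pairs the selected metric with the cumulative R²; the increase from one entry to the next is the additional variance explained by the newly added metric.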
In summary, these results suggest that evaluation metrics like CLIP-S, which take into account visual content, indeed capture axes of human judgment not currently covered by text-only reference-based metrics. For the literal image description evaluation settings we consider, a reasonable mix of metrics to report is at least one image-aware metric (e.g., CLIP-S) plus a strong reference-only metric (e.g., SPICE).

Case Studies Using CLIPScore
Our results thus far have demonstrated that CLIP encodes information useful for evaluating literal image description tasks. But reference-based metrics may a priori seem more adaptable than CLIP-S. Does CLIP-S correlate with human judgment beyond cases like MSCOCO and Flickr8K?
To address this question, we consider four case studies, exploring the correlation between CLIP-S and human judgment across "divergent" image description datasets. These corpora qualitatively differ from the more popular domains explored in §4, either because the images are not "everyday" images from Flickr, or because the captions are not literal description (Figure 3 illustrates).

Alt-Text ratings from Twitter
When uploading an image alongside a tweet, users of Twitter have the option of providing alternative text: while few use this feature (Gleason et al. (2019) find that fewer than 0.1% of image tweets have alt-text), its broader adoption might someday make social media more accessible for low vision and blind users. We measure CLIP-S's capacity to reconstruct a set of 2.8K human judgments of alt-text quality. This corpus was collected and rated by the authors of Gleason et al. (2019, 2020). Each alt-text was rated on a scale of 0 to 3 in terms of its probable utility as an alt-text. While the human raters themselves are sighted and thus cannot directly assess the utility of a given alt-text to a low vision or blind user, they are experts in designing and evaluating alt-text systems. Tweets were sampled from a mix of the Twitter FireHose API and the timelines of low vision and blind users of the site. The images, qualitatively, are a broader mix of web content in comparison to Flickr-like domains, e.g., screenshots, memes, etc. Alt-text candidates are a mix of user-uploaded and machine-generated. The corpus contains no references, but for the purposes of comparison to reference-based metrics, we (programmatically) treat any textual context of the tweet as a reference. CLIP-S achieves 48.4 τ_c correlation with the human judgements. In contrast, likely due to the unreliability of Tweet texts as viable alt-texts, reference-based methods struggle: the best performing purely reference-based metric, BERT-S (RoBERTa-F) (which achieves 15 τ_c), under-performs relative to a length baseline (which achieves 25 τ_c). While gathering high-quality, contextual reference alt-texts is a promising avenue for future work, CLIP-S offers a promising evaluation metric candidate in this domain.
Figure 3: Instances from our four case-study corpora.
Footnote 14: Our criterion is how much additional R^2 correlation with human judgment a metric adds according to a linear regression. We use sklearn (Pedregosa et al., 2011)'s forward selection, which applies 5-fold cross-validation at each step.

Abstract-50S
We assess CLIP-S's capacity to generalize to abstract, non-photographic clip-art images using Abstract-50S (Vedantam et al., 2015). This dataset pairs clip-art images (originally constructed by Zitnick and Parikh (2013)) with 48 human-written reference captions. These images depict two cartoon characters, Mike and Jenny, in various outdoor situations, e.g., playing sports, having a picnic, etc. For 400 human-written candidate caption pairs (200 pairs are from the same image, 200 are from different images), human judgments were collected: annotators were instructed to choose which of the paired captions was more similar to each reference caption, so 48 judgments were collected for each candidate pair (for a total of 19200).
We compare CLIP-S to several reference-based metrics when given access to a random sample of five reference captions. Following our procedure for Pascal-50S, we randomly re-sample 5 times, and report average pairwise accuracy. Two baselines (BL) both achieve 53: length-only (i.e., saying the longer caption is better); and randomly shuffling images as input to CLIP-S (so that it cannot rely on meaningful visual-textual interactions). Overall, while CLIP-S underperforms relative to the reference-based metrics, it outperforms the baselines by a wide margin. This result suggests that CLIP-S is capable of reasoning about visual-textual interactions, even in non-photographic images.

Personality Captions
Inspired by language use on social media, Shuster et al. (2019) collected image captions by prompting annotators with a "personality" (e.g., dramatic, sympathetic, sad, etc.) and asking them to "write a comment in the context of [a] given personality trait... about an image that someone else would find engaging." To evaluate their models, the authors collected pairwise human judgments, where evaluators were instructed "to pick which comment is the most engaging". We assess CLIP-S in two capacities: (1) does it prefer literal descriptions, or the less-literal, more engaging, personality captions?; and (2) if it is given two personality captions, can it predict which humans judge to be more engaging?
For (1): Over a set of 2.4K "traditional" vs. personality captions pairwise ratings, humans rate the personality captions to be more engaging 65% of the time, whereas CLIP-S prefers the traditional 80% of the time (see footnote 16). Our takeaway: when given a direct description and a more engaging, non-literal caption, CLIP-S will generally prefer the literal.
For (2): CLIP-S performs slightly better than random, e.g., 57% over 2.5K human pairwise judgments comparing two neural generator models: TransResNet (ResNeXt-IG-3.5B) vs. TransResNet (ResNet-152) (see Shuster et al. (2019) Table 7, Row 5), but no better than a length-only baseline (also 57%). Notably, even reference-based metrics fail to provide correlation with pairwise human judgment of engagingness on this corpus: e.g., BLEU-4, CIDEr, and SPICE agree with human judgment 52%/53%/51% when provided with one personality-primed reference. Our takeaway: when given two engaging, non-literal descriptions, both CLIP-S and traditional reference-based metrics fail to predict which humans will judge to be more engaging.

News image captioning
Biten et al. (2019) consider caption generation for images from New York Times articles; their task differs from MSCOCO because 1) 95% of captions contain at least one named entity, e.g., a politician, celebrity, or place; and 2) captions generally "do not describe scene objects, but rather offer a contextualized interpretation of the scene." They collected 2.1K pairwise human judgments over 106 images that compare the performance of two news image captioning models. For each image, 20 annotators were instructed to pick which of two model generations was closer to the ground-truth caption (they were also presented with the image itself). We compare metrics in terms of their accuracy in matching human judgment between the two candidates.
Reference-based metrics dominate: METEOR and BLEU-4 achieve the highest accuracies of 93 and 91 respectively, whereas CLIP-S achieves only slightly above random at 65. Qualitatively, CLIP-S succeeds when there is visually verifiable content, e.g., matching black-and-white photos to older dates (e.g., picking 1933 vs. 1977, in one case), and matching particularly iconic celebrities (e.g., it confidently identifies Muhammad Ali boxing). But its most common failure cases are captions that may simply be unverifiable given only the image content. For example: CLIP-S selects "The dining room at Elle Decor" for an image of a room, but annotators preferred a caption that mentioned "the Junior League of New York;" the ground truth caption reveals why the image was pictured in the first place: "A Manhattan home on a May 7 tour by the Junior League of New York." Overall, we do not advocate for reference-free evaluation in this case, especially because our results suggest that (at least for this particular set of annotations) reference-based n-gram overlap metrics achieve high correlation with human judgment.
Footnote 16: Preliminary prompt-engineering experiments (e.g., "when I look at this photo, I feel [PERSONALITY] and think [CAPTION]") could not overcome this.

Conclusion
For literal image description tasks, CLIPScore achieves high correlation with human judgments of caption quality without references when used in an off-the-shelf fashion. Additional experiments in divergent domains suggest that CLIP can also reason about non-photographic clip-art, and serves as a reasonable option for reference-free evaluation in the alt-text case. Promising future work includes exploring 1) CLIP-S as a reinforcement learning reward for literal caption generators; and 2) whether a small amount of labelled human rating data could help CLIP-S adapt to domains where it struggles, e.g., engagingness prediction. We hope our work can contribute to the ongoing discussion about the role of pretrained models in generation evaluation.
Reference-free evaluation runs some risks. Much like BERTScore, model-based metrics like CLIP-S reflect the biases of the pre-training data. While we believe that using CLIP-S as an offline evaluation metric for literal caption quality accords with the recommendations of CLIP's model card (Mitchell et al., 2019), Agarwal et al. (2021)'s study demonstrates that CLIP can make disproportionate incorrect classifications of people, e.g., "male images were misclassified into classes related to crime." Exploring potential social biases of candidate generations (as in, e.g., Hendricks et al. (2018)) remains paramount, particularly if a system is to be deployed.
Contemporaneous work While this work was under submission, two alternate reference-free evaluation metrics for image caption generation were introduced: FAIEr (Wang et al., 2021) (based on a pretrained object detector, and fine-tuned on MSCOCO) and UMIC (Lee et al., 2021) (based on UNITER (Chen et al., 2020)). UMIC, in particular, produces similar correlations with human judgment on the literal image description tasks (§4) compared to CLIP-S, but with the complementary approach of fine-tuning on synthetic negative captions. Future work would be well-suited to explore if the textual data augmentations proposed by Lee et al. (2021) (1) result in a metric that complements or overlaps with the non-finetuned CLIP-S (§4.6); and (2) could be extended beyond cases of literal description (§5).

References
Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In ACL.
C. Lawrence Zitnick and Devi Parikh. 2013. Bringing semantics into focus using visual abstraction. In CVPR.

Appendix A
Prior work introduced a set of corpora, metrics, and experimental settings for comparing image caption generation evaluation metrics. Perhaps unwittingly, the introduced protocols have become the accepted standard for evaluation of new caption generation metrics. However, seemingly innocuous preprocessing+reporting choices can significantly impact correlations with human judgment on these corpora. In what follows, we detail our replication efforts. Our goal was to make the experimental comparisons involving CLIPScore reported in the main paper as fair as possible. We hope it can be useful for researchers reporting metrics on this setup going forward.

Flickr8K details
We contacted the authors of some prior work, and did our best to re-create their evaluation settings. We uncovered two types of discrepancies when reporting on this corpus. The first discrepancy is that prior work has mixed evaluating rank correlations with Kendall-C and Kendall-B. These metrics handle ties differently, and ties are frequent because human Likert judgements are discretized. The second discrepancy is the method of aggregation of human ratings. Three human ratings were gathered for 5664 (image, candidate) pairs. The majority of prior works flatten all human judgments to a single list, and report rank correlation over 5664 * 3 = 16992 instances (method A). However, another (possibly more defensible) evaluation choice is to average human ratings for each pair, and report rank correlation instead over 5664 instances (method B). The choice of aggregation method has a significant impact on correlations. For example, when we use aggregation method A and τ_c for SPICE, we can exactly replicate the originally reported correlation, 44.9. But if we use τ_c and instead use aggregation method B, the correlation increases to 52.9: this inflation occurs with other metrics, too. For our results, we do our best to report all results for the most common setting: using τ_c correlation, and using aggregation method A. Thus, the results we report may differ slightly from the results reported in prior work.
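The difference between the two aggregation methods is easiest to see on a small example. The sketch below builds the (metric score, human rating) lists that would be passed to a rank correlation under each method; the function names and the toy values are hypothetical.

```python
def flatten_ratings(metric_scores, ratings_per_pair):
    """Method A: one instance per individual human rating."""
    xs, ys = [], []
    for score, ratings in zip(metric_scores, ratings_per_pair):
        for rating in ratings:
            xs.append(score)  # the metric score is repeated for each rater
            ys.append(rating)
    return xs, ys


def average_ratings(metric_scores, ratings_per_pair):
    """Method B: one instance per pair, using the mean human rating."""
    ys = [sum(ratings) / len(ratings) for ratings in ratings_per_pair]
    return list(metric_scores), ys


# Three (image, candidate) pairs, each with three ratings on a 1-4 scale.
metric_scores = [0.2, 0.5, 0.9]
ratings_per_pair = [(1, 1, 2), (2, 3, 3), (4, 4, 3)]

xs_a, ys_a = flatten_ratings(metric_scores, ratings_per_pair)  # 9 instances
xs_b, ys_b = average_ratings(metric_scores, ratings_per_pair)  # 3 instances
```

Under method A every metric score appears three times, so the metric side of the comparison contains ties by construction and rater disagreement stays in the data; under method B the disagreement is averaged away before the correlation is computed.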

Composite details
For this corpus too, prior work has mixed evaluating with Kendall-C and Kendall-B correlations; see Table 6. We suspect the discrepancy in BLEU-4 likely results from a smoothing issue related to the application of BLEU-4 to individual captions vs. the whole corpus (as mentioned in Kane et al. (2020)). Based on these replication efforts, it's likely that the original evaluations for this corpus were computed using τ_c with GT references removed. We agree that the fairest analysis on this corpus should not include a reference that is also a candidate. And while we didn't go through all prior works and recompute their metrics with this change, we did compute ViLBERTScore-F in this setting, because it was, before CLIPScore, the state of the art for this corpus. If it's helpful for future reporting: in the setting where all references (including the GT reference) are used, RefCLIP-S gets τ_c = 60.0.

MSCOCO system-level details
The MSCOCO 2015 image captioning challenge is a standard corpus for evaluating the system-level correlation between new evaluation metrics and human judgments on the MSCOCO test set. To our knowledge, this evaluation was first conducted using a random sample of 1K test set submissions from 15 teams. But because the test set predictions are not public, more recent work (e.g., Cui et al. (2018); Zhang et al. (2020)) has evaluated using dev set predictions from systems, assuming that dev set results correlate with test set results (12 teams submitted dev predictions). However, there are some potential problems with this setup:
1. There's reason to believe that some teams produced their dev set predictions with different models than their test set predictions. For example, the dev set predictions are identical between two submissions, m-RNN and m-RNN (Baidu/UCLA), but their test set predictions differ (and achieve significantly different scores).
2. Correlations are reported over 12 (or possibly only 11, given the duplicate predictions) systems. But Spearman/Pearson correlation over only 12 observations is unfortunately simple to (accidentally) "game" due to the low statistical power of the comparison (see Card et al. (2020) for an overview of statistical power in NLP). Consider a (nonsense) evaluation metric that assigns a random uniform [0, 1) "score" to systems without examining outputs, and consider applying this metric, e.g., N = 10 times to the 12 systems and taking the best performing run as the final metric (simulating either a single researcher developing a new evaluation metric and/or the community's collective trials). We ran a simulation of this process 1000 times: the average Spearman/Pearson correlation between human judgments and our bogus metric was ρ/r = .91, due to repeated evaluation and low sample size.
Thus, while the intent of this evaluation is understandable, and it may be possible to garner some insight if relatively few evaluations are conducted, this specific setup as a fine-grained comparison between new evaluation metrics for caption generation has likely outlived its utility.
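The best-of-N procedure described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the simulation code behind the reported number: the human scores are placeholders (the real MSCOCO human judgments are not reproduced here), and all function names are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Ranks 1..n; assumes no ties, which holds for the values used here."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation on ranks."""
    return pearson(ranks(x), ranks(y))


def best_of_n(human, n_tries, rng, corr):
    """Best correlation reached by n_tries independent random 'metrics'."""
    return max(
        corr([rng.random() for _ in human], human) for _ in range(n_tries)
    )


def simulate(human, n_tries, n_sims=1000, seed=0, corr=pearson):
    """Average best-of-n correlation over n_sims repetitions."""
    rng = random.Random(seed)
    total = sum(best_of_n(human, n_tries, rng, corr) for _ in range(n_sims))
    return total / n_sims


# Placeholder human scores for 12 systems (NOT the real MSCOCO judgments).
HUMAN = [0.05 * i for i in range(1, 13)]

for n_tries in (1, 10, 100):
    print(n_tries, simulate(HUMAN, n_tries), simulate(HUMAN, n_tries, corr=spearman))
```

The printed values depend on the placeholder scores and on the number of tries, so they are not expected to match the figure reported above. The sketch is only meant to show the mechanism: the best-of-N correlation of a random metric grows with the number of tries, even though the metric carries no information about the outputs.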

Pascal-50S Setup Erratum
In March 2022, Jin-Hwa Kim reported some small discrepancies in a replication effort for the Pascal-50S corpus. Upon further investigation, it was discovered that the original version of this work was using a different set of human judgments than the usual setup. In particular, the Pascal-50S corpus contains two types of human judgments: 11 human judgments per pair (located in a file named pair_pascal.mat); and 48 human judgments per pair (located in a file named consensus_pascal.mat). The 48 judgments are intended to be used, and the results in the main paper have been updated accordingly. For reproducibility's sake, in case future work utilizes the 11 judgments, we have included those results in Table 7.

Table 7: Pascal50S-11-judgment accuracy results (5 references, non-standard 11 human judgment version). HC = two human correct captions; HI = both captions are human written, but one is wrong; HM = both captions are for the image, but one is written by a human, one by an algorithm; MM = both captions are for the image, and both are written by an algorithm. We average our results over 5 random samples (but CLIP-S doesn't change because it doesn't use references).

B Rescaling CLIPScore
For readability purposes, as in Zhang et al. (2020), we sought to re-scale the raw cosine similarities computed by CLIP ViT-B/32. While such a monotonic rescaling operation doesn't affect ranking results, for reporting purposes, it can be easier to compare raw values if they are on a scale more closely aligned with other evaluation metrics (e.g., from roughly zero to roughly one). Figure 4 shows the raw candidate-reference and candidate-image cosine similarities for four corpora. (Many "reference"-candidate similarities for the Twitter corpus are 1.0 because users frequently use the text of their tweet as the AltText.) Across all of these cases, we never observed a negative cosine similarity. But, to be safe, we take a maximum between the cosine similarity and zero because the harmonic mean used to compute RefCLIPScore would be undefined for negative values. Multiplying by 2.5 has the effect of "stretching" the CLIPScore distribution to span more uniformly between zero and one, although CLIPScore can still be greater than 1. Furthermore, when computing RefCLIPScore, we maintain this weighting, because it has the effect of mapping the visual-textual cosine similarity distribution to more closely match the reference-candidate distribution: this provides a roughly equal importance weighting between the image-candidate and reference-candidate similarity factors. We note that the exact parameters of our rescaling method only apply to CLIP ViT-B/32. If future, bigger models are released, e.g., the presently unreleased ViT-L/14 CLIP variant, they could exhibit a different cosine similarity distribution.
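The rescaling described above can be sketched on precomputed embeddings as follows. This is a minimal illustration, not the released implementation. It assumes embeddings are plain lists of floats already produced by CLIP. It also assumes that the reference term uses the best-matching reference, which is our reading of the main paper's definition rather than something stated in this section, and that a harmonic mean with a zero term is 0. Function names are hypothetical.

```python
from math import sqrt

W = 2.5  # rescaling weight given in the text for CLIP ViT-B/32


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, cand_emb, w=W):
    """Rescaled image-candidate similarity, clipped at zero before scaling."""
    return w * max(cosine(image_emb, cand_emb), 0.0)


def ref_clip_score(image_emb, cand_emb, ref_embs, w=W):
    """Harmonic mean of the rescaled image term and the clipped reference term."""
    img_term = clip_score(image_emb, cand_emb, w)
    # Assumption: use the reference most similar to the candidate.
    ref_term = max(max(cosine(cand_emb, r) for r in ref_embs), 0.0)
    if img_term == 0.0 or ref_term == 0.0:
        return 0.0  # convention chosen here for a zero term
    return 2 * img_term * ref_term / (img_term + ref_term)
```

Clipping at zero keeps both terms non-negative, so the harmonic mean is always well defined. The weight `w` applies only to the image-candidate term, matching the description that the weighting is maintained when computing RefCLIPScore.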