SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis

The open-ended nature of visual captioning makes it a challenging area for evaluation. The majority of proposed models rely on specialized training to improve human correlation, resulting in limited adoption, generalizability, and explainability. We introduce “typicality”, a new formulation of evaluation rooted in information theory, which is uniquely suited for problems lacking a definite ground truth. Typicality serves as our framework to develop a novel semantic comparison, SPARCS, as well as referenceless fluency evaluation metrics. Over the course of our analysis, two separate dimensions of fluency naturally emerge: style, captured by metric SPURTS, and grammar, captured in the form of grammatical outlier penalties. Through extensive experiments and ablation studies on benchmark datasets, we show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences. Our proposed metrics along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.


Introduction
Visual captioning serves as a foundation for image/video understanding tools and relies on caption evaluation for identifying promising research directions. Rule-based caption evaluation approaches like the n-gram based CIDEr and the parsed semantic proposal based SPICE (Anderson et al., 2016) are able to provide researchers with meaningful feedback on what their algorithm is lacking. However, n-gram based methods are sensitive to stop words and sentence parsers are often inconsistent, leading Liu et al. (2017) to show that neither method fully captures either the fluency or the semantic meaning of text. More recently proposed metrics attempt to learn cues of caption quality by training models via image grounding techniques (Cui et al., 2018) or human and generated captions (Sellam et al., 2020). These approaches, however, lack generality, require domain specific training, and offer little insight for improving captioners; as a result, none of the proposed models has been adopted as a caption evaluation benchmark. We instead postulate that quality in semantics and descriptive language is universally recognizable.
Figure 1: ... (Chen et al., 2015; Karpathy and Fei-Fei, 2015) with one used as a baseline for automatic captioners (Cornia et al., 2020; Pan et al., 2020; Vinyals et al., 2015). For each captioner, a 75% confidence ellipse (1.15 standard deviations from the mean) is generated. A caption near the centroid of each captioner is shown as an example along with the caption scores from 100 randomly sampled images. The normalized ellipse overlap between an automatic captioner and human captions, (H∩M Area) / (M Area), gives an overall evaluation of typical performance at a system-level on a scale of 0 to 1, with 1 being human-caption level.
The primary difficulty of caption evaluation is its cross-modal nature introducing ambiguity into the expected output, resulting in a ground truth that is no longer a single outcome, but a large set of potential outcomes of varying levels of quality. From this problem setting, the novel concept of "typicality" arises naturally. A desirable caption is one that is atypical enough linguistically that it uniquely describes the scene, follows typical natural language protocols, and matches a typical semantic description of a scene.
Linguistically, the number of typical sequences is characterized by the entropy rate (Cover, 1999). Current work estimates the English language as having an entropy rate of only 1.44 bits/letter (Takahashi and Tanaka-Ishii, 2018), implying that the typical set of English is only a tiny fraction of the full space of potential text. Self-attention transformers are language models that are able to identify the distinguishing contextual features of this typical set and as a result have now become the staple of natural language understanding tasks. Here we define typicality based on the distance of a candidate text's features from expected features of the typical set. We call this linguistic typicality estimation method Model-Integrated Meta-Analysis (MIMA) and use the function, $f_{\text{MIMA}}$, to create referenceless fluency metrics attuned to captioning needs. Rather than assuming a predefined evaluation task and introducing bias by fine-tuning the self-attention transformer, our method extracts the inherent properties of language learned by transformers (Devlin et al., 2019) by treating self-attention layers as probability distributions as demonstrated in Clark et al. (2019). Our approach represents the first integration of a fluency specific metric that demonstrably improves correlation with human judgment for caption evaluation.
By removing stop words from the candidate text, $f_{\text{MIMA}}$ is able to create a metric that assesses a relatively new fluency criterion in captioning: style. We refer to this metric as Stochastic Process Understanding Rating using Typical Sets (SPURTS). Style can be thought of as the instantiation of diction and is necessary for generating human-level quality captions. Stylized captions describe a much smaller set of media, leading to machines instead generating the most typical caption that is still semantically correct. This results in a significant gap between machine and human captioners that can be seen in diction-based examples such as the use of common words like "dog" and "food" instead of more descriptive words like "Schnauzer" and "lasagna". The other aspect of fluency assessed by $f_{\text{MIMA}}$ is grammar. Unlike style, grammar is not essential for caption quality; however, highly atypical syntax can potentially lead to awkward captions, so we develop a separate grammatical outlier penalty.
We then define a lightweight and reliable typicality based semantic similarity measure, Semantic Proposal Alikeness Rating using Concept Similarity (SPARCS), which complements our referenceless metrics and grounds them to the reference captions. By matching word sequences, current methods limit the scope of their evaluation. Instead, we take non-stopword unigrams and further coalesce them into concepts through stemming, then combine the reference texts, like in Yi et al. (2020), using a novel semantic typicality measure of the reference text's concepts to evaluate the semantic similarity of a candidate and reference text.
SPURTS and SPARCS can be used to assess system-level differences between captioners as shown in Figure 1. Based on this analysis, the M² Transformer lags behind 2015 models in terms of similarity to human captions, even though both 2020 captioners achieved state-of-the-art results based on CIDEr standards. This difference becomes even more significant when one considers that the use of style makes it more difficult for a caption to be semantically correct. Human captions, M² Transformer (Cornia et al., 2020), X-Transformer (Pan et al., 2020), and Google (Vinyals et al., 2015) incur a total grammar outlier penalty of −44.93, −7.47, −7.56, and −4.46, respectively. In order to provide caption-level insight as well, we combine SPURTS, SPARCS, and our grammar outlier penalty into one metric, SeMantic and linguistic UndeRstanding Fusion (SMURF), which rewards captions based on semantics and fluency.
Contributions: Our key contributions are:
1. A novel and widely-applicable model meta-analysis technique, MIMA, which estimates the typicality of candidate text and which provides a means of assessing transformer robustness.
2. Three novel evaluation metrics useful for both caption-level and system-level evaluation: style-focused SPURTS, semantic-focused SPARCS, and their combination which incorporates grammatical outliers as well, SMURF.
3. Experiments showing that SPARCS and SMURF achieve SOTA performance in their respective areas of semantic evaluation and human-machine evaluation at both a system and caption-level.
4. Evidence showing that the performance of automatic evaluation metrics has been underestimated relative to voting-based human evaluation metrics.

Related Work
Originally, popular rule-based metrics from machine translation that were mostly n-gram based, namely METEOR (Banerjee and Lavie, 2005), BLEU (Papineni et al., 2002), and ROUGE (Lin, 2004), were used for caption evaluation. The more semantically sensitive CIDEr was then introduced, which uses tf-idf to identify distinguishing n-grams and then compares them using cosine similarity. SPICE (Anderson et al., 2016) greatly improved upon n-gram based approaches by using a sentence parser to generate semantic propositions. Word mover's distance scores (Zhao et al., 2019; Kilickaya et al., 2017) have also been used for semantic evaluation with limited success. BERTScore used cosine similarity of embeddings from the self-attention transformer, BERT, and achieved state-of-the-art results on COCO but provided little interpretation of its approach.
Domain specific training approaches have also been introduced with limited adoption. Cui et al. (2018) present a training approach for caption evaluation where an image grounding and/or caption based Turing test is learned based on training data from human and machine captioners. An adjusted BERTScore (Yi et al., 2020), BLEURT (Sellam et al., 2020), and NUBIA (Kane et al., 2020) utilize transformer embeddings for comparison between reference and candidate text, then perform caption dataset specific fine-tuning of the model downstream.
The importance of fluency in captioning has been widely recognized. Liu et al. (2017) attempted to integrate CIDEr and SPICE to create a cost function attuned to both lexicographical and semantic qualities for captioning optimization. Cui et al. (2018) identified the presence of less frequent, distinguishing words within human-generated text in the COCO dataset. Mathews et al. (2018) recognized the importance of style in captions and integrated it into their model without sacrificing semantics.
Referenceless evaluation, first proposed in Napoles et al. (2016) as a referenceless grammar error correction (GEC) evaluation metric, has been recognized as an effective avenue for fluency evaluation as a whole (Asano et al., 2017), along with combined approaches (Choshen and Abend, 2018). More recently, Perception Score (Gu et al., 2021) outlined a general paradigm for training referenceless quality evaluation.

Self-Attention Transformer Background
First introduced in Vaswani et al. (2017), transformers are made of layers of parallel attention heads which extract contextual information about inputs using attention. They take in a sequence vector of tokenized words from candidate text, $y^n$, add start and separator/end tokens, and pass the input through a series of separate linear transforms with parameters, $p$, to create query, key, and value vectors, denoted as $q_i$, $k_i$, $v_i$, respectively. These vectors are then used to compute the attention weight parameters of the heads as shown:
$$\alpha_{ij} = \frac{\exp(q_i^{\top} k_j)}{\sum_{l=1}^{n} \exp(q_i^{\top} k_l)}, \tag{1}$$
$$o_i = \sum_{j=1}^{n} \alpha_{ij} v_j, \tag{2}$$
where $\alpha_{ij}$ and $o_i$ are each layer's attention weights and output, respectively. Here $\alpha_{ij}(y^n, p)$ is a joint distribution with marginal distributions $\alpha_i(y^n, p) = \sum_j \alpha_{ij}(y^n, p)$ and $\alpha_j(y^n, p) = \sum_i \alpha_{ij}(y^n, p)$.
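To make the attention weights and their marginals concrete, the following is a minimal sketch in plain Python. It assumes the standard softmax over query-key dot products, and it assumes that the attention matrix is rescaled by its total so that it sums to 1 before being read as a joint distribution; the rescaling step, the function names, and the toy vectors are illustrative assumptions rather than details taken from the method itself.

```python
import math

def attention_weights(queries, keys):
    """Row-wise softmax of query-key dot products, giving alpha[i][j]."""
    alpha = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha.append([e / total for e in exps])
    return alpha

def joint_and_marginals(alpha):
    """Rescale alpha to sum to 1 (assumption), then sum over j and over i."""
    total = sum(sum(row) for row in alpha)
    joint = [[a / total for a in row] for row in alpha]
    alpha_i = [sum(row) for row in joint]        # marginal over j
    alpha_j = [sum(col) for col in zip(*joint)]  # marginal over i
    return joint, alpha_i, alpha_j

# Toy query and key vectors for a three-token input (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
alpha = attention_weights(queries, keys)
joint, alpha_i, alpha_j = joint_and_marginals(alpha)
```

Under this rescaling every row carries equal mass, so the marginal over rows is uniform and the input-dependent structure sits in the column marginal and in the joint.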
BERT (Devlin et al., 2019) and RoBERTa are encoder-only instantiations of transformers, pretrained on fundamental language tasks over large corpora. Both BERT and RoBERTa have achieved state-of-the-art results in various language understanding tasks. In order to speed up inference time, many papers have employed knowledge distillation to reduce the number of parameters these transformers require while still preserving their inference capabilities (Sun et al., 2019; Sanh et al., 2019; Chen et al., 2020).

Information Theory Background
Transformers like BERT and RoBERTa take text tokenized into sub-word components as input, capturing both the syntax and morphology of the text. The text sequences used as training data, $x^n$, can be modelled as a stationary ergodic stochastic process, $\{X_k\}_{k=1}^{\infty}$, with instantiations limited to finite alphabet $\mathcal{X}$ and based on joint probability distribution, $P(X_1 = x_1, \ldots, X_n = x_n)$, whose transition predictability is governed by entropy rate, $H(\mathcal{X})$.
The entropy of a distribution, or entropy rate in the case of a stochastic process, can be used to describe the number of instantiations expected to be observed from a random variable or process, referred to as the typical set. From the Asymptotic Equipartition Property (AEP), it is known that, for sufficiently large $n$, the size of the typical set of sequences is bounded by
$$(1 - \epsilon)\, 2^{n(H(\mathcal{X}) - \epsilon)} \le |A_{\epsilon}^{n}| \le 2^{n(H(\mathcal{X}) + \epsilon)}, \tag{3}$$
where $2^{nH(\mathcal{X})}$ estimates the size of the typical set.
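As a rough worked example of how small the typical set is, the sketch below compares the estimate $2^{nH}$ with the number of all possible length-$n$ strings, using the 1.44 bits/letter entropy rate quoted earlier. The 27-symbol alphabet (26 letters plus a space) and the sequence length of 50 are assumptions chosen only for illustration.

```python
import math

def typical_set_log2_fraction(n, entropy_rate_bits, alphabet_size):
    """log2 of (estimated typical-set size) / (number of all length-n strings)."""
    log2_typical = n * entropy_rate_bits     # log2 of 2^(n * H)
    log2_all = n * math.log2(alphabet_size)  # log2 of |alphabet|^n
    return log2_typical - log2_all

# Assumed setting: 50-character strings over a 27-symbol alphabet.
frac = typical_set_log2_fraction(n=50, entropy_rate_bits=1.44, alphabet_size=27)
print(f"typical set is about 2^{frac:.1f} of all strings")
```

Working in log space avoids underflow; under these assumptions the typical set is on the order of $2^{-166}$ of all possible strings.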

Model-Integrated Meta-Analysis
We assume that a self-attention transformer learns to fill in words from a sentence by extracting features, $F$. The quality of a piece of text can then be assessed by determining the distance of features taken by the model from candidate text, $Y^n = y^n$, from the expected value of features taken from correctly written text, $X^n = (x^n \in A^n)$, shown visually in Figure 2 and mathematically in Equation 4:
$$\mathrm{dist}\big(F \mid Y^n = y^n,\ \mathbb{E}[F \mid X^n = (x^n \in A^n)]\big). \tag{4}$$
Here dist does not refer to a specific distance metric and is instead an unspecified norm that exists in some realizable projection space. We then postulate the existence of a surrogate function, $f_{\text{MIMA}}$, which maps the sequence input and transformer parameter set, $p$, such that
$$f_{\text{MIMA}}(y^n, p) \propto \mathrm{dist}\big(F \mid Y^n = y^n,\ \mathbb{E}[F \mid X^n = (x^n \in A^n)]\big), \tag{5}$$
resulting in a value indicating the typicality of a candidate input sequence. This value can be used to characterize the input for evaluation purposes.

Attention-Based Information Flow as MIMA Function
We postulate that input text that differs more greatly from members of the typical set generates a greater "spark of interest" in a transformer, resulting in greater information flow through parts of the network as shown in Figure 3. Conversely, if the input text is similar to the positive examples the transformer trains on, less information flows in through the layer, indicating that the model has already captured information about the sequence previously. We formulate information flow in terms of the attention dimensions $\alpha_i(y^n, p)$, $\alpha_j(y^n, p)$, and their joint distribution $\alpha_{ij}(y^n, p)$ as defined in Section 3.1. We consider information flow based on the redundancy between $\alpha_i(y^n, p)$ and $\alpha_j(y^n, p)$ and use normalized mutual information (MI), denoted $I_{\text{flow}}(y^n, p)$ and defined as in Witten and Frank (2005), to capture this redundancy.
We are interested in attention heads with large information flow values, but find empirically that heads with the largest information flow values depend very little on the input and simply function as all-pass layers. Thus, we downselect to a single attention head information flow value to obtain
$$f_{\text{MIMA}}(y^n, p) = 1 - \operatorname{median}_{\text{layer}}\big(\max_{\text{head}}\,[I_{\text{flow}}(y^n, p)]\big). \tag{7}$$
Here, the max over a given layer's attention heads captures the largest "spark of interest". The median removes outlier layers that have largely invariant information flow values.
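The max-over-heads, median-over-layers reduction can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the normalized MI is taken in the symmetric-uncertainty form $2\,[H(\alpha_i) + H(\alpha_j) - H(\alpha_{ij})] / [H(\alpha_i) + H(\alpha_j)]$, each attention matrix is rescaled to sum to 1 before being treated as a joint distribution, and the nested-list layout `attention[layer][head]` is hypothetical.

```python
import math
from statistics import median

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_flow(attn):
    """Normalized MI (symmetric-uncertainty form, assumed) for one head."""
    total = sum(sum(row) for row in attn)
    joint = [[a / total for a in row] for row in attn]  # assumed rescaling
    h_i = entropy([sum(row) for row in joint])
    h_j = entropy([sum(col) for col in zip(*joint)])
    h_ij = entropy([p for row in joint for p in row])
    if h_i + h_j == 0:
        return 0.0
    return 2.0 * (h_i + h_j - h_ij) / (h_i + h_j)

def f_mima(attention):
    """1 - median over layers of the max over heads of information flow."""
    per_layer = [max(information_flow(head) for head in layer)
                 for layer in attention]
    return 1.0 - median(per_layer)

# Toy heads: uniform attention carries no MI, one-to-one attention is maximal.
uniform = [[0.5, 0.5], [0.5, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
score = f_mima([[uniform, uniform], [uniform, identity], [uniform, uniform]])
```

In this toy input only one of three layers contains a high-flow head, so the median discards it as an outlier layer and the score stays at its maximum.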

Caption Evaluation
MIMA provides us with a foundation for computing the fluency of input text. We divide fluency into two categories: grammar and style. Grammar depends on the typicality of the sequence as a whole, $f_{\text{MIMA}}$, and is computed using the distilled BERT model since it achieves the highest Pearson correlation in the grammar experiment from Table 1. Style depends on the distinctness, or atypicality, of the words directly associated with the image description, which we evaluate by removing the stop words from the text and then computing what we define as SPURTS from $f_{\text{MIMA}}$, where $y_{w/o}$ is the candidate sequence without stop words and $f_{\text{MIMA}}$ is computed using the distilled RoBERTa model since it performs well on out-of-distribution text as shown in Figure 5.
We formulate semantic similarity using typicality as well. Assuming a comprehensive set of all valid captions for a single image were available, we consider the distribution of all concepts, $S$. Here we define concepts as the set of stem terms that would remain if all stop words and affixes were removed from the text. The distribution of concepts sampled from such a set of captions, $S_m$, would have a typical set, $S_m^{\beta}$, of the most relevant concepts. Thus, a valid caption that is representative of the image semantically and demonstrates fluency should contain concepts that are members of the typical set of concepts, $S_m^{\beta}$, and be a member of the typical set of correctly formed language sequences defined in Section 3.2, $A^n$, as shown in Figure 4.
To extract concepts from a caption, we use a stemmer on $y_{w/o}$ and estimate the typicality of each reference concept using the document frequency, $df$, of the concept across the available reference captions, $gt(S)$, where $gt$ is the function that maps concepts to a reference caption set. We then use an adjusted F1 score to determine the similarity between the reference concepts and candidate concepts.
The first portion of the F1 score is precision, corresponding to caption correctness. Our adjusted precision is $P(C, S)$, where $C$ is the candidate concept set and $gt(S)$ is the reference caption set. Our approach equally weights correct and incorrect concepts if only one reference is used, but as the number of references increases, it gradually decreases the importance of less common correct concepts.
The second portion of the F1 score is recall, corresponding to caption detail. Our adjusted recall is $R(C, S)$, where a candidate concept set, $C$, which included all concepts from the reference set, $S$, would achieve a score of 1. We then use the standard F1 score combination
$$\text{SPARCS} = F_1(C, S) = \frac{2 \cdot P(C, S) \cdot R(C, S)}{P(C, S) + R(C, S)}. \tag{11}$$
To give an overall evaluation of performance, we fuse the proposed metrics. To begin, we standardize the output score distribution of human generated captions for each metric using the captions from the COCO Karpathy test split from [...] is semantically correct. For all of our proposed metrics, a larger value corresponds to a higher quality caption.
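The concept-matching pipeline can be illustrated with a small sketch. The precision and recall weightings below are assumptions chosen only to reproduce the properties stated in the text (correct and incorrect concepts weigh equally with a single reference, rarer correct concepts count less as references are added, and a candidate covering every reference concept reaches a recall of 1); they are not claimed to be the exact adjusted formulas. The stop-word list and the suffix-stripping stemmer are toy stand-ins.

```python
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "with", "and"}

def concepts(caption):
    """Lower-case, drop stop words, and crudely strip a few suffixes."""
    out = set()
    for word in caption.lower().split():
        word = word.strip(".,!?")
        if not word or word in STOP_WORDS:
            continue
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        out.add(word)
    return out

def sparcs_like(candidate, references):
    """Document-frequency-weighted F1 over concepts (assumed weighting)."""
    cand = concepts(candidate)
    ref_sets = [concepts(r) for r in references]
    df = {}
    for ref in ref_sets:
        for c in ref:
            df[c] = df.get(c, 0) + 1
    n_refs = len(ref_sets)
    matched = sum(df[c] / n_refs for c in cand if c in df)  # weight in (0, 1]
    wrong = sum(1 for c in cand if c not in df)             # weight 1 each
    precision = matched / (matched + wrong) if cand else 0.0
    recall = sum(df[c] for c in cand if c in df) / sum(df.values()) if df else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

refs = ["A dog plays with a ball", "Dogs playing in the park"]
score = sparcs_like("A dog plays with a frisbee", refs)
```

Here "dog" and "play" appear in both references and so carry full weight, while the unseen concept "frisbee" lowers precision and the missing "ball" and "park" lower recall.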

Preliminary Experiment
We first seek to validate that our proposed $f_{\text{MIMA}}$, extracted from the attention layers of BERT, RoBERTa, and their knowledge distilled versions, is proportional to the distance from the expected value of features of the typical set. To this end, we create an experiment where we can control the randomness of input text. We begin with 11 different paragraphs from unrelated Wikipedia articles. We extract all the words from the paragraphs and create a word set corpus. We then sample 25 sentences from the paragraphs randomly. Each sentence is iteratively degraded by substituting a fraction of the words with random words from the word set corpus. At each iteration step, the sentences are passed through the transformers and the value of $f_{\text{MIMA}}$ is computed. Eventually the sentence is incoherent and bears no resemblance to "natural" text. The process and results can be seen in Figure 5. The average $f_{\text{MIMA}}$ value for our information flow formulation shows a strong correlation with the degradation in both models up until about 10% of the tokens have been replaced, beyond which RoBERTa remains reliable but BERT does not, demonstrating RoBERTa's superior robustness.
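The degradation step of this experiment can be sketched as below. The vocabulary, the example sentence, and the ten-step schedule are hypothetical, and for simplicity each step degrades the original sentence afresh at a larger fraction instead of compounding substitutions across iterations, which is an assumption about a detail the text leaves open.

```python
import random

def degrade(sentence, vocabulary, fraction, rng):
    """Replace about `fraction` of the words with random vocabulary words."""
    words = sentence.split()
    n_swap = round(fraction * len(words))
    for idx in rng.sample(range(len(words)), n_swap):
        words[idx] = rng.choice(vocabulary)
    return " ".join(words)

rng = random.Random(0)  # fixed seed so the run is repeatable
vocabulary = ["river", "seven", "quickly", "museum", "orbit", "green"]
sentence = "the committee approved the proposal after a short debate"

# Degradation levels from 0% to 100% in steps of 10%.
steps = [degrade(sentence, vocabulary, level / 10, rng) for level in range(11)]
```

Each string in `steps` would then be passed through the transformer and scored, giving one typicality value per degradation level.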

Datasets
CoNLL-2014. The CoNLL-2014 competition (Ng et al., 2014) was a shared task of correcting grammatical errors of all types present in different sentences of an essay written by a learner of English as a second language. The essay consisted of 1312 separate sections to correct. A system-level human evaluation study of the grammatical quality of the corrected sentences from 12 competition submissions was presented in Grundkiewicz et al. (2015). Participants were asked to rate how natural the corrected sentences sounded and did not have access to any reference sentence.
Microsoft COCO 2014. We use the Microsoft COCO validation set (Chen et al., 2015), comprising 40,504 images, for a system-level human correlation experiment. These images are annotated with five human-generated captions, one of which is used as a baseline caption candidate. Human evaluations of competition entries were collected using Amazon Mechanical Turk (AMT). These evaluations were framed as questions from which 2 primary dimensions of system-level caption quality were derived as a ground truth to rank competitors: M1 (percentage better than or equal to human description) and M2 (percentage passing the Turing Test). Three additional categories were also included as an experimental ablation study but were not considered in the final competition ranking. In total, 255,000 evaluations were collected.
Flickr 8K. We use the graded human quality scores for the 5,822 remapped captions from the Flickr 8K dataset (Hodosh et al., 2013) for a caption-level semantic human correlation study. The dataset was formed by selecting captions from one image and assigning them to another. These captions are then graded based on how well they align with the image using two different standards. The first standard is Expert Annotation, where human experts rate the image-caption pairing on a scale of 1 (caption and image unrelated) to 4 (caption describes image with no errors). Each caption-image pairing has 3 scores, which we combine by taking the average. The second standard is Crowd Flower Annotation, where at least 3 students vote yes or no on whether the caption and image are aligned.
Composite Dataset. This is an additional dataset for the caption-level study of semantic human correlation from Aditya et al. (2018). It contains 11,095 human judgments (on a scale of 1-5) over Flickr 8K, Flickr 30K (Young et al., 2014), and COCO and, in contrast to the Flickr 8K dataset, includes machine generated captions in addition to human reference captions as candidates. Each evaluation is based purely on either correctness or detailedness.
PASCAL-50S. Human evaluators were asked to identify which of two sentences, B or C, is more similar to reference sentence A. Unlike other caption datasets, human evaluators in PASCAL-50S did not have access to the original image. The captions for sentence A were sourced from a 1000 image subset of the UIUC PASCAL Sentence Dataset (Rashtchian et al., 2010) for which additional human captions were collected using AMT. Sentences B and C were sourced from both human and machine generated captions. The human captions were sourced from the original PASCAL dataset, resulting in four different pairing combinations: human-correct (HC), human-incorrect (HI), human-model (HM), and model-model (MM).

System-Level Human Correlation
System-level experiments evaluate how closely human evaluation and automatic evaluation models align in terms of their overall evaluation of captioning models. To confirm that $f_{\text{MIMA}}$ can capture grammar information, we replicate the experiment performed in Napoles et al. (2016) and show improved performance over previous benchmarks in Table 1. GLEU (Napoles et al., 2015), I-measure (Felice and Briscoe, 2015), and M² (Dahlmeier and Ng, 2012) [...]
We then benchmark our proposed caption evaluation metrics against the rule-based metrics used in the Microsoft COCO 2015 Captioning Competition, which still serve as the standard for caption evaluation, and the recall-idf configuration of BERTScore. We observe that the original COCO submissions and many of the original codebases for the submissions are not publicly available or do not provide pretrained models. Other authors attempt to reproduce the submissions using open source reimplementations that they have trained themselves, which will not be consistent with the submissions for which the human evaluations were performed. Thus, we instead opt to use the 4 representative baseline caption sets (Vinyals et al., 2015; Xu et al., 2015; Karpathy and Fei-Fei, 2015) provided publicly by Cui et al. (2018), which include 3 competition submissions from open sourced models and 1 human caption baseline. These are guaranteed to be consistent with their work and reproducible. In Table 2, we show the COCO results for SPARCS, SPURTS, and SMURF.
SMURF and BERTScore demonstrate the highest correlation with human judgment in this dataset. BERTScore's performance is partially due to incorporation of idf dataset priors also used by CIDEr, which we do not utilize to keep our metrics as general and consistent as possible. To illustrate this point, we also report BERTScore's correlation without idf weighting (BS-w/oidf) for this experiment. Despite its simplicity, SPARCS also performs well along with SPURTS. The rest of the metrics fail to adequately reflect human judgment.

Caption-Level Human Correlation
Caption-level experiments evaluate how closely human evaluation and automatic evaluation models align for each individual caption. We begin with the PASCAL-50S dataset in Table 3. We follow the procedure used in Anderson et al. (2016) and use the first 5 sentence A entries of each image.
The PASCAL-50S dataset is based on a direct comparison between the reference and candidate captions, which gives similarity based metrics a distinct advantage. As a result, SPARCS achieves the top score in this experiment. Another interesting result is that SPURTS performs reasonably well in the human-machine category despite having no access to the reference sentence. This shows the effectiveness of SPURTS as a Turing Test at both a system and caption-level, independent of semantic information. The additional information provided by SPURTS to SMURF in the human-machine category actually improves its performance.
To evaluate our semantic metric specifically, we use the Flickr 8K and Composite datasets and follow the experiments specified in Anderson et al. (2016). However, we have discovered a flaw in previous comparisons between the correlation of automatic evaluation metrics with expert evaluation and inter-human correlation using the Flickr 8K dataset. Only a small subset of annotations between the Crowd Flower and Expert Annotations overlap, and this subset often consists of ties, causing the ranking metric to fail. To give a fair comparison, we also test the automatic metrics on a tie-free subset of the Flickr 8K data and use these results for human comparison. All of these results can be seen in Table 4. SPARCS outperforms other metrics on the Flickr 8K dataset. However, SPICE outperforms SPARCS on the Composite dataset. This is likely because evaluations of "correctness" in the Composite dataset are based on semantic propositions and do not consider partial correctness.
Additionally, these new results show that automatic metrics can actually outperform voting-based human metrics in terms of their correlation with experts, further motivating their use. This warrants further study as some recent datasets opt to use voting-based human metrics due to their ease of collection (Levinboim et al., 2021).

Generalization/Robustness Study
We perform a caption-level generalizability and robustness case study on the most commonly used caption evaluation algorithms using the COCO validation set in Table 5. We define a critical failure, F, as a disparity of greater than 1 between system-level human (M2) and caption-level algorithm correlation of a reference evaluation metric and a tested evaluation metric for a given caption set of an image. The last column of Table 5 shows the likelihood of a critical failure occurring for each metric.
In a human study, we identify the primary cause of critical failure in the 20 most severe discrepancies in order to identify potential areas for improvement for each metric. We use SMURF as a reference evaluator for the other evaluators and SPICE as a reference for SMURF. The estimated probability of each of these failure causes is shown in the first three columns of Table 5.
The first failure cause, c1, refers to a scenario where the metric fails despite there being enough word overlap between the candidate and reference captions for a correct judgment to be made. This implies that the choice of words/sequences made by the metric for the comparison needs improvement. The second failure cause, c2, refers to the use of correct and distinct words or phrases by the human captioner that are not seen in the references. Lastly, we include the case where the reference evaluator may have incorrectly identified the correct caption ranking (according to the human annotator) as matching system-level human judgment. We refer to this as a reference failure, RF.
The focus of previous studies has been robustness to distractors (Sharif et al., 2019; Cui et al., 2018; Hodosh and Hockenmaier, 2016). We observe no captions where this is a primary cause of failure. On the contrary, we find that each metric is highly susceptible to specific c1 scenarios:
n-gram based: Both CIDEr and METEOR are sensitive to stopwords, leading to rewards for words or sequences that supply no additional information.
SPICE: Semantic proposal formation or sentence parsing issues can lead to the metric unpredictably failing to recognize highly informative proposals.
SMURF: The metric may fail to adequately reward additional information if the words used are too common, like 'few' or 'some'.

Conclusion and Future Work
In this paper, we use information theory based typicality analysis to capture a new perspective on the problem of caption evaluation. Our analysis leads us to two caption evaluation metrics that capture separate dimensions of caption quality and a fused metric. We have performed experiments demonstrating their correlation with human judgment, shown how these methods could be used to perform multi-aspect system-level analysis of algorithm performance, and performed caption-level studies explaining why combining these two algorithms leads to more robust and generalizable evaluations. The underlying mechanism, MIMA, opens many new avenues for the analysis of self-attention transformers and potentially other models. Future work could also focus on optimal weighting between semantics and style.

Ethical Impact
Harmful bias, especially towards gender (Hendricks et al., 2018), has been shown to be present in image caption datasets and is often further magnified by automatic captioners. Prior caption evaluation methods have the potential to further exacerbate the problem by rewarding such captions due to their reliance on dataset specific images or captions. Referenceless evaluations like our style metric, SPURTS, offer a preemptive approach for mitigating harmful dataset bias, like in Simpson's Paradox (Mehrabi et al., 2019), by utilizing intrinsic properties of descriptive language learned by self-attention models over far larger and more diverse corpora. This gives the evaluator a more holistic view of caption quality rather than viewing the world through the lens of a single visual dataset.