Evaluating and Improving Factuality in Multimodal Abstractive Summarization

Current metrics for evaluating factuality for abstractive document summarization have achieved high correlations with human judgment, but they do not account for the vision modality and thus are not adequate for vision-and-language summarization. We propose CLIPBERTSCORE, a simple weighted combination of CLIPScore and BERTScore that leverages their robustness and strong factuality detection performance for the image-summary and document-summary pairs, respectively. Next, due to the lack of meta-evaluation benchmarks to evaluate the quality of multimodal factuality metrics, we collect human judgments of factuality with respect to documents and images. We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization, outperforms an existing multimodal summarization metric, and performs competitively with strong multimodal factuality metrics specifically fine-tuned for the task. Our thorough analysis demonstrates the robustness and high correlation of CLIPBERTSCORE and its components on four factuality metric-evaluation benchmarks. Finally, we demonstrate two practical downstream applications of our CLIPBERTSCORE metric: for selecting important images to focus on during training, and as a reward for reinforcement learning to improve the factuality of multimodal summary generation with respect to automatic and human evaluation.


Introduction
Multimodal abstractive summarization is the task of generating an abridged text that contains the most important information of the source inputs from various modalities. This challenging task builds upon the success of document summarization, where the input is only text documents. For document summarization, there has been tremendous progress in improving the quality of the summaries with the help of large pre-trained models (Lewis et al., 2020; Zhang et al., 2020; Raffel et al., 2020). However, one crucial problem for such models is hallucination, where the model generates content that is not present in or entailed by the document (Maynez et al., 2020; Falke et al., 2019).
While there have been significant advancements in developing metrics that correlate highly with the human judgment of factuality (Kryscinski et al., 2020; Durmus et al., 2020; Goyal and Durrett, 2021; Scialom et al., 2021), these metrics only measure factuality between the document and the summary. The lack of judgment between other modalities, such as vision, and the summary makes such metrics unsuitable for multimodal settings. We demonstrate this with the example in Figure 1. The given summary captures less relevant information (cutting the nail) from the document, but it is still considered factual to the document. However, the image shows the main point of the document (finding the place where the nail separates from the quick), making the summary not factual with respect to the image. Current factuality metrics do not account for the image and thus cannot correctly assess factuality for multimodal summaries.
In this work, we introduce a metric that judges the factuality of the summary with respect to each input modality. Focusing on vision-and-language summarization, we propose CLIPBERTSCORE, a simple and robust automatic factuality evaluation metric for multimodal summaries that combines two successful metrics: CLIPScore (Hessel et al., 2021), which shows strong performance in detecting hallucinations between image and text, and BERTScore (Zhang* et al., 2020), which correlates well with the human judgment of factuality for document summarization (Pagnoni et al., 2021).
Next, due to the lack of corpora containing ground-truth human factuality judgments to evaluate multimodal factuality metrics via correlation with human evaluation, we propose a Multimodal Factuality Meta-Evaluation (MUFAME) benchmark by collecting human annotations for four summarization systems and the reference summary on WikiHow, a large collection of how-to articles containing rich images relevant to the document. We find that our simple CLIPBERTSCORE metric, which combines two off-the-shelf metrics in the zero-shot setting, achieves higher correlation with human judgment than existing text-based factuality metrics, outperforms the current multimodal summarization metric MMAE (Zhu et al., 2018), and performs competitively with a metric trained with a triplet loss specifically for this task.
Next, we perform a detailed analysis of CLIPBERTSCORE by evaluating the correlation of the metric and each of its modules on four additional factuality metric-evaluation benchmarks. We first propose the WikiHow Factuality (WikiHowFact) task, derived from the Visual Goal-Step Inference task (Yang et al., 2021). This ranking experiment measures how well the metric can select the correct summary from four choices given the correct document and image. Since hallucinations are also present in the image-captioning task (Xiao and Wang, 2021), we also evaluate the correlations on two captioning tasks focusing on hallucinations, FOIL (Shekhar et al., 2017) and BISON (Hu et al., 2019), and one document summarization factuality benchmark, FRANK (Pagnoni et al., 2021). Across all these tasks, we demonstrate the robustness of CLIPBERTSCORE and its components, as they achieve the highest correlations compared to strong baselines in each of their respective settings.
Lastly, we present two practical applications for improving the factuality of downstream multimodal summarization models using CLIPBERTSCORE: (1) selecting the most important images as visual guidance (Zhu et al., 2020), and (2) using the metric as a reward for self-critical sequence training (Rennie et al., 2017). Our results indicate that both applications improve the factuality of the generated summaries across two multimodal summarization datasets, as measured by three factuality metrics and human evaluation.
To summarize, our contributions are:
1. We propose a simple and robust factuality metric for multimodal summarization based on a combination of CLIPScore and BERTScore.
2. We create MUFAME, a meta-evaluation for factuality of multimodal summarization, and the WikiHowFact task to evaluate the quality of multimodal factuality metrics.
3. We present a detailed study of our metric and its components on various factuality metric-evaluation benchmarks and present strong empirical evidence of its robustness.
4. We demonstrate two useful downstream applications of our metric to improve the factuality of multimodal abstractive summarization models.

CLIPBERTSCORE
CLIPBERTSCORE consists of two parts that tackle the image-summary and document-summary factuality judgments, respectively. We show an illustration of the computation in Figure 1.
Image-Summary. We use a variant of CLIPScore (Hessel et al., 2021) for evaluating the factuality between the images and the summary. This metric is based on CLIP (Radford et al., 2021), a cross-modal model trained on 400M image and caption pairs with the InfoNCE (Oord et al., 2018) contrastive loss. Hessel et al. (2021) find that using this off-the-shelf model as a metric achieves the highest human correlation on the FOIL (Shekhar et al., 2017) benchmark, which tests how well a metric can detect hallucinations present in captions.
Thus, it serves as a fitting candidate for factuality evaluation between the image and the summary. We use CLIP-S, which calculates the cosine similarity between the image embedding v and the text embedding of the summary sentence t. To adapt to multimodal summarization, where we have multiple images and multi-sentence summaries, we take the average of the scores over all image and sentence pairs. Formally, given a list of image embeddings V and summary sentence embeddings T from CLIP's image and text encoder, respectively:

CLIP-S(V, T) = 1/(|V| |T|) Σ_{v∈V} Σ_{t∈T} cos(v, t)

Document-Summary. To better detect hallucinations present in the summary with respect to the document, we use the precision variant of BERTScore (Zhang* et al., 2020), which achieves high correlations with human judgments of factuality for document summarization (Pagnoni et al., 2021). See Section 4.4 for a detailed discussion and comparison against other text-based factuality metrics. Formally, given the contextual embeddings of each token in the document D and summary S, it calculates the pairwise cosine similarity between the document and summary token embeddings:

BERT-S(D, S) = 1/|S| Σ_{s∈S} max_{d∈D} cos(d, s)

Full Metric. The final score is a combination of the factuality score for image-summary with CLIP-S and that for document-summary with BERT-S:

CLIPBERTSCORE = α CLIP-S + (1 − α) BERT-S,

where α is a tunable parameter. Please see Section 3.4 for other ways to learn this combination.
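As a concrete sketch of the computation (our own illustration, not the official implementation; it assumes the CLIP image/sentence embeddings and the BERT contextual token embeddings have already been extracted):

```python
import numpy as np

def _cosine_matrix(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def clip_s(V, T):
    """Average similarity over all (image, sentence) pairs.
    V: (n_images, d) CLIP image embeddings; T: (n_sentences, d) CLIP text embeddings."""
    return float(_cosine_matrix(V, T).mean())

def bert_s(D, S):
    """BERTScore precision: each summary token is matched to its most similar
    document token, then averaged over summary tokens.
    D: (n_doc_tokens, d); S: (n_summary_tokens, d) contextual embeddings."""
    return float(_cosine_matrix(D, S).max(axis=0).mean())

def clipbertscore(V, T, D, S, alpha=0.25):
    # alpha = 0.25 is the value tuned on the MUFAME validation set.
    return alpha * clip_s(V, T) + (1 - alpha) * bert_s(D, S)
```

Note that the released CLIPScore additionally rescales and clips the cosine similarity at zero; the sketch keeps the plain cosine for clarity.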

Metric Meta-Evaluations
Next, after defining the multimodal factuality metric CLIPBERTSCORE, we want to evaluate the quality of this new metric by checking whether it correlates with human judgments, similar to what has been done for textual factuality metrics (Wang et al., 2020; Kryscinski et al., 2020; Durmus et al., 2020; Scialom et al., 2021). As there are no human annotations of factuality for multimodal summarization, we first propose a Multimodal Factuality Meta-Evaluation (MUFAME) benchmark derived from WikiHow to test the correlations of CLIPBERTSCORE with human judgments of factuality.

MUFAME
Dataset. We construct an English multimodal WikiHow summarization dataset (Koupaee and Wang, 2018) for the human evaluation, as this dataset has been extensively studied for document summarization (Koupaee and Wang, 2018; Ladhak et al., 2020), and the images associated with the how-to articles are relevant to the text. We use a recent WikiHow collection effort by Yang et al. (2021) containing images. We generate the step-level multimodal WikiHow dataset by breaking each article into steps and following the construction described in Koupaee and Wang (2018): We consider the first sentence of a step as the summary and the rest of the paragraph as the document, and add the corresponding image. We randomly select 6,000 articles each for the validation and test set, and break each example into steps. Statistics of the dataset can be found in the Appendix.

Summarization Systems. Following Pagnoni et al. (2021), we include model summaries from summarization models with varying factuality capabilities. We train four abstractive summarization systems on the multimodal WikiHow dataset, including two document summarization models, T5 (Raffel et al., 2020) and PEGASUS (Zhang et al., 2020), and two multimodal summarization models, CLIP-BART (see Section 5) and MOF (Zhu et al., 2018). Details of the models are provided in Appendix A.2. We additionally include the reference summaries, resulting in a total of 260 and 965 examples for the validation and test set, respectively.
Annotations. We conduct the annotations on the Amazon Mechanical Turk (AMT) platform. For each HIT, we provide the document and the image and ask the workers to read the five summaries.
The workers then need to choose whether each summary is faithful to the document and to the image separately. An example of the annotation page can be seen in Appendix A.3. For high-quality annotations, we first conduct a qualification test, where we compare the annotations from the workers against annotations by the authors. Only the workers who give the same annotations on the selected examples can perform the actual annotation task. We further select workers from the United States who have more than 10,000 HITs approved and an approval rate greater than 98%. We pay 0.18 USD per task to ensure a > $12 hourly rate. Each task is completed by three unique workers, and we take the majority class for the document and image factuality judgments, similar to Pagnoni et al. (2021).
We consider the summary to be faithful only if it is considered faithful to both document and image.
We also experiment beyond the binary judgment by taking the average over the two factuality judgments to indicate that a summary may be partially faithful to one of the sources, as shown in Appendix B.
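The aggregation described above can be sketched as follows (a minimal illustration of our own, not the annotation-pipeline code):

```python
from statistics import mode

def aggregate_judgments(doc_votes, img_votes):
    """Take the majority class over the three annotators' binary votes
    (1 = faithful) for the document and image judgments separately, then
    combine them two ways: the strict binary judgment (faithful only if
    faithful to BOTH sources) and the continuous average from Appendix B
    (0.5 means faithful to exactly one source)."""
    doc = mode(doc_votes)
    img = mode(img_votes)
    binary = int(doc == 1 and img == 1)
    partial = (doc + img) / 2
    return binary, partial
```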
Inter-Annotator Agreement. We report Fleiss Kappa κ (Fleiss, 1971) and the percentage p of annotator agreement on the majority class, similar to Durmus et al. (2020). We obtain κ = 0.50 with p = 88.5%, indicating a moderate agreement (Landis and Koch, 1977).

Experimental Setup

CLIPBERTSCORE. For CLIP-S, we use the RN50x64 visual backbone instead of the ViT-B/32 version used in the original metric, as the larger backbone shows a higher correlation on factuality benchmarks. For BERT-S, we choose RoBERTa-large-mnli to compute the contextualized embeddings instead of RoBERTa-large for the same reason. We refer readers to Section 4 for more details. We use the validation set of MUFAME to tune α, where we find that α = 0.25 achieves the best correlations on the combined judgment. We use this parameter for all experiments (see Section 3.4 for other ways to learn this combination).

Baseline Metrics. Having separate judgments for document-summary, image-summary, and multimodal settings allows us to evaluate the metrics' performance with different modality combinations.
For image-summary evaluation, we compare our CLIP-S against the Triplet Network, as described in Yang et al. (2021). We train this metric on the multimodal WikiHow dataset, allowing comparisons between CLIP-S in the zero-shot setting and a metric fine-tuned for this task.
For multimodal factuality metrics, we experiment with several weighted combinations of document-summary and image-summary metrics by tuning the weights on the validation set, including combinations of DAE with CLIP-S, the Triplet Network with BERT-S, and RefCLIP-S. We also compare to MMAE (Zhu et al., 2018).
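The α tuning described in the experimental setup amounts to a one-dimensional grid search over Pearson correlation with the validation-set human judgments; a sketch (function names and grid resolution are ours):

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation coefficient between two score lists.
    x = np.asarray(x, float) - np.mean(x)
    y = np.asarray(y, float) - np.mean(y)
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

def tune_alpha(clip_scores, bert_scores, human_judgments):
    """Grid-search alpha in [0, 1], keeping the weight whose combined
    score alpha*CLIP-S + (1-alpha)*BERT-S correlates best with the
    combined human judgments on the validation set."""
    clip_scores = np.asarray(clip_scores, float)
    bert_scores = np.asarray(bert_scores, float)
    grid = np.linspace(0.0, 1.0, 101)
    return float(max(grid, key=lambda a: pearson(
        a * clip_scores + (1 - a) * bert_scores, human_judgments)))
```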

Meta-Evaluation Results
Table 1 shows the Pearson correlations of the automatic metrics. We first note that the combined judgments require taking both modalities into consideration. Metrics that only consider the document correlate less with the combined judgment than with the document-only judgment, indicating the importance of capturing the vision modality when evaluating factuality for multimodal summarization. Multimodal factuality metrics, on the other hand, show positive transfer, as they correlate higher in all three settings than their individual components.
Next, for the document-summary factuality judgments, BERT-S achieves the highest correlation, outperforming DAE by 8 points and the original BERTScore by 4 points. Compared to MMAE, which is developed for evaluating the quality of multimodal summarization, CLIPBERTSCORE significantly outperforms on all three categories, showing the importance of targeting the factuality aspect. While Triplet-Net achieves better correlations on the image judgment, CLIPBERTSCORE actually outperforms the fine-tuned variants for the document case and provides the same correlations on the combined case. We thus stress the simplicity of CLIPBERTSCORE: it only requires two off-the-shelf metrics in the zero-shot setting, without any extra training, to compete with a fine-tuned method.

Comparison of Combination Strategies
While CLIPBERTSCORE uses α to decide the weights for CLIP-S and BERT-S, we also explore using logistic regression (logis) and a multi-layer perceptron (MLP) to output a final score given the two modules, following Zhu et al. (2018). Similar to the α parameter, we tune the two methods by fitting the metric on the combined human judgment scores and selecting the parameters that achieve the highest Pearson correlation on the development set of the MUFAME meta-evaluation dataset. The result is presented in Table 2. While logistic regression performs the worst, using an MLP for combining the two modules provides similar correlations as CLIPBERTSCORE with the α parameter. Specifically, the MLP provides a one-point increase in correlations with respect to the document but provides the same correlations on the combined judgment. The weight combination strategy can be chosen based on preference, but we advocate for the simplicity of the α parameter.
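A minimal sketch of the logistic-regression combiner over the two module scores (our own illustration, assuming binary human judgments as training labels; the paper's actual logis and MLP variants follow Zhu et al. (2018)):

```python
import numpy as np

def fit_logistic(features, labels, lr=0.5, steps=2000):
    """Gradient-descent logistic regression mapping [CLIP-S, BERT-S] pairs
    to a single combined score in (0, 1).
    features: (n, 2) array of module scores; labels: (n,) 0/1 judgments."""
    X = np.column_stack([features, np.ones(len(labels))])  # bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - labels) / len(labels)
    return w

def combined_score(w, features):
    # Apply the learned combiner to new [CLIP-S, BERT-S] pairs.
    X = np.column_stack([features, np.ones(len(features))])
    return 1.0 / (1.0 + np.exp(-X @ w))
```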

Additional Factuality Metric-Evaluation Benchmarks
We evaluate CLIPBERTSCORE and its components on additional factuality metric-evaluation benchmarks, focusing on how robustly the metrics perform across a variety of tasks and domains.

WikiHow Factuality
We propose the WikiHow Factuality (WikiHowFact) task, which evaluates how well the metric can choose the correct summaries over incorrect ones. We derive this task from WikiHow VGSI (Yang et al., 2021) to evaluate text-image matching performance as a ranking experiment, a setup that has been explored for factuality metric evaluation (Falke et al., 2019). An example can be seen in Figure 2. Each example uses the correct document and image as the prompt and includes four choices consisting of the correct summary and three negative summaries generated by the random, category, and similarity sampling strategies described in Yang et al. (2021). We note that this setup is a more challenging task than the original VGSI task. Similar to the meta-evaluation in Section 3.2, we consider the document, image, and combined settings depending on the choice of the prompt, and evaluate using the same sets of metrics. We further compare CLIP-S to the variant using the smaller ViT-B/32 visual backbone. We compute the ranking accuracy of assigning a higher score to the correct summary. See Appendix C.1 for more details.
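The reported ranking accuracy can be computed as follows (a trivial sketch of our own):

```python
def ranking_accuracy(examples):
    """examples: list of (correct_score, negative_scores) pairs, one per
    WikiHowFact instance. An instance counts as correct when the metric
    assigns the true summary a strictly higher score than every negative."""
    hits = sum(1 for pos, negs in examples if pos > max(negs))
    return hits / len(examples)
```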
Results. We present the WikiHowFact results in Table 3. First, for the image-summary setting, we observe the benefit of a larger visual backbone for factuality, as CLIP-S achieves a 3, 5, and 6.3 point increase compared to CLIP-S ViT-B/32 for the random, category, and similarity splits, respectively. For the document-summary setting, CLIPText-S and BERT-S achieve high accuracy across the sampling strategies. Interestingly, CLIPText-S performs better than BERT-S, but this does not carry over to the multimodal case: CLIPBERTSCORE actually outperforms RefCLIP-S, showing the better positive transfer between CLIP-S and BERT-S. Similar to the meta-evaluation results, the Triplet Network outperforms CLIP-S for the image-summary setting, but the difference on the random and category splits is only around 1 point. CLIPBERTSCORE still outperforms Triplet Network + BERT-S on the random and category splits, indicating the strong performance of combining the two metrics for evaluating factuality.

Hallucination in Image Captioning
The FOIL (Shekhar et al., 2017) dataset measures how well the metric can differentiate correct MSCOCO captions from hallucinated ones generated by adversarially swapping out noun phrases. We follow Hessel et al. (2021) and evaluate metrics in the paired setting. We compute the ranking accuracy given only the image (no-ref), and with 1 or 4 additional reference captions. We compare CLIPBERTSCORE and its components with CLIPScore variants using the ViT-B/32 backbone. We refer the readers to Appendix C.2 for more details and results on all visual backbones. We present the results in Table 4. BERT-S achieves the highest accuracy in terms of text-matching performance. Especially when more text (4 references) is provided, it outperforms CLIPText-S by 1.6 points. For image-text matching, we observe similar improvements with larger visual backbones. CLIPBERTSCORE showcases its strength at positive transfer between its two components: we observe improvements over RefCLIP-S RN50x64 of 0.7 points for 1-ref and 1.2 points for 4-ref.

Fine-grained Visual Grounding
BISON (Hu et al., 2019) measures the ability of the metric to select the correct MSCOCO image from a semantically similar one, requiring more fine-grained visual grounding to achieve high accuracy. We compare the image-summary metrics, and refer the readers to Appendix C.3 for results on all CLIP-S variants. The fine-tuned Triplet Network, on the other hand, is not robust to this task, achieving much lower accuracy than all the other metrics.

Document Summarization Factuality Evaluation
We compare how BERT-S and CLIPText-S correlate on FRANK, a factuality benchmark for document abstractive summarization containing 2,250 annotations of generated summaries on XSum (Narayan et al., 2018) and CNN/DM (Hermann et al., 2015). We report Pearson and Spearman correlations, using the official evaluation script. The result is shown in Table 13 in the Appendix. CLIPText-S does not perform well at detecting faithful summaries for summarization, with Pearson and Spearman coefficients around 0.10 for all datasets and 0.05 for XSum Spearman. In contrast, BERT-S, which uses a RoBERTa (Liu et al., 2019) model fine-tuned on MNLI (Williams et al., 2018), correlates higher than the original BERTScore on Pearson and Spearman across both datasets. It is thus useful to treat factuality as an entailment problem and use the appropriate model.
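For reference, the Spearman coefficient reported here is simply Pearson computed on rank-transformed scores; a no-ties sketch of our own:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation (assumes no tied scores): rank-transform
    both score lists, then compute Pearson on the ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```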

Downstream Applications
Finally, we present two useful downstream applications for improving the factuality of multimodal summarization models: first, by using the metric for reference image selection to guide the model in attending to important images, and second, by using it as a reward for self-critical sequence training. For both applications, we train strong baseline models by adapting CLIP-BART (Sung et al., 2022) for multimodal summarization. Specifically, we extract visual features with CLIP and use a projection layer to transform the visual representation to the correct dimension for BART (Lewis et al., 2020). Then, the projected features are concatenated with the text features from the original encoder as the joint input representation for BART. See Appendix D for more details.
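The fusion step can be sketched with plain arrays (our own illustration; the dimensions are assumptions: BART-base uses a 768-dimensional hidden size, and we assume a 1024-dimensional CLIP image feature):

```python
import numpy as np

def fuse_features(text_feats, image_feats, W_proj):
    """Project CLIP visual features into BART's hidden size and concatenate
    them with the text encoder features along the sequence axis.
    text_feats: (n_tokens, d_model); image_feats: (n_images, d_clip);
    W_proj: (d_clip, d_model) learned projection matrix."""
    visual = image_feats @ W_proj
    return np.concatenate([text_feats, visual], axis=0)

rng = np.random.default_rng(0)
text = rng.normal(size=(12, 768))     # 12 text tokens from the encoder
images = rng.normal(size=(3, 1024))   # 3 CLIP image features
W = rng.normal(size=(1024, 768))      # stand-in for the learned projection
joint = fuse_features(text, images, W)
```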

Multimodal Visual Guidance
One well-known task is multimodal summarization with multimodal output (Zhu et al., 2020, MSMO), which incorporates the associated images with the CNN/DM articles. The authors show that previous models suffer from modality bias, as the cross-entropy loss is only based on the text modality. To help the model also attend to the vision modality, the authors propose to create visual references by ranking and selecting the most relevant input images. While the authors show improved performance by ranking the images by the ROUGE score between the corresponding caption and the reference summary, such a reference does not explicitly guide the model to generate summaries faithful to the images. We thus propose to use CLIPBERTSCORE to select the reference images. To incorporate the visual guidance into training, we add a guidance loss that minimizes a cross-entropy loss, where we consider the images selected by the reference as correct and the remaining images as incorrect. We use each image's hidden representation from the encoder to produce a binary prediction using a linear layer.
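The guidance loss is a per-image binary cross-entropy; a sketch in plain Python (helper names and logit values are ours, for illustration):

```python
import math

def guidance_loss(image_logits, reference_labels):
    """Binary cross-entropy over the per-image predictions, where images
    selected by the CLIPBERTSCORE reference are labeled 1 and the rest 0.
    image_logits: raw scores from the linear layer applied to each image's
    encoder hidden representation."""
    total = 0.0
    for z, y in zip(image_logits, reference_labels):
        p = 1.0 / (1.0 + math.exp(-z))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(image_logits)
```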
We compare against the model using ROUGE as the visual guidance. Following Zhu et al. (2018), we report the performance of the models on ROUGE, and the image precision (IP) between the model's recommended images and the human-annotated relevant images. We additionally evaluate factuality using BERTScore, FactCC, DAE, and QuestEval. The result is shown in Table 6. We observe a correspondence between the guidance metric and the resulting scores: the model with ROUGE guidance achieves higher ROUGE scores, while the model with CLIPBERTSCORE guidance improves on all factuality metrics and IP. Though the gain is relatively small, the improvement on the factuality metrics is greater than the negligible drop in ROUGE.

Self-Critical Sequence Training with CLIPBERTSCORE Reward
A more generalized application to improve factuality is to use CLIPBERTSCORE as a reward for self-critical sequence training (Rennie et al., 2017, SCST), which optimizes the model using the REINFORCE algorithm (Williams, 1992). Formally, given document d, images v, and the summary y, the self-critical loss is defined as:

L_RL = −(r(y^s) − r(ŷ)) Σ_t log p(y^s_t | y^s_{1:t−1}, d, v),

where r(·) is a reward function, y^s is the sampled summary, and ŷ is the summary obtained by greedy decoding. We follow previous works (Pasunuru and Bansal, 2018; Li et al., 2019; Parnell et al., 2021) and train on a combination of the cross-entropy loss L_XE and the self-critical loss: L = αL_RL + (1 − α)L_XE, where we set α = 0.998.
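The self-critical update can be sketched as follows (a scalar illustration of our own; in practice the log-probabilities come from the decoder and the losses are batched tensors):

```python
def scst_loss(reward_sampled, reward_greedy, sampled_logprobs):
    """REINFORCE with the greedy-decoded reward as baseline: minimizing
    this loss raises the probability of sampled summaries whose reward
    beats the greedy summary's reward, and lowers it otherwise."""
    advantage = reward_sampled - reward_greedy
    return -advantage * sum(sampled_logprobs)

def mixed_loss(l_rl, l_xe, alpha=0.998):
    # Combined objective L = alpha * L_RL + (1 - alpha) * L_XE.
    return alpha * l_rl + (1 - alpha) * l_xe
```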
We use CLIPBERTSCORE and ROUGE-2 as the rewards, so as to improve factuality while maintaining informativeness. Following Pasunuru and Bansal (2018), we alternate the rewards at each step during training, and upweight CLIPBERTSCORE by 2x (tuned on the validation set). We experiment on MSMO and the multimodal sentence summarization (Li et al., 2018, MMSS) task, which combines the Gigaword corpus (Graff et al., 2003; Rush et al., 2015) with crawled images. Since there is only one image associated with each example in MMSS, we do not add visual guidance for models trained on this dataset. As the base models, we use the CLIP-BART + CLIPBERTSCORE model trained in Section 5.1 for MSMO, and we similarly train a CLIP-BART model for MMSS. We then take the fine-tuned models and train with SCST. Details of the training can be found in Appendix D. We report the same metrics for MMSS except for IP, since the task does not contain gold image labels.
The result is shown in Table 7. We see consistent improvements over all metrics with SCST for MMSS. Specifically, we observe a 5-point improvement for FactCC and DAE, and a 4-point increase for QuestEval, while maintaining a similar ROUGE score. We observe a similar trend for training with SCST on the MSMO dataset, where SCST training improves FactCC, DAE, and QuestEval by 8 points, 5 points, and 1.5 points, respectively.
To evaluate the factuality of the summaries generated by the models trained with SCST against those of the base model, we conduct a human evaluation on 100 randomly sampled articles from the MMSS test set. We perform the same AMT experiment as described in Section 3.1. We ensure the same > $12 hourly rate and pay 0.1 USD per HIT. For each summary, we aggregate the 3 annotator scores for the document, image, and combined judgments. The final factuality score is the average across the 100 examples. The result is shown in Table 8. The model with SCST training achieves a statistically significantly better factuality score with respect to the document (p = 0.002), the image (p = 0.041), and especially the combined case (p < 0.001), using a bootstrap test (Efron and Tibshirani, 1993). This aligns with the factuality improvements we observe with the automatic factuality scores in Table 7.
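The significance test is a paired bootstrap over the 100 per-example scores; a sketch of our own (not the exact test configuration used in the paper):

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_boot=10000, seed=0):
    """Resample the per-example score differences with replacement and
    count how often system A fails to beat system B on the resample;
    that fraction estimates the one-sided p-value."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    failures = 0
    for _ in range(n_boot):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) <= 0:
            failures += 1
    return failures / n_boot
```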

Related Work
Multimodal Summarization. The task of multimodal summarization takes additional inputs from multiple modalities apart from the input text document, including images (Li et al., 2018; Zhu et al., 2020; Li et al., 2020a) and videos (Li et al., 2020c; Palaskar et al., 2019). To incorporate multiple modalities, many works have developed models with multimodal attention (Zhu et al., 2020). When multiple images are present, the rich information in the images may distract and thus hurt the model's performance. To this end, approaches such as selective gating (Li et al., 2020b), visual guidance (Zhu et al., 2020), and knowledge distillation (Zhang et al., 2022) have been proposed. While these methods have demonstrated improvements in ROUGE, to the best of our knowledge, factuality for such tasks has not been studied. We aim to provide a benchmark for evaluating factuality and demonstrate methods to improve factuality for the multimodal summarization task.
Faithfulness and Factuality Metrics. Many metrics have been proposed to evaluate the factuality of generated summaries. They can be roughly categorized into entailment-based metrics and question generation and question answering (QGQA) metrics. Entailment-based metrics (Kryscinski et al., 2020; Goyal and Durrett, 2021) are trained to predict entailment between the document and summary units, such as sentences or dependency arcs. QGQA approaches (Durmus et al., 2020; Wang et al., 2020; Scialom et al., 2021; Fabbri et al., 2022) generate questions from one source using a question generation model and then use a question answering model to answer the generated questions given the other source. Additionally, counterfactual estimation (Xie et al., 2021) and embedding-based metrics (Zhang* et al., 2020) have been explored. While significant progress has been made, the proposed metrics rely only on the document to detect hallucinations and ignore the other modalities. We thus propose CLIPBERTSCORE, which addresses the missing modalities while maintaining similar or higher correlations with human judgments of factuality for the document and multimodal summarization tasks. Meta-evaluations, differing in size and dataset, have also been proposed to evaluate such metrics for text summarization (Fabbri et al., 2021; Maynez et al., 2020; Wang et al., 2020; Kryscinski et al., 2020; Pagnoni et al., 2021). Our MUFAME is a similar effort but is the first meta-evaluation proposed for the multimodal summarization task.

Conclusion
In this work, we present CLIPBERTSCORE, an automatic metric for evaluating factuality for multimodal abstractive summarization. Through meta-evaluation with MUFAME and additional factuality benchmarks, we show that CLIPBERTSCORE and its modules correlate well with the human judgment of factuality with respect to the document, the image, and the combined setting. CLIPBERTSCORE is robust across different image and text domains and achieves competitive correlations in the zero-shot setting compared with more complex metrics. We hope this work provides a meta-evaluation for future multimodal factuality metrics with MUFAME, a strong baseline metric, CLIPBERTSCORE, to compare against, and two methods to improve the factuality of multimodal abstractive summarization models.

Limitations
We limit our work to the task that only contains the vision modality, through images, and the text modality. However, we note that multimodal summarization can also contain video and audio, which we leave for future work. Furthermore, similar to all pre-trained models, CLIPScore and BERTScore are known to reflect biases of their pre-training data (Hessel et al., 2021; Agarwal et al., 2021), leading to some incorrect predictions. Our work is also focused on datasets in English. Ladhak et al. (2020) proposed a multi-lingual WikiHow dataset by aligning the steps from various languages with the images, and thus our work could be extended to other languages by adding the images to that dataset.

A.2 Model Details

T5. We fine-tune T5 (Raffel et al., 2020) without the images. The total number of parameters is around 60 million. We use mixed precision, and training was performed on 2 NVIDIA RTX A6000 GPUs for approximately 6 hours.
PEGASUS.PEGASUS (Zhang et al., 2020) is another encoder-decoder model specifically designed for the abstractive summarization task by imitating the summarization setup during pretraining.We use PEGASUS-large checkpoint and fine-tune without the images.The total number of parameters is around 571 million.Training was performed on a single NVIDIA RTX A6000 GPU for approximately 28 hours.
CLIP-BART. The architecture of CLIP-BART is described in Section 5. The total number of parameters is around 140 million. We fine-tune the model starting from the BART-base checkpoint, and use the CLIP RN50x64 visual encoder to extract image features. We use mixed precision, and training was performed on a single NVIDIA RTX A6000 GPU for approximately 6 hours.

MOF. MOF (Zhu et al., 2018) is a multimodal summarization model with multimodal attention (Li et al., 2018). The model consists of a single-layer unidirectional LSTM (Hochreiter and Schmidhuber, 1997) with an embedding dimension of 256 and a hidden dimension of 512 for the text encoder and text decoder. The multimodal attention is computed by concatenating the textual attention layer and a visual attention layer over the visual features, extracted from the global fc7 layer of VGG19 (Simonyan and Zisserman, 2015). The total number of parameters is around 83M. Training was performed on a single NVIDIA RTX A6000 GPU for approximately 40 hours.

A.3 Annotation Details
Figure 3 shows a screenshot of the annotation task on AMT.

B Meta-Evaluation with Continuous Labels
We also experiment with combining the two judgments in a continuous way, by taking the average of the two judgments, so that a score of 0.5 indicates that the summary is faithful to only one modality. The combined judgment is shown in Table 10. While the correlations are higher overall for all metrics, the trend is similar to the binary setting: CLIPBERTSCORE remains competitive with the fine-tuned metric, Triplet Net + BERT-S, indicating the effectiveness and simplicity of our metric.
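The averaging scheme described above is a one-liner; a minimal sketch of our reading of the setup:

```python
# Sketch of the continuous combination: average two binary faithfulness
# judgments (1 = faithful, 0 = not), so 0.5 means the summary is faithful
# to exactly one of the two modalities.
def combined_judgment(doc_faithful, image_faithful):
    return (doc_faithful + image_faithful) / 2

print(combined_judgment(1, 0))  # faithful to one modality only -> 0.5
```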
from the same WikiHow category, and Similarity selects the top-3 most similar images from different articles, with similarity computed using FAISS (Johnson et al., 2019). The three sampling strategies provide settings with increasing difficulty in terms of the negative summaries; random is the easiest setting and similarity is the hardest. Depending on which modality we provide as the prompt, we can further break down the task and evaluate the metric's performance with different combinations of modalities. We use the same sets of metrics described in Section 3.1 for each modality combination. FactCC and DAE produce binary labels and are thus at a disadvantage in the ranking experiment, so we use the probability of the factual label for FactCC and the token error for DAE. To explore how a larger visual backbone can improve image-summary factuality detection, we compare against the original CLIP-S that uses the ViT-B/32 backbone.
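The ranking evaluation itself is simple: an example counts as correct when the metric scores the reference summary above every sampled negative. A hedged sketch of this protocol, using a toy word-overlap scorer as a stand-in for CLIP-S/BERT-S/CLIPBERTSCORE (all names here are illustrative, not from the released code):

```python
# Sketch of the ranking-accuracy protocol: for each (prompt, reference,
# negatives) example, the metric is "correct" iff the reference summary
# receives a strictly higher score than every negative.
def ranking_accuracy(examples, score):
    correct = 0
    for prompt, reference, negatives in examples:
        ref_score = score(prompt, reference)
        if all(ref_score > score(prompt, neg) for neg in negatives):
            correct += 1
    return correct / len(examples)

# Toy stand-in metric: word overlap between prompt and candidate.
def overlap(prompt, candidate):
    return len(set(prompt.split()) & set(candidate.split()))

examples = [
    ("wash the car with soap", "wash the car", ["bake a cake", "mow the lawn"]),
    ("fold the paper in half", "fold the paper", ["wash the car", "paint a wall"]),
]
print(ranking_accuracy(examples, overlap))  # -> 1.0 on this toy data
```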
Comparison with VGSI. As described in Section 4.1, the difference between VGSI and WikiHowFact is what information is provided as the prompt and the choices. For VGSI, we use the step sentence, i.e., the summary, as the prompt and ask the models to select the correct image. Since the document is not provided, we use CLIP-S to calculate the score for each summary-image pair. We show the results in Table 11. We see the same surprising result that CLIP-S with the ViT-B/32 backbone achieves better ranking accuracy than the Triplet Net model trained on the training split. Increasing the capacity of CLIP-S with RN50x64, the ranking accuracy improves by 4 points for random, and 8 points for category and similarity, approaching human performance, especially in the random case. Additionally, when comparing the performance of the same model on WikiHowFact and VGSI, the ranking accuracies for VGSI are much higher, indicating that WikiHowFact is more difficult.

C.2 FOIL
We explore the ability of CLIP-S to differentiate hallucinated captions. The FOIL (Shekhar et al., 2017) dataset measures how well a metric can distinguish hallucinated captions from correct ones. The task uses MSCOCO reference captions and adversarially swaps out noun phrases to create hallucinated captions, yielding 64K test pairs. One benefit of captioning tasks is that the captioning data contain references that can be treated as the document in our setting, and thus enable evaluation.

determining the α from {0.90, 0.95, 0.99, 0.995, 0.998, 0.999, 1.0}, where we find 0.998 to perform the best. We then tune the weight of CLIPBERTSCORE from {1, 2, 5} and find that 2 performs the best for both datasets.
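The grid search over α described above can be sketched as follows. This is an assumed form of the weighted combination (α weighting the document-summary BERT-S score against the image-summary CLIP-S score, with α chosen to maximize correlation on a dev set); the function names are illustrative, not from the released code:

```python
# Assumed form of the CLIPBERTSCORE combination: alpha weights the
# document-summary score (BERT-S) against the image-summary score (CLIP-S).
def clipbertscore(bert_s, clip_s, alpha):
    return alpha * bert_s + (1 - alpha) * clip_s

def tune_alpha(dev_pairs, candidates, correlation):
    """Pick the alpha whose combined scores best correlate with dev labels.

    dev_pairs: list of (bert_s, clip_s, human_label) triples.
    correlation: callable(scores, labels) -> higher is better.
    """
    best_alpha, best_corr = None, float("-inf")
    for alpha in candidates:
        scores = [clipbertscore(b, c, alpha) for b, c, _ in dev_pairs]
        labels = [y for _, _, y in dev_pairs]
        corr = correlation(scores, labels)
        if corr > best_corr:
            best_alpha, best_corr = alpha, corr
    return best_alpha
```

With the paper's grid, `tune_alpha(dev, [0.90, 0.95, 0.99, 0.995, 0.998, 0.999, 1.0], pearson)` would select 0.998 on the authors' dev data.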

Figure 1 :
Figure 1: Illustration of the computation of CLIPBERTSCORE. CLIP-S and BERT-S compute the image-summary and document-summary factuality scores, respectively, and the final score is a weighted combination of the two.

Figure 2 :
Figure 2: An example of the WikiHowFact task. Given the image and document, the metric needs to select the correct summary C. The task can be split into image-summary and document-summary evaluation by providing only the respective input.

Figure 3 :
Figure 3: Screenshot of the annotation interface on AMT.

Table 1 :
Pearson correlation coefficients between automatic metrics and human judgments of factuality with respect to the document, image, and combined.The top section corresponds to factuality metrics for document summarization, the middle section corresponds to image-summary factuality metrics, and the bottom section corresponds to multimodal metrics.

Table 2 :
Meta-evaluation results with different combination methods.

Table 3 :
WikiHowFact ranking accuracy given different input modalities. CLIPBERTSCORE shows the largest positive transfer when the modalities are combined, outperforming RefCLIP-S in all settings and Triplet Net + BERT-S in the random and category settings.

Table 6 :
Results with different guidance strategies. DAE: lower is better (↓). For reference, the SOTA model UniMS (Zhang et al., 2022) achieves 42.94 for R1, 20.50 for R2, and 69.38 for image precision (IP). CLIPBERTSCORE as a guidance improves IP and all factuality metrics with a minor decrease in ROUGE.

Table 7 :
SCST results on MMSS and MSMO. DAE: lower is better (↓). We train a CLIP-BART model as the base model for MMSS, and use CLIP-BART with CLIPBERTSCORE guidance as the base model for MSMO. We observe consistent improvement on all metrics with SCST over the base model on MMSS, and consistent improvement on all factuality metrics on MSMO. For reference, the SOTA model on MMSS by Li et al. (2020b) achieves 48.19/25.64/45.27 for ROUGE.

Table 8 :
Human evaluation results on MMSS. Models with SCST training are statistically significantly more factual with respect to the document, the image, and both.

Table 10 :
Pearson correlation coefficients between automatic metrics and human judgments of factuality with respect to the continuous combined judgment.

Table 11 :
Original WikiHow VGSI results. Results with * are taken from the original paper.
Random consists of 153,961 examples, similarity consists of 153,770 examples, and category contains 153,961 examples.