GO FIGURE: A Meta Evaluation of Factuality in Summarization

While neural language models can generate text with remarkable fluency and coherence, controlling for factual correctness in generation remains an open research question. This major discrepancy between the surface-level fluency and the content-level correctness of neural generation has motivated a new line of research that seeks automatic metrics for evaluating the factuality of machine text. In this paper, we introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. We propose five necessary and intuitive conditions to evaluate factuality metrics on diagnostic factuality data across three different summarization tasks. Our benchmark analysis on ten factuality metrics reveals that our meta-evaluation framework provides a robust and efficient evaluation that is extensible to multiple types of factual consistency and standard generation metrics, including QA metrics. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.


Introduction
The goal of text generation systems is to produce text that is fluent, coherent, relevant, and factually correct. Recent progress in neural approaches to building semantically constrained text generation systems has shown tremendous improvements in this direction (Liu and Lapata, 2019; Guo et al., 2018; Durmus et al., 2020; Wang et al., 2020). However, an important issue in text generation systems is that they can yield factually inconsistent text, caused by distorted or fabricated facts about the source text. Especially in document summarization tasks, models that abstract away salient aspects have been shown to generate text with up to 30% factual inconsistencies (Kryscinski et al., 2019; Falke et al., 2019a; Zhu et al., 2020).

* Work done while the first author was interning at MSR.
Commonly used metrics for measuring quality of generated text fail to capture structural aspects of language like negation and poorly correlate with human judgements (Hashimoto et al., 2019;Clark et al., 2019;Sellam et al., 2020), leading to a rapidly progressing search for factuality-driven summarization metrics.
In this work, we propose GO FIGURE, a meta-evaluation framework for assessing the effectiveness of factuality metrics across multiple domains: extreme summarization, multi-sentence news summarization, and the understudied dialogue summarization domain. Our contributions are as follows: (i) a set of diagnostics for measuring the sensitivity of metrics to factual inconsistency, (ii) a diagnostic evaluation dataset of context/summary pairs for measuring the effectiveness of new factuality metrics in a controlled setting, and (iii) an evaluation dataset of summaries generated by transformer-based models (Raffel et al., 2019), annotated with types of factual errors.

Factuality Metric Meta Evaluation
Since reference summaries may be unavailable, or may be an incomplete representation of the salient facts in a source document, we consider factuality in terms of how well candidate summaries are factually grounded with respect to the source document.
We define a set of five conditions for a factual consistency metric M(D, S_i) to measure the factuality of a summary S_i with respect to a source document D. These conditions are given in Table 1.

Testing Factuality Metric Validity
Table 1: Conditions for a valid factuality metric M(D, S_i).

Boundedness (I): The metric value for S_i should fall between an empirical lower bound and upper bound. In general, the exact factuality level of S_i may be unclear; metric bounds provide points of comparison.

Sensitivity (II): The metric value for S_i should correlate with the level of factuality captured by S_i. A bounded but insensitive factuality metric may assign higher values to mostly nonfactual or unrelated summaries than to summaries that are close to the reference.

Robustness (III): The metric should be robust across types of factual errors. A metric that is sensitive only to a subset of errors might ignore a significant number of model-generated errors (Figure 1).

Generality (IV): The metric should satisfy Conditions I, II, III and V across domains. Prior work such as Reiter and Belz (2009) highlights the risk of claiming validity without testing generality.

Human Correlation (V): The metric should correlate with human judgements of factuality. The scoring function H(D, S_i) represented by human evaluation is a gold standard for assessment of generation quality (Chaganty et al., 2018), so M(D, S_i) should approximate it.

For the purposes of testing boundedness (Condition I), we define the Lower Bound for a metric M as M(D, S_r), where D is the source document and S_r is a randomly sampled summary from the corpus.² We define the Upper Bound for the metric as M(D, S_f), where S_f is the reference ground-truth summary. Since our controlled experiments use transformed versions of the reference summary with injected errors, the original reference is guaranteed to be at least as factually consistent as a transformed summary.
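As a rough sketch, the bound check above can be implemented as a small harness around any metric M(D, S); the toy overlap metric, documents, and summaries used below are illustrative placeholders, not the paper's metrics or data:

```python
import random

def check_boundedness(metric, doc, reference, candidates, corpus_summaries, seed=0):
    """Empirically bracket a factuality metric M(D, S) between its lower
    bound M(D, S_r) (a randomly sampled corpus summary) and its upper
    bound M(D, S_f) (the reference summary), as in Condition I."""
    rng = random.Random(seed)
    s_r = rng.choice(corpus_summaries)   # randomly sampled summary
    lower = metric(doc, s_r)             # empirical lower bound
    upper = metric(doc, reference)       # empirical upper bound
    # True for each candidate whose score falls within the bounds
    return {s: lower <= metric(doc, s) <= upper for s in candidates}
```

Any concrete metric (QA-based, cloze-based, or n-gram-based) can be passed in as `metric`, so the same harness applies across the metrics evaluated later.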
To test sensitivity (Condition II), we measure the correlation (Pearson's r) between the factual inconsistency level³ of the summaries (i.e. the number of injected errors) and the average metric score. Then we measure statistical significance using the p-value from a two-tailed hypothesis test. We check whether metrics satisfy robustness and generality (Conditions III and IV) by separately running this analysis over multiple domains and the factual error types shown in Figure 1. We measure how well metric values match human assessment of factuality by checking the correlation between metric scores and factual consistency levels determined using manual annotation.
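The sensitivity test can be sketched as follows; Pearson's r is computed by hand here to keep the snippet self-contained (in practice, `scipy.stats.pearsonr` also returns the two-tailed p-value used for the significance test):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def sensitivity(scores_by_level):
    """Correlate the injected-error level with the average metric score
    (Condition II).  scores_by_level maps the number of injected errors
    to a list of metric scores for summaries at that level.  A valid
    factuality metric should yield a strongly negative r."""
    levels = sorted(scores_by_level)
    means = [sum(scores_by_level[l]) / len(scores_by_level[l]) for l in levels]
    return pearson_r(levels, means)
```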

Theoretical Cases
For Condition I, we scope boundedness to only consider cases that are likely to arise in realistic summarization settings. However, there are hypothetical cases that may have ramifications for metric validity. For example, we expect that M(D, D) ≈ 1 and M(D, ∅) ≈ 0 for a metric M with values in the range [0, 1], a document D, and an empty string summary ∅. For non-deterministic metrics, restrictions on variability between runs may also be desired.

² While this may not be the strictest lower bound in theoretical terms, we consider it appropriate as an empirical lower bound since the content is irrelevant to the document. A single random summary is used.
³ For our experiments, we inject up to a maximum of x errors with x ∈ {1, 2, 3}.

Figure 1 caption (excerpt): ...sampled generated summaries (96.37% of all errors). We draw from the same error types for our controlled analysis to ensure we match the true distribution of errors. Here extrinsic entity refers to entities that did not previously appear in the source, while an intrinsic entity appeared in the source.

Our experiments cover three English summarization domains: the XSUM extreme summarization dataset, the CNN/DailyMail dataset (Nallapati et al., 2016), and the recently released SAMSUM corpus (Gliwa et al., 2019), consisting of English-language conversations written by linguists and aligned multi-sentence summaries.
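The degenerate cases M(D, D) ≈ 1 and M(D, ∅) ≈ 0 can be checked with a small harness such as the following sketch (the tolerance value is an illustrative choice, not from the paper):

```python
def check_degenerate_cases(metric, doc, tol=0.05):
    """Hypothetical sanity checks for a [0, 1]-valued metric M:
    M(D, D) should be near 1 and M(D, "") near 0."""
    return metric(doc, doc) >= 1 - tol and metric(doc, "") <= tol
```

A metric under evaluation only needs to tolerate an empty summary for this check to apply; non-deterministic metrics would additionally need a variance check across repeated runs.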

Diagnostic Datasets
To test the ability of proposed metrics to fulfill our predefined conditions, we set up two diagnostic datasets consisting of (i) transformed reference summaries with simulated factuality errors, which allow us to induce and measure factuality levels in a controlled setting, and (ii) summaries generated by state-of-the-art transformer summarization models, which allow us to measure the effectiveness of metrics in a real-data setting. We sample 500 source/summary pairs for each domain.

Model-Generated Datasets
In order to observe how metrics perform on machine-generated summaries, we generate summaries from fine-tuned T5 encoder-decoder summarization models (Raffel et al., 2019) that were pretrained on news summarization data. We generate summary text using either beam search or sample-based decoding strategies. We then annotate the generated summaries for fine-grained factual errors using the types in Figure 1 to create a hand-curated factual consistency diagnostic dataset.

Factuality Metrics for Evaluation
We mainly focus on meta-evaluating recently proposed factual consistency metrics that use two types of proxy natural language understanding (NLU) objectives aimed at implicitly capturing factuality in generated text: question answering (QA) and a masked-token-prediction cloze task. For QA, we evaluate SummaQA (which uses QA pairs from the source; Scialom et al., 2019) and FEQA (which uses QA pairs from the summary; Durmus et al., 2020), while for the cloze-task setting we use BLANC-Help and BLANC-Tune (Vasilyev et al., 2020; see the appendix for details of the metrics). We also measure the factual awareness of BERTScore (Zhang et al., 2020), a summarization metric aimed primarily at improving coherency rather than factual consistency, and of standard summarization evaluation metrics (e.g. ROUGE; Lin, 2004).

Controlled Data Experiments
We provide the results of the sensitivity analysis over our controlled data on the XSUM domain in Table 2, on CNNDM in Table 3, and on SAMSUM in Table 4. Our analysis reveals that QA metrics, ROUGE-(2/3) and BERTScore generally perform well at evaluating factuality. In contrast, ROUGE-(1/L) is frequently invalid as a factuality metric (Tables 2 and 3), and the performance of cloze metrics varies across domains (BLANC-Tune is invalid on XSUM, but does fairly well on other domains). Also, the performance of metrics tends to be much lower on news domains when we consider non-entity-based errors, with the exception of QA-based metrics, ROUGE-(2/3) and BERTScore, indicating that while factuality and standard metrics are fairly attuned to changes in factual consistency that relate to entity-based errors, they are less robust to other types of factual errors.

Comparison with Human Evaluation of Model Generations
We find that metrics displaying invalid behavior on controlled data (for instance, assigning higher metric values to more factually inconsistent summaries on XSUM in Table 2) also display this invalid behavior on model generations (Table 5). This indicates that meta-evaluation with controlled data is effective as a diagnostic tool for finding weak factuality metrics, and follows our intuition that non-entity errors, while frequently produced by abstractive summarization models, are difficult for standard summarization metrics to identify. When considering better-performing factuality metrics identified by the controlled error analysis, we find that the controlled data analysis is generally able to identify the better-performing metrics (SummaQA, ROUGE-(2/3) and BERTScore) for XSUM, with the exception of FEQA (FEQA performs well in the XSUM controlled analysis (Table 2), but only approaches this performance on SAMSUM when we consider human evaluation). The strong overall performance of ROUGE-3 is consistent with the findings of Fabbri et al. (2021) on CNNDM; our work confirms that this metric is more consistently correlated with factuality than other ROUGE variations across domains.

Table 5 caption (excerpt): ...SAMSUM generated summaries with fine-grained labeling. The arrow next to "Corr" indicates the direction of a correct correlation.

Related Work
Prior work concerning evaluation of automatic metrics and human evaluation for NLG systems has mainly focused on general analysis of output quality or coherence and fluency (Callison-Burch et al., 2007;Graham, 2015;Fabbri et al., 2021), rather than factuality. Recent efforts by NLP researchers have drawn attention to the issue of factual errors and hallucinations in the output of neural summarization models (Cao et al., 2018;Massarelli et al., 2019;Zhao et al., 2020;Falke et al., 2019b;Goodrich et al., 2019;Celikyilmaz et al., 2020). A number of works have highlighted the effectiveness of QA and cloze task objectives for evaluating or improving factuality on specific domains (Eyal et al., 2019;Huang et al., 2020). We aim to evaluate these metrics more broadly, and consider a wider range of domains (notably dialogue).

Discussion of Meta Evaluation and Conclusion
Our analyses show that, in contrast to prior work on factual consistency that mostly concentrated on one specific domain and dataset, the GO FIGURE framework is effective at evaluating the sensitivity and validity of factual consistency metrics using only reference summaries, rather than requiring computationally intensive testing across summarization model variants to identify metric strengths and shortcomings. We highlight the following key points from experiments run using meta-evaluation:

Standard summarization metrics are not always valid measures of factuality. ROUGE-1 and ROUGE-L fail to accurately measure factual inconsistency across domains in our controlled analysis. The ROUGE-L results raise the question of context relevance. While ROUGE-L takes into account more context than other ROUGE variations, this context may not be relevant for assessing factuality. For example, replacing "increased" with "decreased" dramatically changes the meaning of the summary "Scotland's renewable energy output increased by 45% in the first quarter of this year, compared with the same period last year.", but ROUGE-L is largely unaffected. Despite the frequent use of ROUGE-L as a more contextual measure, prior work has also noted that ROUGE-N outperforms ROUGE-L (Rankel et al., 2013; Fabbri et al., 2021).
Analysis on human-annotated data is still necessary as an upper bound on meta-evaluation quality. While BLANC-Help, FEQA and BERTScore values decrease with factual inconsistency on controlled data, these metrics may sometimes be positively correlated with factual inconsistency on generated data. This emphasizes the importance of an expert-curated test set as part of the GO FIGURE meta-evaluation for the most rigorous testing.

A question-answering objective is promising for measuring factual consistency across domains, but effectiveness depends on the question. While QA metrics can perform well at measuring the factual consistency of generated summaries, our meta-evaluation reveals that this depends on the way in which questions are asked. While both QA metrics use SQuAD-based systems (Rajpurkar et al., 2016), asking questions from the source rather than the summary is most robust across domains. This opens the door to metrics based on more contextual QA, such as commonsense QA (Shwartz et al., 2020).
We will release our meta-evaluation framework and diagnostic datasets to aid in development of effective summarization factuality metrics. In future work, summary meta-metric results (e.g. correlation on simulated data) could be used as rewards for reinforcement learning driven approaches to training factuality metrics.

Ethics and Broader Impact Statement
Ethical considerations involving our meta-evaluation framework primarily revolve around human evaluation. News articles and dialogues may contain references to distressing events or abnormal social behavior. All our expert annotators voluntarily took part in the human evaluation with prior knowledge of the type of content being evaluated. Crowd-sourced human evaluation trials were conducted under an IRB exemption.
Our work outlines a simple and effective approach for evaluating factuality metrics in summarization. This can aid in development of more robust and sensitive factuality metrics to accurately evaluate the factual correctness of generative models. This is key as improvement in the coherency of models accelerates, potentially leading to generations that appear to be high quality while containing factual inaccuracies. Our framework could also evaluate factuality metrics for use in identifying human-written errors, mitigating potential spread of misinformation.

A.1 Additional Details of Datasets
We provide dataset statistics for each of our domains in Table 6.

A.2 Evaluation Metric Details
QA-Based Quality Score. Given a source or reference document D and candidate summary S_i, QA-based evaluation metrics assign a generation quality score to S_i by measuring the ability of a QA system to accurately answer questions generated from D or S_i. We use the SummaQA (Scialom et al., 2019) and FEQA (Durmus et al., 2020) metrics. For the SummaQA metric, questions are generated from the source document D and the candidate summary S_i is used as input to the QA system. Conversely, FEQA generates questions from S_i and uses D to answer them. The generation quality score is typically the aggregated F1 score measuring the similarity between ground-truth answers for questions generated from D and the answers predicted by the QA system. SummaQA also generally includes the aggregated model confidence probabilities for predictions.
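A minimal sketch of this aggregation, using SQuAD-style token-level F1 (the exact scoring details of SummaQA and FEQA differ; `answer_fn` below stands in for the underlying QA system and is an assumption of this sketch):

```python
from collections import Counter

def token_f1(gold, pred):
    """SQuAD-style token-level F1 between a gold and a predicted answer."""
    g, p = gold.lower().split(), pred.lower().split()
    overlap = sum((Counter(g) & Counter(p)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def qa_metric_score(qa_pairs, answer_fn):
    """Aggregate F1 over (question, gold_answer) pairs.  answer_fn is a
    QA system reading the summary (SummaQA-style) or the source
    document (FEQA-style)."""
    return sum(token_f1(gold, answer_fn(q)) for q, gold in qa_pairs) / len(qa_pairs)
```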
Masked LM Prediction (Cloze Task) Score. Given a source document D and candidate summary S_i, cloze-based evaluation metrics assign a generation quality score to S_i by measuring the ability of an NLU system to accurately predict masked tokens in the source document, given access to the information in S_i. We use two variants of BLANC (Vasilyev et al., 2020), BLANC-Help and BLANC-Tune. BLANC-Help uses both D and S_i as input to a pretrained masked token prediction model, while BLANC-Tune only uses D as input to a model that has been fine-tuned on the candidate summary. Both metrics are aimed at capturing fluency, informativeness and factual correctness of summaries.
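A simplified sketch of the cloze-based scoring idea, in the spirit of BLANC-Help: the real metric uses a pretrained masked language model and more elaborate accuracy accounting, so `predict` here is a stand-in model and the filler string an illustrative assumption:

```python
def blanc_style_score(masked_items, predict, summary, filler="."):
    """How much does access to the summary improve masked-token
    prediction on the source?  masked_items is a list of
    (masked_sentence, gold_token) pairs; predict(prefix, masked_sentence)
    returns the predicted token for the [MASK] slot."""
    with_summary = sum(predict(summary, s) == gold for s, gold in masked_items)
    without = sum(predict(filler, s) == gold for s, gold in masked_items)
    # Positive scores mean the summary helped reconstruct the source.
    return (with_summary - without) / len(masked_items)
```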
Semantic Similarity. Semantic similarity metrics measure the overlap between contextual embeddings of a source or reference document D and candidate summary S i . We use BERTScore (Zhang et al., 2020), which has been shown to correlate better with human judgements of coherency than standard summarization metrics and similarly to n-gram metrics on factual consistency of CNNDM summaries (Wang et al., 2020).
Lexical Overlap. Finally, we test ROUGE (Lin, 2004), which is the standard metric used for evaluating summarization. ROUGE measures the n-gram overlap between a source or reference document D and candidate summary S_i. We evaluate results using ROUGE-1 and ROUGE-2, as well as ROUGE-L, which measures longest common subsequence overlap. We follow prior work that considered ROUGE in factual consistency evaluations (Wang et al., 2020), though it has also been previously noted that ROUGE can underweight good summarization examples (Novikova et al., 2017).
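A self-contained sketch of ROUGE-N F1 via n-gram multiset overlap (omitting the stemming and bootstrap resampling of the official implementation):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_f1(reference, candidate, n=2):
    """ROUGE-N F1: clipped n-gram overlap between reference and candidate."""
    ref = Counter(ngrams(reference.lower().split(), n))
    cand = Counter(ngrams(candidate.lower().split(), n))
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)
```

This makes concrete why higher-order variants such as ROUGE-2/3 are more factuality-sensitive: a single swapped word destroys every n-gram that spans it, whereas unigram overlap barely changes.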

A.3 Simulated Data Transformations
We inject errors into reference summaries by first using a part-of-speech tagging model and named entity recognition system (spaCy) to extract entities, verbs, and adjectives from these summaries. For each named entity, we keep track of the label type (e.g. ORG, GPE, etc.). All datasets consist of English-language articles or dialogues and summaries, and we use the spaCy English NLP models.
Intrinsic entity errors. To inject intrinsic entity errors into a summary S, we construct a dictionary of all unique entities appearing in the source document for S only, organized by entity label type. We then swap a random entity in the reference summary for a different entity of the same label type in the constructed dictionary.
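Assuming entities have already been extracted (e.g. with spaCy NER), the intrinsic swap can be sketched as follows; the function signature and pre-extracted inputs are assumptions of this sketch:

```python
import random

def intrinsic_entity_swap(summary, summary_entities, source_entities_by_type, seed=0):
    """Swap one entity in the summary for a same-label entity from the
    source document (intrinsic entity error injection).
    summary_entities: list of (surface_text, label) found in the summary.
    source_entities_by_type: dict mapping label -> entity strings from
    the source document only."""
    rng = random.Random(seed)
    rng.shuffle(summary_entities)  # pick a random entity to corrupt
    for text, label in summary_entities:
        options = [e for e in source_entities_by_type.get(label, []) if e != text]
        if options:
            return summary.replace(text, rng.choice(options), 1)
    return summary  # no valid same-label swap available
```

The extrinsic variant is identical except that `source_entities_by_type` is built over all source documents in the corpus rather than the single source document.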
Extrinsic entity errors. For extrinsic entity errors, we use the same dictionary construction for all unique entities appearing in all the corpus source documents.
Adjective errors. To change a random adjective, we use WordNet (Miller, 1995) to obtain the synsets for that adjective and swap the adjective for its antonym.
Pronoun entity errors. Pronoun errors are introduced with a preset list of commonly used pronouns. We randomly extract a pronoun set (e.g. she/her) from the text using the preset list and swap it with another random pronoun set (e.g. he/him).
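A minimal sketch of the pronoun-set swap; the preset pronoun list here is illustrative and shorter than the one used in the paper:

```python
import random

# Illustrative preset list of (subject, object) pronoun sets.
PRONOUN_SETS = [("she", "her"), ("he", "him"), ("they", "them")]

def pronoun_swap(summary, seed=0):
    """Replace the first pronoun set found in the summary with a
    different randomly chosen set (pronoun error injection)."""
    rng = random.Random(seed)
    tokens = summary.split()
    for subj, obj in PRONOUN_SETS:
        if subj in tokens or obj in tokens:
            new_subj, new_obj = rng.choice(
                [p for p in PRONOUN_SETS if p != (subj, obj)])
            return " ".join(new_subj if t == subj else new_obj if t == obj else t
                            for t in tokens)
    return summary  # no pronouns to swap
```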
Verb Negation. We use a rule-based system for verb negation based on verb tense, and predict tense based on the suffix and preceding words.  We note that injecting a certain level of error into a summary will have varying effects depending on the average length of summaries for a corpus. We use the same methodology for each corpus to maintain consistency, but future work may explore length-controlled error injection based on the objectives of the evaluation.

A.4 Metric Implementation Details
For all metrics, we use the publicly shared implementations. Due to BERT context-size constraints, we limit the length of document input sentences to 400 tokens for the BLANC variants. We use RoBERTa-large for BERTScore.

A.5 T5 Training
We fine-tune the T5-base model (220M parameters) trained on news summaries for each domain using the AdaFactor optimizer (Shazeer and Stern, 2018) with a learning rate of 0.001 and a batch size of 8. The learning rate was tuned using ROUGE score on a dev set, and we experimented with learning rates in the range of [0.01,0.0001]. All other hyperparameters follow from the original T5 paper. Best performing models were trained using one random seed on NVIDIA V100 GPUs.

A.5.1 Human Annotation Layout
For human annotation of factual consistency in summaries, we show the source document, the reference summary, and a candidate summary to be assessed for factuality. We then ask a factuality question with three choices:
• Yes (i.e. the summary is factual)
• No (i.e. the summary contains factual inconsistencies)
• Not Sure (i.e. the summary is too incoherent to judge)
If a summary is judged to be factually incorrect, annotators select the number and type of errors they observe using a predefined list of factual errors. A screenshot of the error types and examples shown in the annotation task is given in Figure 2. For less obvious cases of factual inconsistency (for example, when summaries contain locations or political figures that require regional background knowledge), we check factuality using external knowledge bases to ensure correctness of annotation. We also adhere to a strict binary notion of factuality when deciding cases where summaries are imprecise but ambiguous in terms of correctness, opting to label these summaries as factually inaccurate. If summaries are completely incoherent, we treat them as having the highest level of factual inconsistency.
We validated the effectiveness of the setup by computing inter-annotator agreement of in-house expert annotators for 30 XSUM summaries. We achieve "fair" agreement of Krippendorff's α = 0.32 with 3 annotators and "moderate" agreement of α = 0.44 with 2 annotators (Landis and Koch, 1977; Ageeva et al., 2015). The remaining annotations are done by one in-house expert annotator.

Table caption (excerpt): ...for the same 500 reference summaries. We provide results for the average number of induced factuality errors for factual inconsistency level 1 (L1), level 2 (L2) and level 3 (L3), as well as the percentage (%) of summaries that were transformed for each level and across all levels (All). We split the diagnostic dataset into two subsets based on whether simulated errors are related to entities (Entity) or non-entity changes like verb negation (Non-Entity).