Are Factuality Checkers Reliable? Adversarial Meta-evaluation of Factuality in Summarization

With the continuous upgrading of summarization systems driven by deep neural networks, researchers have higher requirements on the quality of generated summaries, which should be not only fluent and informative but also factually correct. As a result, the field of factuality evaluation has developed rapidly in recent years. Despite initial progress in evaluating generated summaries, the meta-evaluation methodologies for factuality metrics are limited by their opacity, leading to an insufficient understanding of factuality metrics' relative advantages and their applicability. In this paper, we present an adversarial meta-evaluation methodology that allows us to (i) diagnose the fine-grained strengths and weaknesses of 6 existing top-performing metrics over 24 diagnostic test datasets, and (ii) search for directions for further improvement through data augmentation. Our observations from this work motivate us to propose several calls for future research. We make all code, diagnostic test datasets, and trained factuality models available at: https://github.com/zide05/AdvFact.


Introduction
With the rapid development of neural networks in text summarization (Liu and Lapata, 2019; Zhong et al., 2019; Zhang et al., 2019; Lewis et al., 2019; Zhong et al., 2020; Liu and Liu, 2021), especially the use of contextualized pre-trained models (Devlin et al., 2019; Lewis et al., 2019), the state-of-the-art performance, measured by automated metrics such as ROUGE (Lin, 2004) and BERTScore (Zhang et al., 2020), has been repeatedly advanced. However, although these systems can generate informative and fluent summaries, they suffer from the problem of making factual errors: generating incorrect facts that cannot be supported by the source document (Cao et al., 2018a).
Against this background, a large body of recent work (Wang et al., 2020a; Kryscinski et al., 2020; Durmus et al., 2020; Cao et al., 2020) has sought new automated metrics that can assess the factuality of generated summaries, since existing metrics (e.g., ROUGE) do not correlate well with factual consistency (Maynez et al., 2020; Goyal and Durrett, 2020).
Generally, the design of these factuality evaluation metrics is formulated as different NLP tasks, ranging from text entailment (Falke et al., 2019; Kryscinski et al., 2020) at the sentence level or a more fine-grained level (Goyal and Durrett, 2020) to question answering (Wang et al., 2020a; Durmus et al., 2020). Improving our understanding of these factuality metrics with diverse paradigms is critical for further metric improvement. However, the meta-evaluation methodologies for factuality metrics are limited by their opacity: their results are usually holistic scores (e.g., accuracy) that are not interpretable. Unlike traditional non-learnable metrics such as ROUGE, whose scores are relatively straightforward to interpret (e.g., lower ROUGE-2 Recall implies fewer bigrams from reference summaries are covered by generated summaries), there are diverse factors that can lead to a lower score from a factuality metric (e.g., entity replacement, numerical inference). Most existing meta-evaluation strategies fail to tell (i) which types of factual errors the metric at hand is better at identifying, and (ii) to which error categories the metric's error-recognition ability does not generalize well. As a result, (1) the relative advantages of better- over worse-performing metrics w.r.t. factuality are unclear;
(2) the limited understanding of factuality metrics' applicability reduces their reliability, since users risk over-estimating their generalization ability and applying them to inappropriate evaluation samples; and (3) it is unclear how to further improve these metrics.
Thus, instead of further pursuing a new method, we take a step back to understand the shortcomings of existing metrics. We present an adversarial meta-evaluation framework that performs fine-grained evaluation of factuality metrics. Methodologically, we (i) first conduct error analysis of existing state-of-the-art factuality metrics, (ii) define effective adversarial transformations based on the results of this error analysis, (iii) construct diagnostic examples by applying the adversarial transformations to test datasets with different distributions and then diagnose existing top-scoring factuality metrics, and (iv) finally show that data augmentation driven by adversarial transformations can increase the diversity of training samples, making factuality metrics more robust and reliable.
Our contributions can be summarized as follows: (1) We identify several representative errors made by existing top-performing factuality metrics (§4.2), suggesting directions for further improvement.
(2) We propose effective adversarial transformations that can be applied either to test sets for model diagnosis (§5) or to training sets for data augmentation (§6.2), by which we further improve the performance of current checkers. (3) We propose a fine-grained meta-evaluation methodology for factuality metrics and re-evaluate existing top-performing metrics to assess their relative strengths and weaknesses. (4) We call for more fine-grained and interpretable meta-evaluation of factuality metrics in future research. As a first step, we release our constructed diagnostic test sets with various characteristics, as well as augmented training data and more robust factuality metrics.

Related Work
Factuality in Text Summarization Recent studies on the factuality of text generation revolve around metric design and system optimization. From the metric perspective, researchers formulate the design of automated factuality metrics as different problems: text entailment over sequential (Kryscinski et al., 2020; Goyal and Durrett, 2021a) or tree (Goyal and Durrett, 2020, 2021a) structures, question answering (Wang et al., 2020a; Durmus et al., 2020), and sequence labeling (Zhao et al., 2020a). Concurrent to our work, Pagnoni et al. (2021a) construct human-annotated test sets for factuality metrics, though with a different typology; moreover, their approach is difficult to use for automatic data augmentation. Other works aim to learn factuality-aware summarization systems, for instance by leveraging open information extraction and dependency parsing (Cao et al., 2018b; Zhu et al., 2020). Chen et al. (2020) explore how factuality metrics are influenced by domain shift and conclude that out-of-domain systems can even surpass in-domain systems in terms of factuality, and that factuality checkers such as FactCC have limited predictive power on positive samples.
Adversarial Evaluation of NLP Systems Adversarial evaluation has been extensively explored in many NLP tasks. Adversarial challenge sets have been introduced for natural language inference (Naik et al., 2018), question answering (Jia and Liang, 2017), machine translation (Burlot and Yvon, 2017) and language modeling (Marvin and Linzen, 2018) to examine system drawbacks. More recently, Gardner et al. (2020) introduce the concept of a "contrast set" and propose to use it to measure the generalization of different NLP systems. Instead of adversarially evaluating an NLP system, we perform an adversarial meta-evaluation of evaluation metrics.
Meta-evaluation for Automated Metrics Meta-evaluation aims to evaluate the reliability of automated metrics based on their correlation with human judgments (Graham, 2015; Peyrard, 2019; Bhandari et al., 2020). Most existing works perform meta-evaluation on metrics that measure semantic equivalence, such as ROUGE (Lin, 2004) and BERTScore (Zhang et al., 2020). Yuan et al. (2021) more recently propose BARTScore and meta-evaluate it from multiple evaluation perspectives. By contrast, in this paper we focus on the evaluation of factuality metrics using our constructed diagnostic test sets. Concurrent with our work, Goyal and Durrett (2021b) and Pagnoni et al. (2021b) also look into the error patterns of existing factuality checkers.

Definition of Factuality
Although researchers have slightly different definitions of factuality (Maynez et al., 2020; Kryscinski et al., 2020), in this paper we consider factuality as how well generated summaries are supported by their source documents without using any external knowledge. A factual error occurs when a generated summary contains salient facts (Kryscinski et al., 2020) that cannot be inferred from the source document. The summary sentences to be verified are also called claims below, consistent with the field of fact verification (Zhou et al., 2019; Schuster et al., 2019; Liu et al., 2020).


Factuality Metrics
There are two major task formulations of factuality metrics: natural language inference (NLI) and question answering (QA). Model types and training data are summarized in Tab. 1.

NLI-based Metrics
NLI-based metrics cast factual consistency as a natural language inference problem: the core idea is to infer whether the facts in a generated summary can be entailed by its source document. The NLI-transferred models in this paper are trained on MNLI (Williams et al., 2018); neutral-class samples are removed from the dataset for fair comparison, following Goyal and Durrett (2020).
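For concreteness, the sketch below shows how such an entailment check can be scored, assuming the publicly available roberta-large-mnli checkpoint (not one of the checkers evaluated in this paper): the source document serves as the premise, the claim as the hypothesis, and the probability of the entailment class is used as the factuality score. Inputs are truncated to 512 subwords, which foreshadows the R7 issue discussed in §4.2.

```python
# Minimal sketch of an NLI-based factuality score; the roberta-large-mnli
# checkpoint is an assumption for illustration, not the models studied here.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-large-mnli"  # label ids: 0=contradiction, 1=neutral, 2=entailment
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_score(document: str, claim: str) -> float:
    """Return P(entailment) of the claim given the (truncated) source document."""
    inputs = tokenizer(document, claim, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 2].item()  # index 2 = entailment

print(entailment_score("A Japanese court issued an injunction halting plans to "
                       "restart two nuclear reactors.",
                       "A Japanese court ordered two nuclear reactors to restart."))
```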

QA-based Metrics
The basic idea behind QA-based metrics is to check whether similar answers are obtained when the same question is asked against a generated summary S and against its source document D.
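A minimal sketch of this idea is given below. It assumes the questions have already been generated from the summary (metrics such as FEQA and QAGS generate them automatically, a step omitted here) and uses an off-the-shelf extractive QA pipeline; answers obtained from the summary and from the document are compared with token-level F1.

```python
# Hedged sketch of a QA-based factuality score; the default SQuAD-tuned QA
# pipeline and the pre-generated questions are illustrative assumptions.
from collections import Counter
from transformers import pipeline

qa = pipeline("question-answering")

def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two answer strings."""
    a_toks, b_toks = a.lower().split(), b.lower().split()
    overlap = sum((Counter(a_toks) & Counter(b_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a_toks), overlap / len(b_toks)
    return 2 * precision * recall / (precision + recall)

def qa_factuality(document: str, summary: str, questions: list) -> float:
    """Average agreement between answers grounded in the summary vs. the document."""
    scores = []
    for question in questions:
        answer_from_summary = qa(question=question, context=summary)["answer"]
        answer_from_document = qa(question=question, context=document)["answer"]
        scores.append(token_f1(answer_from_summary, answer_from_document))
    return sum(scores) / len(scores) if scores else 0.0
```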

Existing datasets for Meta-evaluation
To get a holistic overview of factuality metric performance, we collect four different human judgment datasets (FaccTe, QagsC, RankTe and FaithFact) that can be used to meta-evaluate the correctness of factuality metrics.

Holistic Meta-evaluation

(1) FACTCC achieves the best performance on most of the test sets except FaithFact. The reason is that the claims in this set are highly paraphrased (its novelty is 99.2% in Tab. 2) and thus mislead FACTCC, which is trained on less abstractive claims (CNNDM-G, as shown in Tab. 1).
(2) FEQA underperforms FACTCC most of the time but obtains higher accuracy on FaithFact (see Appendix A.2 for details). (3) With the same pre-trained model (ELECTRA), DAE outperforms MNLI-ELECTRA on FaccTe and QagsC. However, DAE with dependency information does not show consistent superiority over MNLI-ELECTRA across all evaluation sets.

Fine-grained error analysis
Setup and Error Typology To get a more fine-grained understanding of factuality checkers and to gauge the difficulty of the task, we choose FACTCC as the representative factuality checker (for its superior performance, as described in §4.1) and perform error analysis on it. We examine 140 samples that the checker fails to predict correctly in FaccTe and QagsC, and divide the error reasons into the following categories; examples are presented in Tab. 3.
• R1: VANs replacement: the checker has difficulty detecting Verb, Adjective and Noun replacements (e.g., antonym, synonym) and thus produces wrong predictions. Here, noun refers to a noun or noun phrase excluding entities.
• R2: Numerical inference: the checker performs worse when verifying samples that require numerical inference (e.g., dates). Similar results are observed in (Zhao et al., 2020b).
• R3: Entity coreference: a slight change of a person's name, or replacing a pronoun with its referent, misleads the factuality checker, suggesting a lack of entity coreference resolution ability.
• R4: Missing details: when a claim omits detailed information (e.g., a location), the checker tends to predict it as inconsistent even though it is not. This occurs frequently in summarization, where the summarizer extracts only the most important information.
• R5: Paraphrase: more complex paraphrase patterns (e.g., complex reordering, passive-active transformation, sentence fusion) beyond simple token replacement or omission cause the model to make wrong predictions.
• R6: Background knowledge: the checker is fragile when extra knowledge is required.
• R7: Truncate: the checker truncates long documents and ignores evidence sentences in later parts of a document, therefore making wrong judgments.
• R8: Wrong label: incorrectly annotated labels.
• R9: Others: other reasons.

Analysis of Error Reasons
As presented in Tab. 3, VANs replacement and Missing details account for a large proportion of the error reasons. This is because verb, adjective and noun (besides entity) replacements and detail omission are not included in FACTCC's training data. Moreover, misclassifications caused by paraphrase account for 11.8%, which stems from the lack of paraphrase in FACTCC's training data: the only paraphrase pattern is introduced by backtranslation (Edunov et al., 2018). Although entity and number swaps are included in the negative sample construction of Kryscinski et al. (2020), FACTCC still makes wrong predictions on samples requiring entity coreference resolution and numerical inference.

R1: VANs replacement (inco → co), 17.0%
Source: ...Japanese court issued a landmark injunction halting plans to restart two nuclear reactors in a western prefecture...
Claim: japanese court orders to restart two nuclear reactors in a western prefecture.
Claim: ahmed farouq was the deputy emir of al qaeda in the indian subcontinent.

R4: Missing details (co → inco), 31.4%
Source: ...Phil Rudd, the drummer for legendary hard rock band AC/DC, has pleaded guilty to charges of...
Claim: rudd has pleaded guilty to threatening to kill and possession of drugs in a court.

R5: Paraphrase (inco → co), 11.8%
Source: ...A police motorcycle stopped the rest of the pack, before organisers of the 151-mile race slowed the leaders to allow the pack to catch up...
Claim: Leaders of the tour de france were stopped by police as they crossed a railway line to avoid a train.

R6: Background knowledge (co → inco), 12.4%
Source: Scientists from harvard medical school have discovered a way of turning stem cells into killing machines...
Claim: Scientists in the us have developed a stem cell therapy for brain tumours.
Source: These days we are increasingly using outdoor space for the occasional barbecue or to relax in a hot tub rather than for tending flowers.
Claim: these days we are increasingly using outdoor space for tending flowers.

Table 3: Error reasons with their corresponding examples and ratios. The bold span corresponds to the error reason; co → inco means the gold label is factually correct while the checker misclassifies it as factually incorrect (inco → co means the opposite); [>512] means there are more than 512 subwords before this position.

Construction of Diagnostic Set
It is not realistic to produce large-scale human-annotated test sets covering the multiple error reasons observed above. As a consequence, prior work (Hidey et al., 2020; Naik et al., 2018) constructs diagnostic test sets automatically. In this section, we first introduce automatic rule-based transformation methods based on the error analysis (§5.1), and then construct 24 diagnostic test sets based on three types of base test sets.

Adversarial Transformations
We introduce four types of automatic transformation methods corresponding to the error reasons R1-R4 in the error analysis (§4.2). Paraphrasing (R5) is not included here because it is hard to produce with simple rules; instead, we introduce it in another way, by using gold references as claims (§5.2). The remaining four error reasons are either too hard for models (R6, R9), correspond to a systematic error (R7), or stem from annotation error (R8), and are also not included. Examples of the adversarial transformations are shown in Tab. 4.

R1: Antonym Substitution
We first use Stanza (Qi et al., 2020) for part-of-speech tagging and then use WordNet, wrapped in the NLTK package (Bird et al., 2009), to find antonyms for verbs and adjectives. Negative samples are produced by replacing the original word with one of its antonyms. We do not include synonym replacement because simply replacing a word with a synonym can still introduce factual errors and make the gold label ambiguous.
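The sketch below illustrates this transformation, assuming Stanza for POS tagging and NLTK's WordNet interface for antonym lookup as described above (the word-selection heuristics of the released implementation may differ); stanza.download("en") and nltk.download("wordnet") must be run once beforehand.

```python
# Sketch of the R1 (Antonym Substitution) transformation: replace the first
# verb or adjective that has a WordNet antonym to create a negative sample.
import stanza
from nltk.corpus import wordnet as wn

nlp = stanza.Pipeline("en", processors="tokenize,pos")
UPOS_TO_WN = {"VERB": wn.VERB, "ADJ": wn.ADJ}

def antonym_substitution(claim):
    doc = nlp(claim)
    for sentence in doc.sentences:
        for word in sentence.words:
            wn_pos = UPOS_TO_WN.get(word.upos)
            if wn_pos is None:
                continue
            for synset in wn.synsets(word.text, pos=wn_pos):
                for lemma in synset.lemmas():
                    if lemma.antonyms():
                        antonym = lemma.antonyms()[0].name().replace("_", " ")
                        return claim.replace(word.text, antonym, 1)  # negative sample
    return None  # no verb/adjective with an antonym found

print(antonym_substitution("the court halted plans to restart two nuclear reactors ."))
```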
R2: Numerical Editing As §4.2 shows, FACTCC performs worse when numerical reasoning is needed to derive the result, which motivates us to design a numerical adversarial transformation. Specifically: (1) to produce negative samples, we replace a numerical entity with a randomly chosen entity of the same type from the source document, guaranteeing that the transformed claim differs from the original; we also add a preposition (e.g., "after") before date and time entities, and "more than" or "less than" before other types of numerical entities. (2) For positive samples, we change the number or date and add "before", "after", "more than" and "less than" appropriately (e.g., "in 2019" becomes "two years before 2021"). Compared with Kryscinski et al. (2020), we include more complex negative and positive transformations for numerical inference.
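As a rough illustration of the negative half of this transformation, the sketch below uses spaCy NER (the entity tagger actually used in the paper is not specified here) to swap a numerical entity in the claim with a different entity of the same type taken from the source document; the positive transformations and the added prepositions are omitted.

```python
# Hedged sketch of R2 (Numerical Editing), negative samples only; spaCy and
# en_core_web_sm (python -m spacy download en_core_web_sm) are assumptions.
import random
import spacy

nlp = spacy.load("en_core_web_sm")
NUMERIC_TYPES = {"DATE", "TIME", "CARDINAL", "MONEY", "QUANTITY", "PERCENT", "ORDINAL"}

def numerical_edit(document, claim):
    """Return a corrupted claim, or None if no valid same-type swap exists."""
    document_entities = [e for e in nlp(document).ents if e.label_ in NUMERIC_TYPES]
    for entity in nlp(claim).ents:
        if entity.label_ not in NUMERIC_TYPES:
            continue
        candidates = [e.text for e in document_entities
                      if e.label_ == entity.label_ and e.text != entity.text]
        if candidates:
            # Guarantee the transformed claim differs from the original.
            return claim.replace(entity.text, random.choice(candidates), 1)
    return None
```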
R3: Entity Replacement During error analysis, we discover that FACTCC fails to understand the equivalence between named entities referring to the same entity. As shown in the EntRep examples of Tab. 4, positive samples replace an entity mention with a coreferent mention (e.g., a shortened name), while negative samples replace it with a different entity:

R3: EntRep
pos: actor isaiah washington → isaiah tweeted : ' okay , watching the #walterscott video was horrible , but i think the brave person who captured the murder is a hero and a godsend #truthdom . '
neg: actor isaiah washington → michelle williams tweeted : ' okay , watching the #walterscott video was horrible , but i think the brave person who captured the murder is a hero and a godsend #truthdom . '
R4: SynPrun
prepo.: the queen and the duke of edinburgh appeared in good spirits as they arrived to a red carpet at the event .
clause: the mystery hero who raced to the edge of a cliff and pulled a driver from his precariously-balanced car has been identified as a 29-year-old man who fled the scene to go to work .

R4: Syntactic Pruning Syntactic pruning is used to produce positive examples with details omitted. Instead of dependency parsing, we choose constituency parsing to decompose the summary sentence, since it is better suited to capturing clauses and phrases. To produce positive examples, clauses with labels "S" and "SBAR" and prepositional phrases with label "PP" are deleted, based on the assumption that removing a sub-clause does not affect factual consistency.
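The pruning step can be sketched as follows, assuming a bracketed constituency parse of the claim is already available (e.g., from an off-the-shelf parser); only "SBAR" and "PP" nodes are dropped in this simplified version, while the transformation described above also handles embedded "S" clauses.

```python
# Sketch of R4 (Syntactic Pruning): rebuild the constituency tree without
# sub-clauses (SBAR) and prepositional phrases (PP) to obtain a positive,
# detail-omitted claim. The bracketed parse string is an illustrative input.
from nltk import Tree

PRUNE_LABELS = {"SBAR", "PP"}

def prune(tree):
    children = []
    for child in tree:
        if isinstance(child, Tree):
            if child.label() in PRUNE_LABELS:
                continue  # drop the whole sub-clause / prepositional phrase
            children.append(prune(child))
        else:
            children.append(child)
    return Tree(tree.label(), children)

parse = ("(S (NP (DT the) (NN queen)) "
         "(VP (VBD arrived) (PP (IN at) (NP (DT the) (NN event)))) (. .))")
pruned = prune(Tree.fromstring(parse))
print(" ".join(pruned.leaves()))  # -> "the queen arrived ."
```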

Diagnostic Datasets
We construct 24 diagnostic datasets based on three types of base test sets. Besides using only sentences from the source document (DocAsClaim) as input to the transformations, as previous work does (Kryscinski et al., 2020), we propose two further types of base test sets: gold summaries (RefAsClaim) and generated summaries (FaccTe, QagsC, RankTe and FaithFact). The reasons are: (i) the diagnostic set constructed from reference summaries corresponds to error reason R5 in §4.2 and is a more challenging test set for factuality checkers due to its more complex paraphrase patterns; (ii) the distribution of generated summaries is closer to that of the summaries verified by factuality checkers in real scenarios (e.g., summaries generated by BART). In total we obtain 6 base test sets and 24 diagnostic test sets (4 adversarial transformations applied to every base test set). We have released the datasets in our GitHub repository; detailed information is included in the appendix.

Quality Examination
To explore the reliability of the automatically generated diagnostic test sets, we conduct a human examination of whether the generated claims are grammatically correct and maintain the correct label, carried out on 50 randomly chosen samples for each type of adversarial transformation. The results show that the diagnostic sets are largely grammatically correct (ratio around 85%) and possess correct factuality labels (ratio higher than 90%).

Results on Diagnostic Test Sets

Antonym Substitution Nearly all factuality checkers perform worse on the AntoSub diagnostic sets, as Tab. 5 shows (nearly all entry values of the AntoSub columns are negative). However, FEQA and FACTCC obtain a clear performance improvement on the AntoSub diagnostic set of RefAsClaim. This is because claims in the original RefAsClaim set are highly paraphrased, which misleads the checkers into producing negative labels and lowers accuracy; since Antonym Substitution introduces factually inconsistent samples, model performance improves instead. Models transferred from MNLI, as well as DAE, are more robust to samples with highly paraphrased claims.
Numerical Editing Nearly all factuality checkers perform worse under the NumEdit transformation (almost all values in the NumEdit columns of Tab. 5 are negative). Even FACTCC is no exception, although it may possess some numerical inference ability. This emphasizes the importance of improving numerical inference ability for factuality checkers. However, FEQA and FACTCC perform better on the NumEdit diagnostic set of RefAsClaim because the numerical editing transformation introduces more negative samples (the reason is similar to that described above).
Entity Replacement Similar to Numerical Editing, the entity replacement transformation also tends to mislead the six factuality checkers, as nearly all values in the EntRep columns of Tab. 5 are negative. Although FACTCC is trained with data that also includes an entity replacement transformation, it still performs worse on the EntRep diagnostic test sets of QagsC and RankTe, which implies the incompleteness of the entity replacement in Kryscinski et al. (2020). The models show the same pattern as for AntoSub when tested on the EntRep diagnostic sets of RefAsClaim, for similar reasons as described above.
Syntactic Pruning The SynPrun diagnostic test sets lead to a larger performance drop when the base test sets are RefAsClaim and FaccTe, as the last columns of these subtables contain more negative values. This transformation is more confusing when the claims are highly paraphrased.
As observed in Tab. 5, models transferred from the MNLI dataset, as well as DAE, are more robust when syntactic pruning is introduced, while FACTCC and FEQA are consistently misled by the SynPrun diagnostic test sets. This can be attributed to the lack of highly paraphrased claims in the FACTCC training set. DAE extracts dependency triples from the summary and makes predictions based on them, and is thus more robust when evaluated on the SynPrun diagnostic sets. As for the models transferred from MNLI, the MNLI training set may already contain patterns of detail omission, so the trained models are able to recognize them.
Takeaways (1) Most factuality checkers perform poorly on the AntoSub and NumEdit diagnostic sets, which suggests that current factuality metrics are not reliable when dealing with antonym substitution and numerical editing samples. (2) FACTCC can handle entity replacement diagnostic sets to some extent, but cannot maintain that performance consistently over all EntRep sets. (3) MNLI-BERT, MNLI-RoBERTa, MNLI-ELECTRA and DAE deal more reliably with highly paraphrased claims and are more robust to the syntactic pruning transformation.

Data Augmentation
Besides being used to construct test sets, adversarial transformations can also be used to create additional training data, i.e., for data augmentation, to improve model performance. Here we choose FACTCC for adversarial training due to its excellent performance in §4.1.
As the original training data of FACTCC has more than 100 million samples, we first subsample 50 million samples to train FACTCC_sub. We then add 34,912 adversarial training samples to the subsampled set and train another checker, FACTCC_adv_sub. We also investigate whether introducing references as claims into the training set enhances model performance: we include references as claims and apply the negative transformations of Kryscinski et al. (2020).
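A hedged sketch of the augmentation step is shown below: adversarially transformed claims are added to the sub-sampled training set before fine-tuning. The data format and the label convention (1 = consistent) are assumptions for illustration; the transformation functions would be implementations of the R1-R4 rules above.

```python
# Sketch of adversarial data augmentation for a FactCC-style checker; the
# (transform, label) pairs and the field names are illustrative assumptions.
import random

def augment(examples, transforms):
    """examples: dicts with 'document', 'claim', 'label'; transforms: (fn, label)
    pairs, where fn(document, claim) returns a transformed claim or None."""
    augmented = list(examples)
    for example in examples:
        for transform, new_label in transforms:
            new_claim = transform(example["document"], example["claim"])
            if new_claim is not None:
                augmented.append({"document": example["document"],
                                  "claim": new_claim,
                                  "label": new_label})
    random.shuffle(augmented)
    return augmented
```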

Implications and Future Directions
In this paper, we present an adversarial meta-evaluation methodology driven by fine-grained analysis, which not only allows us to re-evaluate existing top-performing factuality metrics and diagnose their limitations, but also instructs us on how to further improve current metrics through data augmentation. Based on what we have explored and observed in this work, we suggest the following potentially promising future directions: (1) Knowledge-guided factuality metrics: one error reason in §4.2 is the lack of ability to reference external knowledge. It would be promising to explore the effectiveness of external knowledge such as knowledge bases (Bordes et al., 2013) or citation graphs (Lo et al., 2020) (for scientific summarization).
(2) Long document modeling: most summarization documents are longer than 512 subwords, which poses a great challenge for pre-training-based factuality metrics (R7 in §4.2). Methodologies such as first retrieving evidence and then verifying (Zhou et al., 2019) should be put forward to deal with this problem.
(3) Fine-grained meta-evaluation and more diverse human judgments: to reliably evaluate factuality metrics, human judgments over diverse distributions are needed. Moreover, fine-grained meta-evaluation of metrics is beneficial for further identifying their drawbacks and suggesting future directions.

A.1 Implementation Details

NLI-transferred models We train three NLI-transferred models (MNLI-BERT, MNLI-RoBERTa and MNLI-ELECTRA) on the MNLI dataset (Williams et al., 2018), with neutral-label samples removed for fair comparison. Every model is trained on 4 TITAN Xp GPUs for 15 epochs. We choose AdamW as the optimizer and set the learning rate to 2e-5. The training batch size per GPU is 8. The code and trained checkpoints can be found at https://github.com/zide05/AdvFact.

FEQA The trained FEQA model of Durmus et al. (2020) is used in this paper; its checkpoints and code can be found at https://github.com/esdurmus/feqa.
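For reference, a minimal sketch of the NLI-transferred training setup described above (MNLI with neutral samples removed, AdamW, learning rate 2e-5, batch size 8 per GPU, 15 epochs) is given below; google/electra-base-discriminator stands in for the ELECTRA variant, and the exact preprocessing of the released checkpoints may differ.

```python
# Hedged sketch of training an MNLI-transferred checker with the stated
# hyperparameters; the checkpoint name and preprocessing are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "google/electra-base-discriminator"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# MNLI labels: 0=entailment, 1=neutral, 2=contradiction. Drop neutral and
# relabel contradiction as 1 so the task is binary (entailed vs. not).
mnli = load_dataset("multi_nli", split="train").filter(lambda x: x["label"] != 1)

def preprocess(example):
    encoded = tokenizer(example["premise"], example["hypothesis"],
                        truncation=True, max_length=512)
    encoded["label"] = 0 if example["label"] == 0 else 1
    return encoded

train_set = mnli.map(preprocess)

args = TrainingArguments(output_dir="mnli_electra",
                         learning_rate=2e-5,
                         per_device_train_batch_size=8,
                         num_train_epochs=15,
                         optim="adamw_torch")  # AdamW optimizer
Trainer(model=model, args=args, train_dataset=train_set,
        tokenizer=tokenizer).train()
```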

A.2 Experimental Results
Detailed information for baseline and diagnostic datasets We introduce the basic information for the baseline datasets in Tab. 7; more detailed statistics for the baseline and diagnostic datasets are displayed in Tab. 11.

Detailed holistic meta-evaluation The following conclusions can be drawn from the holistic meta-evaluation results in Fig. 2: (1) FACTCC achieves the best performance on most of the test sets except FaithFact. The reason is that the claims in this set are highly paraphrased (its novelty is 99.2% in Tab. 7) and thus mislead FACTCC, which is trained on less abstractive claims (CNNDM-G, as shown in Tab. 10).
(2) FEQA underperforms FACTCC most of the time. On FaithFact, however, FEQA obtains higher accuracy: because the claims in FaithFact are highly paraphrased, FEQA tends to label samples as factually inconsistent, and since negative samples account for 92% of FaithFact, this tendency toward negative labels improves FEQA's accuracy.
(3) With the same pre-trained model (ELECTRA), DAE outperforms MNLI-ELECTRA on FaccTe and QagsC. However, DAE with dependency information does not show consistent superiority over the NLI-based MNLI-ELECTRA across all evaluation sets, performing especially poorly on FaithFact. Opposite to FEQA, DAE averages the factuality scores of all dependency arc triples as the claim-level factuality score, which biases it toward the factually correct label; it therefore obtains lower accuracy on test sets with more negative samples.

Table 9: Quality examination of the four diagnostic evaluation sets. "CoLabel" and "CoGrammar" represent the correctness rate of the automatically generated labels and grammar, respectively.

Table 10: The model types and training data of the factuality metrics. NLI-A and NLI-S represent NLI-based metrics defining facts as dependency arcs and spans, respectively. PARANMT-G and CNNDM-G denote the automatically generated training data from PARANMT (Wieting and Gimpel, 2018) and CNN/DailyMail (Nallapati et al., 2016).