TRUE: Re-evaluating Factual Consistency Evaluation

Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets, we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better methods.


Introduction
A core issue in deploying text generation models for real-world applications is that they often generate text that is factually inconsistent with respect to the input they are conditioned on, or even completely "hallucinate" (Lee et al., 2018; Rohrbach et al., 2018; Maynez et al., 2020; Zhao et al., 2020), as exemplified in Table 1.
Table 1: Factual inconsistencies (in red) from various tasks which are part of the TRUE study. The corresponding parts in the input/grounding are in blue.
To tackle such inconsistencies, one would like to detect them automatically by predicting whether a generated text is factually consistent with respect to a grounding text (frequently referred to as the "input" or the "knowledge"). Such automatic methods attract increasing attention (Zhou et al., 2021; Deng et al., 2021), as they enable both better evaluation and better generation models, by automatically filtering training data (Gehrmann et al., 2021) or by augmenting training data for controlled generation (Rashkin et al., 2021b).
While automatically evaluating factual consistency is an active line of work, there is no single agreed-upon meta-evaluation protocol for measuring the quality of such methods, and labeling schemes vary in their granularity. Works are usually done in silo, introducing new datasets and methods that target a specific task or domain, such as summarization (Falke et al., 2019; Kryscinski et al., 2020; Wang et al., 2020; Scialom et al., 2021; Deutsch et al., 2021; Xie et al., 2021) or dialogue (Dziri et al., 2021; Honovich et al., 2021; Nie et al., 2021; Qin et al., 2021). Comparing the robustness of such methods across tasks and datasets is therefore difficult, impeding progress on this subject.

arXiv:2204.04991v3 [cs.CL] 3 May 2022
In this work, we present TRUE: a comprehensive survey and assessment of factual consistency evaluation methods, covering various metrics, tasks and datasets. We consolidate 11 existing datasets annotated for factual consistency into a unified format, consisting of pairs of a target text and a grounding source text, with a binary annotation of whether the target text is factually consistent w.r.t its source. TRUE covers summarization, knowledge-grounded dialogue, paraphrasing and fact verification. The proposed standardization enables us to properly compare consistency evaluation methods in a robust manner across these various tasks and domains.
Previous works on automatic factual consistency evaluation have mainly focused on measuring system-level correlations of the proposed metrics with human judgements (Pagnoni et al., 2021). Yet, these correlations are not useful for estimating the performance of a metric when making example-level, binary decisions, decoupled from specific system implementations (see the recent discussion by Deutsch et al. (2022) on the limitations of reporting correlations). Instead, we aim to measure how well a method detects inconsistent texts (recall) and how often it falsely disregards consistent texts (precision), which can be easily computed using the aforementioned binary labeling scheme. Therefore, as a meta-evaluation protocol, we report the Area Under the ROC Curve (ROC AUC) with respect to inconsistent example detection for each evaluation metric and dataset.
Our thorough survey and assessment of 12 metrics draws a clearer picture of the state of evaluating factual consistency. We show that Natural Language Inference (NLI) approaches, as well as Question Generation and Answering (QG-QA) approaches, achieve significantly better results on a wide variety of tasks and datasets. We also show that NLI and QG-QA are complementary: combining the two yields even better results and hints at room for further improvement. Finally, we perform both quantitative and qualitative analysis of our results, finding that all approaches struggle with long inputs, labeling issues and personal statements, paving interesting avenues for future work.
To summarize, our contributions are as follows: (1) We argue that work on factual consistency evaluation should be unified and generalized across tasks, and standardize 11 published datasets into a single labeling scheme to corroborate this. (2) We propose a meta-evaluation protocol that allows more actionable and interpretable quality measures than previously reported correlations. (3) We survey and evaluate 12 diverse metrics in this unified perspective, showing that large-scale NLI- and QG-QA-based approaches achieve strong and complementary results across tasks. (4) We analyze our results both qualitatively and quantitatively, pointing at challenges like long inputs and personal statements to be addressed in future work.

Standardizing Factual Consistency
In this section we elaborate on our re-evaluation setup. We first formally define what factual consistency refers to in this work. We then detail the datasets we consider and how we standardize them. Finally, we discuss the meta-evaluation protocol we propose for measuring the performance of evaluation methods on the standardized datasets.

Definitions and Terminology
We define a text to be factually consistent w.r.t its grounding text if all the factual information it conveys is consistent with the factual information conveyed by the grounding text. While some previous works distinguished between inconsistent text that is erroneous and inconsistent text that happens to be correct (Maynez et al., 2020), we take a strict approach, requiring the text to be faithful to its grounding text regardless of its "correctness" w.r.t the "real world". In other words, we consider only the information present in the input text, not external knowledge, to assess faithfulness. This enables a more well-defined task, since determining the truthfulness of a fact w.r.t a general "real world" is subjective and depends on the knowledge, values and beliefs of the subject (Heidegger, 2001). This definition follows similar strictness in Textual Entailment, Question Answering, Summarization and other tasks where comprehension is based on a given grounding text, irrespective of contradiction with other world knowledge. This is also in line with recent work on evaluating attribution in text generation (Rashkin et al., 2021a), where humans are required to judge whether a generated text is attributable to a grounding text. We use the terms consistent, grounded, faithful and factual interchangeably.

Table 2 (statistics for the datasets in TRUE; Cons. is the ratio of consistent examples):

Task / Dataset                        # Examples   Open Test   Cons.
Summarization
  FRANK (Pagnoni et al., 2021)               671           +   33.2%
  SummEval (Fabbri et al., 2021a)          1,600           -   81.6%
  MNBM (Maynez et al., 2020)               2,500           -   10.2%
  QAGS-CNNDM (Wang et al., 2020)             235           -   48.1%
  QAGS-XSum (Wang et al., 2020)              239           -   48.5%
Dialogue
  BEGIN (Dziri et al., 2021)                 836           +   33.7%
  Q 2 (Honovich et al., 2021)              1,088           -   57.7%
  DialFact (Gupta et al., 2021)            8,689           +   38.5%
Fact Verification
  FEVER (Thorne et al., 2018)             18,209           -   35.1%
  VitaminC (Schuster et al., 2021)        63,054           +   49.9%
Paraphrasing
  PAWS (Zhang et al., 2019)                8,000           +   44.2%

Standardization Process
We include 11 datasets that contain human annotations w.r.t factual consistency in diverse tasks (Table 2). Besides covering a wide variety of error types, this also alleviates issues of rating quality, which may vary across datasets (Denton et al., 2021).
To allow a unified evaluation framework, we convert all annotations to binary labels that correspond to whether the entire target text is factually consistent w.r.t the given grounding text or not. We note that a fine-grained annotation scheme, i.e., a typology of errors, was proposed for factual consistency (Pagnoni et al., 2021). While useful, most existing datasets do not include such labels. Moreover, while Machine Translation (MT) evaluation also showed value in fine-grained annotations (Freitag et al., 2021), these were proposed after years of improving MT to the level where coarse-grained annotation is insufficient. We argue that current grounded generation models are still at early stages w.r.t factual consistency, making binary labeling more beneficial for now, as it enables easier standardization across tasks and domains, with the goal of bringing researchers to collaborate on a shared methodology. Binary annotation also corresponds to practical applications where filtering out unfaithful predictions is desired, and is in line with the recommendations for human evaluation of attribution in text generation by Rashkin et al. (2021a).
We next detail the 11 datasets included in TRUE.

Abstractive Summarization
FRANK Pagnoni et al. (2021) proposed a typology of factual errors, grounded in frame semantics (Fillmore, 1976; Palmer et al., 2005) and linguistic discourse theory (Brown and Yule, 1983). Based on this typology, they collected annotations for model-generated summaries on the CNN/DailyMail (CNN/DM; Hermann et al., 2015) and XSum (Narayan et al., 2018) datasets, resulting in 2,250 annotated system outputs. Each summary sentence was annotated by three annotators. We take the majority vote for each sentence to get a sentence-level label, and consider a summary consistent only if all of its sentences are consistent.
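As an illustrative sketch (not taken from the paper's code; the function names are ours), the label aggregation described above, majority vote per sentence, then "all sentences consistent" per summary, can be written as:

```python
from collections import Counter

def sentence_label(votes):
    # Majority vote over per-sentence annotator judgments (True = consistent).
    return Counter(votes).most_common(1)[0][0]

def summary_label(per_sentence_votes):
    # A summary is labeled consistent only if every sentence's majority label is consistent.
    return all(sentence_label(votes) for votes in per_sentence_votes)
```

The same aggregation also applies to the QAGS conversion described below.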
SummEval (Fabbri et al., 2021a) is a comprehensive study of evaluation metrics for text summarization. The authors collected human judgments for 16 model outputs on 100 articles taken from the CNN/DM dataset, using both extractive and abstractive models. Annotators were asked to rate summaries on a Likert scale from 1 to 5 over 4 dimensions: consistency, coherence, fluency and relevance. Each summary was scored by 5 crowd-workers and 3 expert annotators. We label summaries as consistent only if all the expert annotators gave a consistency score of 5.
MNBM Maynez et al. (2020) annotated system outputs for the XSum dataset (Narayan et al., 2018). They sampled 500 articles and annotated summaries generated by four different systems, as well as the gold summaries. Annotators were asked to assess whether the summary includes hallucinations. Judgments from three different annotators were collected for each document-summary pair.
To convert to a binary-label format, we use the binary consistency decision of whether a summary contains no hallucinations, and assign a label by taking the majority vote of the three annotators.
QAGS Wang et al. (2020) collected judgments of factual consistency on generated summaries for CNN/DM and XSum. Annotators were presented with the summaries one sentence at a time, along with the article, and determined whether each sentence is factually consistent w.r.t the article. Each sentence was annotated by 3 annotators, with the majority vote taken as the final score. To convert to a binary-label format, we consider a summary consistent only if all its sentences are consistent.

Dialogue Generation
BEGIN (Dziri et al., 2021) is a dataset for evaluating groundedness in knowledge-grounded dialogue systems, in which system outputs should be consistent with the grounding knowledge provided to the dialogue agent. BEGIN frames the task as textual entailment (Dagan et al., 2006; Bowman et al., 2015), adopting the entailment and contradiction labels and splitting the neutral label into three sub-categories: hallucination, off-topic responses and generic responses. Dialogue responses were generated by fine-tuning two systems on the Wizard of Wikipedia (WOW) dataset (Dinan et al., 2019), in which responses should be grounded in a span of text from Wikipedia. The generated responses were split into sentences, and each sentence was annotated separately. To convert to a binary-label format, we treat entailed sentences as consistent and all others as inconsistent.
Q 2 Honovich et al. (2021) annotated 1,088 generated dialogue responses for binary factual consistency w.r.t the knowledge paragraph provided to the dialogue model, for two dialogue models trained on WOW. Responses were annotated with binary labels by 3 of the paper authors, one annotator per response. We use Q 2 's labels without changes.
DialFact Gupta et al. (2021) introduced the task of fact verification in dialogue and constructed a dataset of conversational claims paired with pieces of evidence from Wikipedia. They define three tasks: (1) detecting whether a response contains verifiable content, (2) retrieving relevant evidence, and (3) predicting whether a response is supported by the evidence, refuted by the evidence, or whether there is not enough information to determine this. We use the verifiable (i.e., factual, rather than personal) responses annotated for the third task, treating supported annotations as consistent and the rest as inconsistent. In cases where several evidence sentences were marked as required for verification, we concatenate all evidence sentences into a single grounding text.

Paraphrasing

PAWS (Zhang et al., 2019) is a dataset of sentence pairs with binary judgments of whether one sentence is a paraphrase of the other. We note that the definition of paraphrase is not equivalent to the definition of factual consistency, as a subset of a source text is not a paraphrase but may still be factually consistent with the source. However, PAWS was constructed such that non-paraphrases usually have contradicting meanings, and is therefore relevant.

Meta-Evaluation
Previous work on evaluating factual consistency focused on measuring correlation with human judgements (Pagnoni et al., 2021) to compare different metrics. However, such system-level numbers are not very informative when one is interested in the absolute performance of inconsistency detection methods that make a binary decision w.r.t each input. Deutsch et al. (2022) also recently discuss various issues in measuring system-level correlations to assess the validity of automatic evaluation metrics for summarization.
To conduct a more fine-grained evaluation at the single-example level, we report the Receiver Operating Characteristic Area Under the Curve (ROC AUC) w.r.t binary detection of inconsistent examples. The ROC curve is created by plotting the true positive rate (TPR, a.k.a. recall) against the false positive rate (FPR, a.k.a. fallout) at different possible thresholds for each tested metric. Measuring ROC AUC evaluates the different metrics without setting a specific decision threshold.
For datasets with an existing development/test split, we also tune a threshold for the binary consistency/inconsistency decision on the development set and report the test-set accuracy using this threshold. We tune the thresholds by maximizing the geometric mean of the TPR and 1 − FPR: sqrt(TPR * (1 − FPR)).
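A self-contained sketch of this meta-evaluation protocol, with a rank-based (Mann-Whitney) AUC and the geometric-mean threshold rule (the function names are illustrative, not from the paper's code):

```python
import math

def roc_auc(scores, labels):
    # Mann-Whitney formulation of ROC AUC: the probability that a consistent
    # example (label 1) outranks an inconsistent one (label 0), with ties
    # counted as half. Higher score is assumed to mean "more consistent".
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tune_threshold(scores, labels):
    # Pick the decision threshold maximizing the geometric mean
    # sqrt(TPR * (1 - FPR)) on a development set.
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_g = None, -1.0
    for t in sorted(set(scores)):
        tpr = sum(s >= t and y == 1 for s, y in zip(scores, labels)) / n_pos
        fpr = sum(s >= t and y == 0 for s, y in zip(scores, labels)) / n_neg
        g = math.sqrt(tpr * (1 - fpr))
        if g > best_g:
            best_t, best_g = t, g
    return best_t
```

In practice one would use an off-the-shelf implementation (e.g., scikit-learn's `roc_auc_score`); the sketch only makes the protocol concrete.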

Evaluation Metrics
We compare various standard as well as state-of-the-art approaches to measuring factual consistency. This comparison should draw a clear picture of current research on this subject and raise directions for future work. For example, we expect robust metrics to perform well across various tasks and datasets. We next describe the different metrics we assess as part of this study. We note that for all reference-based metrics, we use the grounding text as the reference. For metrics whose scores are not in the [0, 1] range, we normalize the scores to be in that range.
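The paper does not specify its normalization scheme; a simple min-max rescaling, shown below as one possible choice, maps raw scores to [0, 1]. Note that any monotone rescaling leaves ROC AUC unchanged, so the normalization matters only for interpretability and for combining metrics.

```python
def minmax_normalize(scores):
    # Rescale raw metric scores linearly into [0, 1].
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Degenerate case: all scores equal; 0.5 is an arbitrary convention.
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]
```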

N-Gram Based Metrics
Standard n-gram matching metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and token-level F1 were shown to correlate weakly with factual consistency (Maynez et al., 2020; Honovich et al., 2021). We add them as baselines to this study mainly to corroborate this claim on a wide set of datasets and tasks.
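For concreteness, the token-level F1 baseline can be computed as follows (an illustrative implementation treating the grounding as the reference, per the note above; not the paper's code):

```python
from collections import Counter

def token_f1(target, grounding):
    # Token-level F1 between the target text and the grounding ("reference").
    t, g = target.lower().split(), grounding.lower().split()
    # Multiset intersection counts shared tokens with multiplicity.
    overlap = sum((Counter(t) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(t)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```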

Model-Based Metrics
BERTScore (Zhang et al., 2020) aggregates similarity scores between the BERT contextual embedding of tokens in candidate and reference sentences.
We report results for the BERTScore-precision variant, as it showed better results in preliminary experiments. We use BERTScore version 0.3.11 with the DeBERTa-xl-MNLI model (He et al., 2021; Nangia et al., 2017), which is the recommended model as of the time of writing this paper.

BLEURT (Sellam et al., 2020a,b) is a learned metric based on BERT (Devlin et al., 2019) for evaluating text generation. BLEURT includes additional pretraining on synthetic data, followed by fine-tuning on human judgements, to train a model that scores system outputs. We use the recommended BLEURT-20 checkpoint (Pu et al., 2021).

FactCC (Kryscinski et al., 2020) is a BERT-based metric for verifying the factual consistency of summaries. It is trained on synthetically generated data, obtained by applying rule-based transformations to generate consistent and inconsistent summaries.
CTC (Deng et al., 2021) measures factual consistency as the degree of token-level information alignment between the generated text and the grounding, using alignment models trained on synthetically constructed data.

SummaC (Laban et al., 2021) applies NLI between sentences of the grounding and of the generated text and aggregates the resulting entailment scores; we use the zero-shot variant, denoted SC ZS .

ANLI We fine-tune T5-11B on the Adversarial NLI dataset (Nie et al., 2020) and use it to predict whether the grounding text entails the generated text, treating this as the consistency score (see Appendix B for details).

QG-QA Based Metrics

The steps of the QG-QA approach are as follows: (1) Questions are automatically generated for spans in the generated text, such that the answer to each question is its respective span. (2) The generated questions are answered by a QA model over the grounding text, resulting in an answer span or a "no-answer" output. (3) For each question, the two answer spans, from the grounding and from the generated text, are compared to produce a score. (4) The scores for all questions are aggregated into a final score.
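The four steps can be sketched as a pipeline over pluggable components; in the actual metrics these are model-backed (e.g., T5 in our Q 2 re-implementation), while here they are plain callables for illustration:

```python
def qg_qa_score(generated_text, grounding, qg, qa, compare):
    # (1) qg: generated text -> list of (question, answer span in the generated text)
    # (2) qa: answer each question against the grounding (may return None for "no answer")
    # (3) compare: score the pair of answer spans for each question
    # (4) aggregate the per-question scores (mean here; other choices are possible)
    pairs = qg(generated_text)
    scores = [compare(gen_ans, qa(q, grounding)) for q, gen_ans in pairs]
    return sum(scores) / len(scores) if scores else 1.0
```

The stubs below stand in for trained QG/QA/comparison models solely to show the data flow; real scores come from the models described in this section.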
Q 2 (Honovich et al., 2021) is a QG-QA method that employs an NLI model to compare the two answers for each question, where the grounding-text answer is the premise and the generated-text answer is the hypothesis. We report results for a re-implementation of Q 2 using T5-11B as the backbone for the QG, QA and NLI models. While Honovich et al. (2021) validate each generated question by answering it using a QA model and comparing the answer to the original extracted answer candidate using exact match, we relax this and instead use F1 token overlap with a predefined threshold.

QuestEval (Scialom et al., 2021) is a QG-QA method that measures both factual consistency and relevance (by reversing the roles of the generated and grounding texts). The authors trained a model that weights each generated question according to the relevance of its answer appearing in the generated text. Their results showed high correlation with human judgments in comparison to prior work on the SummEval benchmark (Fabbri et al., 2021a). We use the publicly available version.
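The relaxed question-validation step can be sketched as follows; the 0.54 threshold is the value reported in the implementation details (Appendix B), while the helper names are ours:

```python
from collections import Counter

def f1_overlap(a, b):
    # Token-level F1 between two answer strings.
    ta, tb = a.lower().split(), b.lower().split()
    common = sum((Counter(ta) & Counter(tb)).values())
    if common == 0:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)

def is_valid_question(predicted_answer, candidate_answer, threshold=0.54):
    # Keep a generated question only if a QA model's answer over the generated
    # text overlaps the original candidate span (relaxing exact match to F1).
    return f1_overlap(predicted_answer, candidate_answer) >= threshold
```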

Results
We report the ROC AUC of the various metrics on the standardized datasets in Table 3. The ROC curves can be found in Figure 2 in the appendix. SC ZS was trained on VitaminC, which includes examples from FEVER, so we exclude those datasets from the average AUC calculation for a fairer comparison. As all metrics operate in a "zero-shot" manner on all datasets (except for SC ZS on VitaminC and FEVER) and no threshold tuning is required, we report results on the development sets. The results show that the NLI-based models (ANLI, SC ZS ) and the QG-QA-based Q 2 achieved the best results, while the next-best metric had a lower average AUC of 72.2. All other approaches scored 72 or lower on average across all datasets (excluding FEVER and VitaminC). As expected, the simple token-matching metrics did not perform well; for completeness, we report their performance in Table 9 in the appendix.
We keep the F1 score in Table 3 for convenient comparison with the other metrics. One outlier is BEGIN, the only dataset where simple metrics like F1 token overlap achieved scores higher than 80. We measured the average overlap between the grounding and target texts per dataset, and found that BEGIN exhibits a large gap in overlap between grounded and ungrounded texts in comparison to other datasets (Table 8 in Appendix A), which explains this result.
We follow Laban et al. (2021) and perform significance testing through bootstrap resampling (Efron, 1982), comparing the best method to the second-best method on each dataset. We perform interval comparison at p = 0.05 and p = 0.01, and find significantly better results on 6 datasets: 3 achieved by Q 2 and 3 by the ANLI-based model.
Given that no single method outperformed the rest on all datasets, we hypothesize that the NLI- and QG-QA-based metrics are complementary. We test this by averaging the Q 2 , ANLI and SC ZS scores per example (Ensemble in Table 3). Indeed, averaging the three methods yields better results on most datasets and on average, with an increase of 4.5 ROC AUC points over the best single-metric result.
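This per-example averaging amounts to a one-liner over normalized [0, 1] scores (a minimal sketch, mirroring the Q 2 + ANLI + SC ZS ensemble):

```python
def ensemble(*metric_scores):
    # Per-example mean of the per-metric scores; each argument is one metric's
    # list of scores over the same examples, assumed already normalized to [0, 1].
    return [sum(example) / len(example) for example in zip(*metric_scores)]
```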
Our results show that a single metric can do well across all tasks and datasets, with all 3 best metrics scoring higher than 80 on average over the 11 datasets (pairwise ensembles are reported in Table 9 in the appendix). This corroborates our hypothesis that evaluating factual consistency can be unified, and we hope such a unified perspective will be adopted in future work to accelerate progress on the subject.

Analysis
Input Length. As QA and NLI models may struggle with long inputs (Kočiský et al., 2018; Pang et al., 2021; Yin et al., 2021; Shaham et al., 2022), metrics based on them may fail when handling long text. To study the effect of input length on metric performance, we unify all datasets and split the examples into 6 bins according to the grounding length. We focus on the grounding, as the target texts are usually short (see Table 7 in Appendix A). We measure the AUC of the best 3 metrics according to their overall score for each length bin, sampling 1,000 examples per bin.
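The length-based binning can be sketched as below; the bin edges here are illustrative, as the paper does not list its exact boundaries, and whitespace tokens serve as a rough proxy for the tokenizer:

```python
def bin_by_length(examples, edges=(200, 300, 400, 500, 600)):
    # Split examples into len(edges) + 1 bins by grounding length in tokens.
    bins = [[] for _ in range(len(edges) + 1)]
    for ex in examples:
        n = len(ex["grounding"].split())  # whitespace tokens as a rough proxy
        i = sum(n > e for e in edges)     # index of the first bin whose edge exceeds n
        bins[i].append(ex)
    return bins
```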
The results are shown in Figure 1. We find a consistent degradation for texts longer than 200 tokens for all metrics, including SC ZS , which is designed to better handle long text. We find it surprising that the ANLI-based model and Q 2 still do relatively well on the longest bin (with AUC > 0.825), as they perform end-to-end QA and NLI on texts with more than 500 tokens.
Model Size. Model-based metrics are expected to benefit from increased model size. To quantify this, we study the effect of using smaller models for the ANLI, BLEURT and BERTScore metrics, comparing the average ROC AUC of larger and smaller model variants for each metric. The ablation results are in Table 4. We find an advantage of 4.7, 3.7 and 1.3 average ROC AUC points for the larger ANLI, BLEURT and BERTScore variants, respectively, showing that larger models indeed enable better factual consistency evaluation metrics, and hinting at potential improvements from using even larger models.
Qualitative Analysis. We conduct manual error analysis to point at weaknesses of the different metrics and present challenges posed by the task. We analyze 80 examples that were misclassified by all three best metrics, as well as 100 examples that were correctly classified by only one or two of the three.
Of the analyzed examples, many appear to have a wrong label. This is especially true for cases in which all best metrics failed, with annotation errors in 35/80 cases. For the cases where one or two metrics failed, we found annotation errors in 27/100 cases. To verify that the high annotation error rate is indeed a result of inspecting the "hardest" examples and not a general issue in the datasets we used, we uniformly sampled 100 additional examples, finding that only 10 had annotation errors. We therefore stress that the high misannotation rate indeed characterizes "hard" examples only, and is not a general property of the datasets we used. This is in line with the findings of Freitag et al. (2021), who showed that in some cases, metrics may be "better" than non-expert annotators. These findings demonstrate the potential of automatic methods for "cleaning" training data by filtering out factually inconsistent examples.
Despite showing impressive results, the best-performing metrics fail to detect subtle inconsistencies, as presented in Table 5. This was the case for 21/180 analyzed examples. Metrics that aggregate scores across parts of a target text, such as Q 2 or SC ZS , might assign a high score to texts in which all but a small part is consistent. End-to-end NLI should predict "contradiction" even when only a small part of the text contradicts the grounding, but it may fail to do so. Applying a stricter aggregation step, like taking the minimum instead of the average, could potentially remedy this, at the price of more false negatives. Other errors are caused by domain-specific challenges, such as handling personal statements in dialogues. As shown in Table 5, such statements may be falsely classified as ungrounded; this was the case for 10/62 analyzed dialogue responses. A possible way to alleviate this would be to automatically exclude non-factual parts from the evaluation.
Ensemble Analysis. As shown in §4, a simple averaging ensemble of the three best metrics achieves strong results, outperforming the individual metrics on most datasets. To understand this further, we analyze cases in which at least one of the best three metrics failed while the ensemble succeeded. Overall, there were 25,761 such cases. In 85.2% of them, two of the three metrics succeeded and only one failed; in 14.6%, one metric succeeded while the other two failed; and in only 0.2% of the cases did the ensemble succeed while all three metrics failed. The latter cases are a result of the different threshold used for the ensemble vs. the thresholds of the individual metrics. We sampled 100 of these examples and manually analyzed them. Of the sampled examples, 47% were misclassified by one metric only, where that metric assigned a borderline score, i.e., close to the decision threshold. A further 36% were misclassified by one metric only, but with a non-borderline score, i.e., the metric was far from a correct prediction. The remaining cases involve two, or even three, erroneous metrics.

Related Work
Adding to the related work mentioned throughout the paper, works on unified evaluation of text generation across tasks include GEM (Gehrmann et al., 2021), where the focus is on evaluating system outputs rather than factual consistency evaluation methods as in TRUE. BEAMetrics (Scialom and Hill, 2021) proposes meta-evaluation protocols across tasks, but does not focus on factual consistency; when discussing consistency ("correctness"), it measures correlations, which are not sufficient as discussed in Section 2.3. Chen et al. (2021) present an adversarial meta-evaluation for factual consistency evaluators, focused on summarization. Other works on meta-evaluation of factual consistency across datasets include GO-FIGURE (Gabriel et al., 2021), FRANK (Pagnoni et al., 2021), SummaC (Laban et al., 2021) and QAFactEval (Fabbri et al., 2021b); however, they all focus solely on summarization. Yeh et al. (2021) conduct a thorough assessment of dialogue metrics, though not specifically around factual consistency. To the best of our knowledge, our work is the first to generalize the discussion on evaluating factual consistency across tasks and datasets, and the first to show that large-scale QG-QA and NLI are strong and highly complementary, setting better baselines and meta-evaluation methodology for future work.

Table 5 (examples for the error analysis; the first two rows show challenging inconsistencies, the last two show dialogue responses containing non-factual personal statements):

Grounding: The word "philately" is the English version of the French word "philatélie", coined by Georges Herpin in 1864.
Generated text: The word philately is actually a french word coined by george herpin.
Explanation: The word philately is an English word based on a French word, but not a French word. All best metrics misclassified this.

Grounding: French police have interviewed presidential candidate francois fillon and his wife penelope over claims she was paid for fake work. They provided information that would help find the "truth", mr fillon said...
Generated text: French presidential candidate francois fillon has said he and his wife penelope have been questioned by police over claims she worked illegally.
Explanation: Most details are correct and the hallucination is subtle. In the case of Q 2 , most of the generated questions have the same answer based on the grounding and the generated text, therefore the overall score was high.

Grounding: Stamp collecting is generally accepted as one of the areas that make up the wider subject of philately, which is the study of stamps.
Generated text: I've never heard of stamps, but I do know that the word "philately" refers to the study of stamps.
Explanation: The personal statement "I've never heard of stamps" is not factual and should not be evaluated.

Grounding: Evidence suggests that cognitive behavioral therapy and a gradual increase in activity suited to individual capacity can be beneficial in some cases.
Generated text: It has been suggested that cognitive behavioral therapy and gradual increase in exercise could help in some cases so I'm going to try that for now.
Explanation: Similar to the previous examples, SummaC and ANLI falsely marked the text as inconsistent, probably due to the personal statement.

Discussion and Future Work
We discuss the main takeaways of the TRUE study, pointing at actionable insights for future work. First, as QG-QA and NLI-based methods show better performance than other approaches, especially when combined, we recommend that model developers use these methods for evaluation when factual consistency is a priority. As for metric developers, we recommend using these methods and the datasets in TRUE when evaluating new metrics.
We also suggest reporting ROC AUC rather than correlations, as it is more interpretable and actionable. Our proposed binary annotation scheme makes it easy to test new metrics across tasks and datasets, which should be useful for future work.
Finally, we encourage data curators to use the binary annotation scheme, which is in line with the recommendations of Rashkin et al. (2021a). Having said that, we do not rule out more detailed labeling schemes; rather, we ask that a protocol be provided for converting such labels into the more general binary format. Future work may also address the challenges of long inputs and personal statements in dialogue, which we point out in our analysis.

Conclusions
We presented TRUE, a survey and assessment of automatic factual consistency evaluation methods. We standardized various datasets from diverse tasks into a unified labeling scheme to perform a thorough comparison of automatic evaluation methods, showing that large-scale NLI- and QG-QA-based approaches perform well across multiple tasks and datasets. We further showed that these methods are highly complementary, hinting at additional headroom for improvement while pointing at current limitations. We hope our results and methodology will encourage a more unified perspective in future work, fostering progress towards more factually consistent NLP applications.

B Implementation Details
We train all models using the t5x library.

QG-QA For our reimplementation of Q 2 (Honovich et al., 2021), we use T5-11B as the pretrained model for QG, QA and NLI, while Honovich et al. (2021) used T5-Base, ALBERT (Lan et al., 2019) and RoBERTa (Liu et al., 2019) for the QG, QA and NLI models, respectively. We use a maximum input length of 2048 tokens. We set the F1 token overlap threshold to 0.54 by tuning it on a held-out dataset. We use beam search with a beam size of 4 to generate multiple questions, and use the first question that passes the validation threshold.
NLI We fine-tune a T5-11B model on ANLI (Nie et al., 2020) for 25K steps, with a learning rate of 10^-4 and a batch size of 32. During inference we use a maximum input length of 2048 tokens.

C ROC Curves
Figure 2 presents the ROC curves for the different datasets studied in TRUE, using the best-performing metrics.

Figure 1 :
Figure 1: ROC AUC when splitting TRUE's data according to the grounding length.

Table 2 :
Statistics for the datasets incorporated in TRUE. Cons. is the ratio of consistent examples.
Fact Verification

FEVER Thorne et al. (2018) introduced FEVER (Fact Extraction and VERification), a dataset for fact verification against textual sources. FEVER was constructed by generating claims using annotators, then labeling whether each claim is supported or refuted by Wikipedia. Claims can also be labeled with NotEnoughInfo, meaning that there is not enough information in Wikipedia to either verify or refute the claim. Given a claim, the task defined by FEVER is to first extract evidence, then to determine whether it supports or refutes the claim. In a slightly different framing, the latter stage in FEVER is to determine whether the claim is factually consistent or not w.r.t the evidence, which is aligned with what we aim to measure in TRUE. We use the development set of the NLI version of FEVER (Nie et al., 2019, 2020), treating supported claims as consistent and the rest as inconsistent.

VitaminC Schuster et al. (2021) derived a large-scale fact verification dataset from factual revisions to Wikipedia pages. Each example includes an evidence text from Wikipedia and a fact, with an annotation of whether the fact is supported, refuted or neutral w.r.t the evidence. The authors collected factual revisions to Wikipedia articles (pairs of "before" and "after" sentences), and asked annotators to write two facts for each pair: one that is supported by the first sentence and refuted by the second, and vice versa. When no explicit contradiction was present, the annotators wrote facts that are neutral w.r.t the evidence. Additional examples were created by revising examples from FEVER. We treat examples that include supported facts as consistent, and refuted or neutral facts as inconsistent.

Table 3 :
ROC AUC results for the different metrics on the TRUE development set. We exclude VitaminC and FEVER from the average calculation, as SC ZS was trained on VitaminC, which includes examples from FEVER. The highest score in each row (excluding the Ensemble) is in bold, and the aforementioned SC results are in strikethrough. Statistically significant results are indicated using * and ** for p < 0.05 and p < 0.01, respectively.

Table 5 :
Examples for the error analysis. The first two rows show cases of challenging inconsistencies, while the last two show dialogue responses containing non-factual personal statements.

Table 7 :
Generated text length statistics for TRUE.

Table 12 :
Accuracy results for the different metrics on the TRUE test set. Thresholds were tuned on the corresponding development sets. We exclude VitaminC from the average calculation, as SC ZS was trained on VitaminC. The highest score in each row (excluding the Ensemble) is in bold, and the aforementioned SC results are in strikethrough.