NonFactS: NonFactual Summary Generation for Factuality Evaluation in Document Summarization



Introduction
Over the last few years, there have been remarkable improvements in document summarization due to advances in pre-trained language models such as BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020a). However, these improvements are mainly measured with ROUGE scores, which assess the quality of a summary using n-gram overlap with references. Recent studies show that state-of-the-art models generate up to about 30% nonfactual summaries (Cao et al., 2018; Kryściński et al., 2019; Pagnoni et al., 2021), i.e., summaries that are not entailed by, or are factually inconsistent with, their source document. This demands an automatic evaluation metric for factuality in document summarization.[1]

[1] Codes and Models: github.com/asoleimanib/NonFactS

Figure 1: Overview of the proposed pipeline. Left: the NonFactS generator model is trained to generate a nonfactual summary given a reference factual summary and its corresponding context document (e.g., for a document about Shanghai, the factual summary "Shanghai has long been a unique city in China." and the nonfactual summary "Shanghai has long been a hub for housing."). Right: reference factual summaries and the generated nonfactual summaries are used to train a binary classifier to evaluate factuality in document summarization.
Factuality evaluation in document summarization is a notoriously difficult task, which is closely related to the Natural Language Inference (NLI) task. There have been different attempts to address this problem by revisiting NLI models (Utama et al., 2022; Laban et al., 2021). However, existing NLI datasets such as SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) do not fully encompass the factual inconsistencies within the summarization task. Moreover, NLI datasets cover sentence-level entailment, while premises in the summarization task are multi-sentence documents (Utama et al., 2022). On the other hand, NLI approaches need aggregation and, consequently, further in-domain data for training or for determining a decision threshold (Laban et al., 2021). In addition, collecting human-annotated nonfactual summaries or document-level entailment samples is extremely expensive. Therefore, training a document-level entailment classifier on ground-truth samples is not straightforward because of the lack of data.
A solution to overcome the lack of proper training data is to generate synthetic nonfactual summaries. There have been early attempts to do so using heuristic transformations, e.g., negation, entity swap, and noise injection (Kryscinski et al., 2020), which cover a limited range of possible factual inconsistencies. Recently, FALSESUM (Utama et al., 2022) leveraged a controllable text generation model to replace entity pairs (predicate, argument) in human-annotated reference summaries with new entity pairs. However, it requires extensive pre-processing, which impacts the quality of the generated samples and results in limited inconsistency variations. Therefore, we extend this line of research to introduce NonFactS, a data generation model to generate nonfactual summaries given a source document and a reference or random summary. We then train a binary classifier on these generated samples to evaluate factuality in document summarization. Figure 1 shows our proposed pipeline, the NonFactS generator and classifier.
NonFactS is trained to complete a truncated reference summary using inputs consisting of only the source document, the truncated reference summary, and a set of random words as Seeds. The Seeds are sampled from the document and from the removed part of the summary. In order to generate a nonfactual summary, the Seeds during the inference phase contain random words from the document only. All the words appearing in the reference summary are masked in the document. Figure 2 provides a detailed overview of our generator during training and inference. The contributions of this work are the following: First, we introduce a new model to generate nonfactual summaries using a source document and a factual reference summary. Nonfactual summaries are document-level and generated without language-dependent and error-prone pre-processing steps such as entity extraction and lemmatization (see Figure 3).
Second, our method significantly outperforms the state-of-the-art methods on the FALSESUM (Utama et al., 2022) and SUMMAC (Laban et al., 2021) benchmarks.
Third, we demonstrate that our method can still achieve high performance when human-annotated reference summaries are unavailable, by using only random sentences from source documents as a substitute.
Fourth, we conduct overlap, novel n-gram, and hypothesis-only analyses to compare NonFactS and FALSESUM regarding the abstractiveness and naturalness of their generated summaries.

Related Work
This section reviews existing methods for factuality evaluation and standard benchmarks for this task.

Entity-Based
Laban et al. (2021) introduce a Named Entity Recognition (NER) based method as a baseline to identify if the generated summary entities (e.g., person, location, organizations) are present in the corresponding source document. The quality of NER output significantly impacts the final performance. Dependency Arc Entailment (DAE) (Goyal and Durrett, 2020) is a more advanced model trained on a set of arcs in the dependency parse of generated outputs to classify the entailment decision for each arc with respect to the corresponding input. This approach is also significantly affected by the quality of the parser.
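To make the entity-based idea concrete, the NER-overlap baseline reduces to measuring what fraction of the summary's entities also appear in the source document. The sketch below is our own minimal illustration, not the authors' code: a real implementation would use a proper NER system (e.g., spaCy), whereas here a naive capitalized-token heuristic stands in for entity extraction, precisely to show how the quality of NER output drives the final score.

```python
def extract_entities(text):
    # Stand-in for a real NER system: naively treat capitalized tokens
    # as named entities (illustration only -- in practice this would be
    # spaCy or a comparable tagger).
    return {tok.strip(".,") for tok in text.split() if tok[:1].isupper()}

def entity_precision(summary, document):
    """Fraction of summary entities that also appear in the document.
    A low score flags a potentially nonfactual summary."""
    summary_ents = extract_entities(summary)
    if not summary_ents:
        return 1.0  # no entities to verify
    doc_ents = extract_entities(document)
    return len(summary_ents & doc_ents) / len(summary_ents)
```

Returning 1.0 when the summary contains no entities is one possible design choice; it also illustrates why such baselines cannot cover inconsistencies that involve no named entities at all.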

QAG
The Question Answer Generation (QAG) approach follows question generation, question answering, and answer matching steps. FEQA (Durmus et al., 2020) masks text spans (e.g., noun phrases, entities) in the summary, considers the spans as gold answers, and then generates questions for these gold answers. From there, a Question Answering (QA) model finds answers to the questions in the source document. F1 performance against the gold answers is taken as a faithfulness score. QuestEval (Scialom et al., 2021) combines a precision-oriented QAG method, with questions generated from the summary as in FEQA, and a recall-oriented metric, with questions generated from the source document as in SummaQA (Scialom et al., 2019). QAG cannot cover all types of factual inconsistency because it depends heavily on entities, and the generated questions are mostly factoid.
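The answer-matching step in these QAG pipelines is typically a token-level F1 between the gold answer span and the QA model's predicted answer. A minimal sketch of the standard SQuAD-style F1 (our own illustration, not the exact code of FEQA or QuestEval):

```python
from collections import Counter

def token_f1(gold, predicted):
    """Token-level F1 between a gold answer span and a QA model's
    predicted answer, as used for answer matching in QAG-style metrics."""
    gold_toks, pred_toks = gold.lower().split(), predicted.lower().split()
    common = Counter(gold_toks) & Counter(pred_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

A summary-level faithfulness score is then the mean of this F1 over all generated questions.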

Figure 2: Overview of NonFactS at the training and inference phases, respectively. Training: the input contains a context document, its truncated reference summary, and random words consisting of words from the document and words from the removed part of the summary. The BART model is trained using the reference summaries as targets. Inference: the input structure is the same as the training input, but the random words are chosen only from the document. In addition, the words in the document that appear in the reference summary are masked.

NLI
The NLI task is closely related to factuality evaluation in document summarization. However, premises and hypotheses in the existing NLI datasets such as SNLI and MNLI are sentences, while factuality evaluation in document summarization assumes document-sentence pairs. Falke et al. (2019) test five NLI models, comparing summaries against all sentences in their corresponding source document, and assume it is sufficient for a summary to be entailed by one source sentence. Laban et al. (2021) introduce a learnable aggregation method and show that their approach outperforms sentence-level entailment. In general, hypotheses need to be checked against multi-sentence, inter-sentence premises to be classified as entailment, contradiction, or neutral. Furthermore, while mean and max are parameter-free aggregators, learnable methods require additional training data and an in-domain validation set to choose a decision threshold. Document-level entailment pairs avoid such challenges. In order to generate document-level NLI samples, Kryscinski et al. (2020) propose a series of heuristic and rule-based transformations to the sentences of source documents. They introduce a factual consistency checking model (FactCC) that is trained on source documents and the generated sentence pairs. The transformations include paraphrasing to yield semantically-equivalent sentences, and negation, pronoun swap, entity swap, number swap, and noise injection to yield semantically-variant sentences. The rule-based nature of the FactCC dataset results in low diversity of factuality errors, and it aligns poorly with actual errors made by summarization models (Goyal and Durrett, 2021).
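A parameter-free aggregation of sentence-level entailment scores, in the spirit of the setup tested by Falke et al. (2019), can be sketched as follows. This is our own illustrative sketch (the scores would come from any sentence-level NLI model); it is not the learnable aggregation of Laban et al. (2021).

```python
def aggregate_entailment(score_matrix):
    """score_matrix[i][j] holds the probability (from a sentence-level
    NLI model, assumed given) that document sentence j entails summary
    sentence i. Each summary sentence keeps the max over document
    sentences (one supporting sentence suffices); the final score is
    the mean over summary sentences."""
    per_summary_sentence = [max(row) for row in score_matrix]
    return sum(per_summary_sentence) / len(per_summary_sentence)
```

A learnable aggregator would replace the fixed max/mean reduction with trained parameters, which is exactly why it needs extra in-domain data and a tuned decision threshold.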
FALSESUM (Utama et al., 2022) is a data generation pipeline that perturbs human-annotated reference summaries. It replaces predicate-argument entities in reference summaries with entities from their corresponding documents. While FALSESUM automatically generates nonfactual summaries, it requires a series of input pre-processing steps (see Figure 3), including entity extraction, span corruption, and lemmatization, which are error-prone and language-dependent.
Very recently and concurrently, there have been additional attempts at faithful summarization, by automatically generating a synthetic dataset of positive and negative references through corrupting supported reference sentences (Adams et al., 2022), and at factual consistency checking, by generating factually inconsistent summaries using source texts and reference summaries with key information masked (Lee et al., 2022).

Figure 3: NonFactS and FALSESUM input structures. NonFactS requires only one simple word extraction as pre-processing, while entity extraction, span corruption, and lemmatization are needed for FALSESUM.

NonFactS Method
In order to train a classifier to evaluate the factuality of summaries, we need a large set of factual and nonfactual summaries. Reference summaries in large summarization datasets such as CNN (Hermann et al., 2015) and XSUM (Narayan et al., 2018) can be used as factual summaries, but the problem is the lack of nonfactual summaries. NonFactS takes a set of source documents D and their corresponding reference factual summaries S+ and aims to generate a set of nonfactual summaries S−. The final goal is to train a classifier on pairs of factual and generated nonfactual summaries and their corresponding source documents. S− should be similar to actual summarizers' output and be indistinguishable from S+ using surface features. NonFactS is a text generator model taking as input I the concatenation of D, a truncated factual summary S+_truncated, and a list of random words Seeds. For training NonFactS, we set Seeds = {W_S, W_D}, meaning the random words consist of n random words W_S from S+_removed = S+ − S+_truncated and m random words W_D from D (see Figure 2). The model is then trained to generate S+. In other words, NonFactS is trained to select true words from Seeds to generate a sentence (summary) given the truncated version of that sentence and its corresponding context document. The input format is the following:

I = D </s> S+_truncated </s> Seeds,

where </s> is the separator token. To force the model to generate nonfactual sentences (S−) at inference time, Seeds are selected from D only (Seeds = {W_D}), and all the words appearing in S+ are also masked in D.
The reason to include S+_truncated in the input is to make S− more indistinguishable from S+. We set the length of S+_truncated to half of the length of S+; it can be either the first or the last half of the full sentence. In addition, our initial experiments showed that if Seeds contains only true words, it might result in low-quality S−, as the model has to complete S+_truncated using all of the words, which can be completely irrelevant to S+_truncated. Therefore, we include more words than needed in Seeds to force the model to select the more suitable ones. Note that Seeds contains only half of the words of S+_removed, to encourage the model to use the context information in D. The words are shuffled, and the set does not contain stop words.
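The input construction described above can be sketched roughly as follows. This is our own simplified reconstruction, not the released code: it always keeps the first half of the summary (the paper uses either half), uses a toy stop-word list, and omits the inference-time masking of reference-summary words in the document.

```python
import random

SEP = "</s>"
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "of", "and", "to", "has"}

def build_input(document, summary, training=True, n_doc_seeds=8, rng=None):
    """Build 'document </s> truncated summary </s> seed words'.
    Training: seeds mix random document words with half of the (non-stop)
    words removed from the summary. Inference: seeds come from the
    document only, pushing the generated completion to be nonfactual."""
    rng = rng or random.Random(0)
    words = summary.split()
    truncated = words[:len(words) // 2]
    removed = words[len(words) // 2:]
    doc_words = [w for w in document.split() if w.lower() not in STOP_WORDS]
    seeds = rng.sample(doc_words, min(n_doc_seeds, len(doc_words)))
    if training:
        removed_content = [w for w in removed if w.lower() not in STOP_WORDS]
        if removed_content:
            # only half of the removed words, to force use of document context
            seeds += rng.sample(removed_content,
                                max(1, len(removed_content) // 2))
    rng.shuffle(seeds)
    return f"{document} {SEP} {' '.join(truncated)} {SEP} {' '.join(seeds)}"
```

At inference time, `training=False` yields document-only seeds, matching the setup that forces the model to complete the truncated summary with information it cannot copy from the reference.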
We use BART-base (Lewis et al., 2020) as our generator model and the CNN summarization dataset as our training dataset. The training set has more than 287k samples, from which we randomly choose 50k samples for the inference phase. We split summaries into sentences, which results in about 900k training pairs (document, sentence). We use a batch size of 40 samples and a learning rate of 3e-5, and train the model for one epoch on 2 NVIDIA TITAN X Pascal GPUs (12GB memory) for about one day. Table 1 shows four nonfactual summaries generated by the NonFactS generator.

Table 1: Examples of nonfactual summaries generated by NonFactS.

Half Summary + Seeds: humanitarian groups expect 4,000 refugees in </s> understood + accountable + Ivoire + attacks + included + west + expecting + seven + volunteers + armed + occurred + Dourlot + Cote + reasons
Generated NonFactual Summary: Humanitarian groups expect 4,000 refugees in Cote d'Ivoire, U.N. spokesman says.

Document: For the second time during his papacy, Pope Francis has announced a new group of bishops and archbishops set to become cardinals -- and they come from all over the world. ... That doesn't mean Francis is the first pontiff to appoint cardinals from the developing world, though.
Reference Factual Summary: The 15 new cardinals will be installed on February 14.
Half Summary + Seeds: be installed on February 14. </s> canonized + reach + Kean + number + like + pontiff
Generated NonFactual Summary: The new pontiff will be installed on February 14.

Document: Rebels in Tripoli furiously hunting for signs of longtime Libyan leader Moammar Gadhafi are exploring a network of tunnels and bunkers built beneath his massive compound. CNN's Sara Sidner got a peek at the passageways Friday. She dubbed it "Gadhafi's inner sanctum." ...
Reference Factual Summary: CNN's Sara Sidner sees another world in a tunnel below Tripoli.
Half Summary + Seeds: world in a tunnel below Tripoli. </s> extend + walked + underground + shelf + occurred + thought + apparently + passages + air + recently
Generated NonFactual Summary: Rebels are exploring underground passages around the world in a tunnel below Tripoli.

Document: Criminals who file fraudulent tax returns by stealing people's identities could rake in an estimated 26 billion... But in testimony before Congress last year, National Taxpayer Advocate Nina Olson said those filters "inevitably block large numbers of proper refund claims" since there "is no easy way to distinguish proper claims from improper ones." In testimony prepared for Tuesday's hearing, Deputy IRS Commissioner Steven Miller said the agency cannot stop all identity theft. ...
Reference Factual Summary: The Treasury's estimate is the first detailed analysis of the ongoing problem.
Half Summary + Seeds: the Treasury's estimate is the first </s> detects + numbers + billion + 6 + cars + Security + agency + recently + Congress + 5
Generated NonFactual Summary: The Treasury's estimate is the first to be presented to Congress by the agency.
To evaluate the factuality of generated summaries, we choose RoBERTa (Liu et al., 2020) and ALBERT (Lan et al., 2020) as our default classification models and fine-tune them on a balanced dataset consisting of generated nonfactual summaries, reference factual summaries, and context documents (S = {S+, S−}, D).

Benchmark Results
We evaluate NonFactS on two factuality evaluation benchmarks, FALSESUM and SUMMAC. Performance is measured using Balanced Accuracy (BA):

BA = 1/2 (TP / (TP + FN) + TN / (TN + FP)),

where TP, FN, TN, and FP stand for true positives, false negatives, true negatives, and false positives, respectively. The majority-class performance for BA is 50.

Table 2 reports NonFactS's performance on the FALSESUM benchmark. For this benchmark, RoBERTa-base is fine-tuned on 100k factual/nonfactual samples augmented with MNLI. NonFactS achieves the best overall performance and outperforms on all datasets except QAGS. The table also reports NonFactS without augmentation data and shows that it still outperforms FALSESUM. QAGS categorizes non-grammatical sentences as non-consistent (nonfactual). We also manually investigated QAGS and found numerous non-grammatical, but factually correct, sentences labelled as nonfactual samples. We suspect that this phenomenon, together with the fact that we generate only grammatically correct sentences, might explain our seemingly lower performance on QAGS.

Table 3 compares different models' performance on the SUMMAC benchmark. The experimental setup in this benchmark does not limit the number of training samples or the size and type of the classification model. We fine-tune ALBERT (Lan et al., 2020) on our 200k balanced dataset. The SUMMAC model uses ALBERT-xlarge and larger datasets (MNLI and VitaminC (Schuster et al., 2021)). NonFactS achieves the best overall balanced accuracy. It is also considerably better on the CGS and SumEval datasets but performs poorly on XSF. We manually investigated XSF and suspect that the poor performance of our model and other models might be due to the high frequency of non-grammatical, noisy, and nonsensical sentences labelled as nonfactual (e.g., 'barron and his wife barron have moved from the white house to the white house'). This is also apparent from the NER-Overlap model, which is the second-best model on XSF, ahead of much more advanced models.
In contrast to the other datasets, XSF was mainly collected from the XSUM dataset. While this domain shift could explain the low performance, it does not appear to be the cause for our model: we experimented with NonFactS trained on our synthetic dataset based on XSUM and did not see a significant improvement.
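The balanced accuracy used on both benchmarks can be computed directly from the confusion-matrix counts:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity:
    BA = (TP / (TP + FN) + TN / (TN + FP)) / 2.
    A majority-class classifier scores 0.5 (i.e., 50)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))
```

Unlike plain accuracy, BA is insensitive to class imbalance, which matters here because benchmark datasets differ widely in their factual/nonfactual ratios.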

Fine-grained Analysis
In order to have high-quality nonfactual samples for training a binary classifier, nonfactual samples must not be identifiable by surface features. Table 4 compares NonFactS and FALSESUM regarding the similarity of factual and generated nonfactual samples. NonFactS's nonfactual samples are much more similar to factual samples in terms of ROUGE scores and BERTScore (Zhang et al., 2020b). In addition, inspired by Gururangan et al. (2018) and Utama et al. (2022), we perform a hypothesis-only experiment: the classifier is trained and evaluated on summaries only, without any access to the context documents. The goal is to understand to what extent the factuality of generated summaries can be determined using semantic plausibility and spurious surface features (e.g., grammatical mistakes or fluency errors). Table 5 indicates that NonFactS-generated summaries are marginally better than FALSESUM-generated summaries in hypothesis-only factuality evaluation. We also manually investigated 100 randomly sampled generated nonfactual summaries and found that 85% are correctly labelled as nonfactual. This is almost the same as the manual verification reported for FALSESUM (Utama et al., 2022).
We study the ability of the same classifier (ALBERT-xlarge) fine-tuned on the NonFactS/FALSESUM datasets to evaluate factuality on FALSESUM/NonFactS. The rest of the variables, such as the number of training samples, are the same as in our default setup. Table 6 indicates that NonFactS yields better performance on FALSESUM.
We investigate the performance of the NonFactS factuality evaluation model based on the level of abstractiveness of summaries. We use different metrics to partition summaries by their lexical overlap with their context documents. The Overlap Score is defined as the product of the density, i.e., the percentage of words in a summary that are present in the context document, and the normalized coverage, i.e., the percentage of a summary that is a continuous fragment of the context document (Utama et al., 2022; Grusky et al., 2018). We also use the percentage of novel n-grams in summaries, i.e., the percentage of summary n-grams that are not present in the context document. Higher values for the overlap score and lower values for the percentage of novel n-grams correspond to higher overlap and more extractive summaries. Comparing NonFactS and FALSESUM regarding the overlap score and percentage of novel n-grams, both generated datasets cover more abstractive than extractive summaries. However, NonFactS contains more abstractive samples, as is evident from the higher frequency of lower overlap scores. NonFactS also has more samples with a higher percentage of novel 4-grams and trigrams, while FALSESUM covers more novel bigrams and unigrams. To study the effect of summary extractiveness, we evaluate our model on the FALSESUM and SUMMAC benchmarks. Figure 5 indicates the higher performance of NonFactS over FALSESUM on more abstractive summaries (lower overlap scores) on both benchmarks, which is in line with NonFactS containing more abstractive samples.
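The novel n-gram percentage can be computed as below; this is our own minimal sketch of the standard definition, not the exact evaluation code.

```python
def novel_ngram_pct(summary, document, n):
    """Percentage of distinct summary n-grams absent from the document;
    higher values indicate a more abstractive summary."""
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngrams(document)
    return 100.0 * len(novel) / len(summary_ngrams)
```

A fully extractive summary scores 0 for every n, while a heavily paraphrased one approaches 100 as n grows.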

Zero Reference Analysis
In this section, we consider the case in which there is no access to human-annotated reference summaries (factual summaries) for training a model to generate nonfactual summaries. This is a realistic scenario, for example, when one has no access to reference summaries in a new domain. We use randomly selected sentences from context documents as factual reference summaries for those documents. Next, we train the NonFactS generator with the same procedure explained in Section 3 to generate nonfactual summaries. Note that during the training and inference phases, we remove the randomly selected sentences from the documents to avoid trivial solutions and maintain the abstractive summarization setting. The same numbers of documents (230k/50k) are used for training and inference. Documents during inference are sampled more than once to provide more samples (200k, 400k, 1000k) for training the classifier.
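Constructing one zero-reference training pair then amounts to the following sketch (the function and variable names are illustrative, not from the paper's code):

```python
import random

def make_zero_reference_pair(document_sentences, rng=None):
    """Pick a random sentence as a pseudo factual reference and remove
    it from the document, so the pair cannot be solved by trivially
    matching an extractive copy."""
    rng = rng or random.Random(0)
    idx = rng.randrange(len(document_sentences))
    pseudo_reference = document_sentences[idx]
    reduced_document = " ".join(
        s for i, s in enumerate(document_sentences) if i != idx)
    return reduced_document, pseudo_reference
```

Sampling a document more than once with different random sentences is what allows the classifier's training set to grow to 400k or 1000k pairs from the same 230k documents.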
To single out model and dataset effects, we experiment with both RoBERTa and ALBERT, and with CNN and XSUM as training and inference datasets. The default case (presence of reference summaries) is limited in the number of training samples for the classifier (at most 400k samples). Figure 6 compares the performance of the factuality evaluation models in the presence and absence of reference summaries on the FALSESUM and SUMMAC benchmarks (see Appendix for detailed results). On both benchmarks, zero-reference models reach or outperform reference models after training on 400k random factual samples and their corresponding nonfactual summaries. This superiority is much more evident for the ALBERT models. In addition, the figure shows that CNN-based models perform better on both benchmarks, which is to be expected as both benchmarks consist mostly of CNN-based datasets. However, we see that the ALBERT models trained on CNN or XSUM random samples converge towards each other. Therefore, the effect of in-domain datasets vanishes as the model is trained on more samples.

Conclusion
We introduced NonFactS, a data generation model to generate large-scale nonfactual summaries. NonFactS only requires context documents and reference summaries as factual summaries. To evaluate factuality in document summarization, we used a binary classifier trained on a balanced dataset of factual and generated nonfactual summaries. Our model outperforms prior work on two standard benchmarks, FALSESUM and SUMMAC.
Compared to previous methods, NonFactS generates nonfactual samples without requiring extensive language-dependent pre-processing steps. Moreover, our generated samples are more abstractive and more similar to their factual references, and are therefore harder to identify based on spurious surface features and semantic plausibility.
Additionally, we demonstrated that NonFactS is capable of generating nonfactual summaries without the need for human-annotated reference summaries by utilizing randomly selected sentences from context documents. Our experiments indicated that a classifier trained on these generated samples achieves comparable performance to a classifier trained on human-annotated samples and their generated nonfactual pairs.

Limitations
NonFactS generates grammatically correct nonfactual summaries. However, in practice, summaries can be non-grammatical, noisy, and nonsensical, which can limit how well our performance generalizes to such cases. Additionally, the hypothesis-only results show that a considerable number of samples are identified correctly without their context document. The reason may be knowledge memorized by pre-trained classifiers, or surface features and semantic plausibility.

Broader Impact
Our model has no direct environmental impacts or fairness or privacy considerations. However, it is important to note that it must not be used as a fact-checking tool, as there is a potential risk that false statements may be labelled as true. Our classifier evaluates the factuality of a summary based on a context document, and if the document is misleading, the summary can be "factual" with respect to misleading information. Additionally, NonFactS generates nonfactual summaries, which might pose risks if misused for generating massive numbers of nonfactual summaries (claims). Addressing such risks is an open issue in the field and is not specific to our work.

Responsible NLP Checklist

A1. Did you describe the limitations of your work?
Limitations section, just after the conclusion.

A2. Did you discuss any potential risks of your work?
Broader Impact section, just after the Limitations section.

A3. Do the abstract and introduction summarize the paper's main claims?
Abstract; Introduction (Section 1).

A4. Have you used AI writing assistants when working on this paper?
Left blank.
B. Did you use or create scientific artifacts?
Section 3.

B1. Did you cite the creators of artifacts you used?
Section 3.

B2. Did you discuss the license or terms for use and/or distribution of any artifacts?
The datasets we used are two well-known datasets in the field and are publicly available and free to use for academic and research purposes. When we publish our dataset, we will provide its licence and respect the previous licences.

B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
The CNN and XSUM summarization datasets are public and free to use for academic and research purposes. We completely respect their intended use. When we publish our dataset, we will provide its licence and respect the previous licences. We consider our contribution compatible with the original datasets.

B4. Did you discuss the steps taken to check whether the data that was collected/used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect/anonymize it?
Not applicable. The data we use have already been checked by their authors.

B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?
Not applicable. Our data are the same as the datasets we use, and therefore we only cite the corresponding works.

B6. Did you report relevant statistics like the number of examples, details of train/test/dev splits, etc. for the data that you used/created? (Even for commonly-used benchmark datasets, include the number of examples in train/validation/test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.)
Section 3.
The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.