Gradient-Based Adversarial Factual Consistency Evaluation for Abstractive Summarization

Neural abstractive summarization systems have made significant progress in recent years. However, abstractive summarization models often produce inconsistent statements or false facts. How can we automatically generate highly abstractive yet factually correct summaries? In this paper, we propose an efficient weakly supervised adversarial data augmentation approach to construct a factual consistency dataset. Based on this artificial dataset, we train an evaluation model that not only makes accurate and robust factual consistency judgments but is also capable of interpretable factual error tracing via the back-propagated gradient distribution over token embeddings. Experiments and analysis conducted on public annotated summarization and factual consistency datasets demonstrate that our approach is effective and reasonable.


Introduction
Text summarization aims to produce a condensed version of the source document while retaining its salient information. Abstractive summarization is a branch of methods in which the generated text is not constrained to tokens that appear in the source. These methods have been extensively studied for their flexibility and generalization ability (See et al., 2017; Paulus et al., 2018; Gehrmann et al., 2018; Dong et al., 2019a). However, a challenge in abstractive summarization is the trade-off between abstractiveness and factual consistency. Recent studies show that about 30% of the summaries generated by abstractive models contain factual errors with respect to their source documents, and the proportion rises further as data abstractiveness increases (Cao et al., 2018; Durmus et al., 2020; Kryscinski et al., 2020), making factual checking an essential step for verifying the credibility and usability of models.

* Work done while at ByteDance.

Table 1 (example):
Article: The Swift Archway Cranford 545 caravan was stolen from a site in Yaxley, Cambridgeshire, on Thursday night. Davis tweeted "My touring caravan was stolen.. even though it was locked up with hitch & wheel lock!". (...) Davis has played the role of Professor Flitwick in the Harry Potter films and Nikabrik in The Chronicles of Narnia: Prince Caspian. (...)
Claim: A caravan locked by Harry has been stolen from a site in cambridgeshire.
Reference: A caravan locked by Davis has been stolen from a site in cambridgeshire.

In Table 1, we present an inconsistent generation example, where the blue part supports factual consistency and the red part leads to a factual error.
Previous approaches for detecting or boosting factual consistency can be divided into three kinds.
(1) Employ information extraction tools to extract facts and leverage them by building additional objectives (Cao et al., 2018; Goodrich et al., 2019; Zhang et al., 2020a; Zhu et al., 2021). (2) Use natural language inference or question answering models for fact checking and correction (Falke et al., 2019; Durmus et al., 2020; Chen et al., 2020). (3) Train a factual consistency evaluation model on artificial datasets generated by rule-based transformations (Kryscinski et al., 2020; Cao et al., 2020). Most of the above approaches focus on factual consistency evaluation. Some explore using pretrained language models (Devlin et al., 2019; Dong et al., 2019b) to make end-to-end fact corrections. However, fact correction through text generation may further increase uncertainty. By comparison, tracing factual errors by explicitly marking out the latent inconsistent tokens in generated summaries can provide more reliable and interpretable information. It is of significant value in real scenarios but has attracted less research attention. In this paper, we propose a robust weakly supervised factual consistency evaluation model and a gradient-based factual error tracing strategy. Specifically, we construct artificial datasets based on benchmark summarization datasets to train the model. In addition to the rule-based transformations proposed by Kryscinski et al. (2020) and Cao et al. (2020), we propose an implicit augmentation that obtains hard factual inconsistent examples via adversarial attack. It alleviates the problem of oversimplified negative samples and therefore improves model performance and robustness. Further, we propose a novel strategy to trace factual errors based on the gradient distribution, without adding any parameters. The analysis of gradients also provides stronger interpretability for the factual consistency evaluation results.
Our contributions are three-fold: (1) We propose an efficient adversarial data augmentation approach to generate weakly supervised samples for factual consistency evaluation. (2) We design a novel strategy to trace factual errors by utilizing the gradient distribution. (3) Experiments and analysis conducted on various datasets show the effectiveness and interpretability of our proposed methods.

Methods
Fig. 1 shows the overall architecture of our factual consistency evaluation model. We adopt RoBERTa as the feature extractor f(·). Given a sequence x = {x_1, x_2, ..., x_{L_x}} formed by concatenating a source document and the corresponding summary, we encode tokens into representations r_i = f(e_i), where e_i = E(x_i) denotes the embedding process. We add a simple linear layer on top of the representation of the [CLS] token to calculate the binary cross-entropy loss L_ce(f(e; θ), Y), where Y ∈ {Consistent, Inconsistent}.
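As a minimal sketch of the classification head, the following toy code computes a consistency probability and the binary cross-entropy loss from a [CLS] representation. The random vector stands in for the RoBERTa extractor f(·), and all names here are our own illustration, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cls_head_bce(cls_repr, w, b, y):
    """Binary consistency score from the [CLS] representation.

    cls_repr: (d,) encoder output for the [CLS] token (stand-in for f(.)).
    w, b:     parameters of the single linear layer.
    y:        1 for Consistent, 0 for Inconsistent.
    Returns (probability, binary cross-entropy loss L_ce).
    """
    p = sigmoid(cls_repr @ w + b)
    loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return p, loss

rng = np.random.default_rng(0)
d = 768                                 # RoBERTa-base hidden size
cls_repr = rng.normal(size=d)           # placeholder for f(e)[CLS]
w, b = rng.normal(size=d) / np.sqrt(d), 0.0
p, loss = cls_head_bce(cls_repr, w, b, y=1)
```

In the real model the loss is back-propagated through the full encoder; here the head alone illustrates how the scalar L_ce is obtained.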
In the following, we describe how the artificial datasets are built, the model's training process, and our proposed error tracing strategy.

Artificial Dataset. We follow recent methods that generate inconsistent samples through rule-based transformations (Kryscinski et al., 2020; Cao et al., 2020). The source document s and the corresponding target summary t are treated as a consistent example x_p = {s, t}. After corrupting part of the original summary (the yellow highlighted part in Fig. 1), the source and the pseudo summary t' form an inconsistent example x_n = {s, t'}. Three types of strategies are used to corrupt the original summaries. (1) Entity swapping: replace a random entity in the reference summary with another random entity of the same type from the same source document. To alleviate the bias caused by synonyms, we apply an empirical threshold on the similarity between the original and pseudo entities based on a simple distance algorithm. (2) Pronoun swapping: replace a random pronoun with another one of matching syntactic case. (3) Negation: transform a random auxiliary verb into its negative form.

Adversarial Augmentation. It has been pointed out that a classifier trained only on artificial datasets works well on easy examples but can hardly generalize to actual scenarios (Zhang et al., 2020b). To alleviate this, we apply an adversarial attack mechanism (Goodfellow et al., 2015; Kurakin et al., 2016; Miyato et al., 2017; Yan et al., 2020; Meng et al., 2020) to the rule-based pseudo samples as data augmentation. For the token embeddings e of a sample, we try to find a worst-case perturbation vector v that maximizes the loss function:

v = argmax_{‖v‖ ≤ ε} L_ce(f(e + v; θ), Y),

where ε is the norm bound of the perturbation. Because of the complexity of neural models, it is intractable to compute this perturbation precisely.
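To make the rule-based corruption concrete, here is a toy sketch of the negation strategy (the word list and function names are our own illustration; the paper's transformations operate on parsed summaries, not raw token lists):

```python
import random

# Hypothetical minimal negation table: auxiliary verb -> negative form.
NEGATIONS = {
    "is": "is not", "are": "are not", "was": "was not",
    "were": "were not", "can": "cannot", "will": "will not",
}

def negate(summary_tokens, rng):
    """Corrupt a summary by negating one random auxiliary verb.

    Returns a corrupted copy, or None if the summary has no auxiliary
    verb (in which case another corruption strategy would be tried).
    """
    candidates = [i for i, tok in enumerate(summary_tokens) if tok in NEGATIONS]
    if not candidates:
        return None
    i = rng.choice(candidates)
    corrupted = list(summary_tokens)
    corrupted[i] = NEGATIONS[corrupted[i]]
    return corrupted

rng = random.Random(0)
t = "the caravan was locked".split()
t_neg = negate(t, rng)   # -> ['the', 'caravan', 'was not', 'locked']
```

The pair (s, t) keeps the Consistent label, while (s, t_neg) becomes a weakly supervised Inconsistent example.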
Instead, we apply Fast Gradient Value (FGV) (Rozsa et al., 2016) to approximate the worst-case perturbation:

v ≈ ε · g / ‖g‖, where g = ∇_e L_ce(f(e; θ), Y).

The gradient g is the first-order differential of L_ce, representing the direction that most rapidly increases the loss. We normalize the gradient and use a small norm bound ε to keep the approximation reasonable. For inconsistent samples, the perturbations on the corrupted tokens are masked by a filter; we explain the reason below. Note that Fig. 1 only shows an inconsistent example; for consistent examples, perturbations on all tokens are retained. We denote the filtered perturbation as v_p and add it to e to obtain new token embeddings e' = e + v_p, which can be regarded as an augmented sample. We feed it to the model again to obtain another loss L_adv with the same label. Finally, we train the model with the weighted sum of the two losses:

L = α L_ce + (1 − α) L_adv.

Error Tracing. We propose a novel factual error tracing strategy using the back-propagated gradients g. Instead of introducing more neural network layers and parameters, our method can be regarded as an inherent by-product of adversarial augmentation. Let ΔL = L_adv − L_ce ≥ 0; the overall loss can then be simplified as:

L = L_ce + (1 − α) ΔL.

As the model is optimized, ΔL tends to zero, which reduces the loss change caused by adversarial perturbations, so the representations of perturbed tokens tend to remain relatively stable within their neighborhood of the high-dimensional space. For inconsistent samples, since the perturbations on corrupted tokens are masked, these tokens retain their sensitivity to loss changes. The gradient therefore shows relatively higher values on the corrupted tokens, as the loss changes faster when this part changes.
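A rough sketch of the FGV step with the corrupted-token filter follows. The toy mean-pool linear scorer stands in for the full encoder (whose gradient would come from autodiff); everything except the formula v ≈ ε·g/‖g‖ is our own illustrative assumption:

```python
import numpy as np

def fgv_perturbation(emb, w, y, eps, mask=None):
    """Fast Gradient Value perturbation v ~= eps * g / ||g||.

    emb:  (L, d) token embeddings e.
    w:    (d,) weights of a toy linear scorer over mean-pooled embeddings
          (stand-in for the encoder; in practice g comes from backprop).
    y:    gold label (1 = Consistent, 0 = Inconsistent).
    eps:  norm bound of the perturbation.
    mask: (L,) 0/1 filter; 0 zeroes the perturbation on corrupted tokens.
    """
    L = emb.shape[0]
    p = 1.0 / (1.0 + np.exp(-(emb.mean(axis=0) @ w)))
    # dL_ce/d emb for BCE with a mean-pool + linear scorer: (p - y) * w / L
    g = (p - y) * np.outer(np.ones(L), w) / L
    v = eps * g / (np.linalg.norm(g) + 1e-12)
    if mask is not None:
        v = v * mask[:, None]          # filtered perturbation v_p
    return v

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 16))
w = rng.normal(size=16)
mask = np.array([1, 1, 0, 0, 1, 1])    # tokens 2-3 are the corrupted span
v_p = fgv_perturbation(emb, w, y=0, eps=6e-3, mask=mask)
emb_adv = emb + v_p                    # augmented sample e' = e + v_p
```

Feeding emb_adv through the model with the same label yields L_adv, which is combined with L_ce in the weighted sum above.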
Generalizing this phenomenon to the test process, the model can use the gradient distribution to trace factual errors. The cross-entropy loss is back-propagated to the embedding layer to obtain a gradient distribution, and a top-k algorithm is used to filter candidate error tokens on samples predicted as inconsistent. We conduct quantitative analysis and visualization in Section 4 to demonstrate the effects.
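The top-k selection over per-token gradients can be sketched as follows (a toy illustration; in the real model token_grads comes from back-propagating L_ce to the embedding layer):

```python
import numpy as np

def trace_errors(token_grads, k=5):
    """Return indices of the k summary tokens with the largest gradient norms.

    token_grads: (L, d) gradient of the cross-entropy loss w.r.t. the
                 embedding of each token, obtained by backpropagation.
    """
    norms = np.linalg.norm(token_grads, axis=1)
    return np.argsort(norms)[::-1][:k].tolist()

# Toy check: give tokens 4 and 7 the largest gradients.
g = np.full((10, 8), 0.01)
g[4], g[7] = 1.0, 0.8
suspects = trace_errors(g, k=2)   # -> [4, 7]
```

The flagged indices are the candidate factual error tokens reported to the user.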

Experimental Setup
We perform experiments on two benchmark text summarization datasets, CNN/DM (Nallapati et al., 2016) and XSUM (Narayan et al., 2018). Weakly supervised training data is generated as described in Section 2. Models are evaluated in two ways: (1) against the source documents and the ground-truth references of the datasets, which are all positive examples; (2) against the manual factual consistency annotations provided on CNN/DM (Kryscinski et al., 2020) and XSUM (Maynez et al., 2020). We report accuracy, balanced accuracy, and macro F1-score.
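Since the artificial test sets are dominated by consistent examples, balanced accuracy (the mean of per-class recalls) is the key metric; a minimal implementation for reference:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: robust when one class (here,
    Consistent) heavily outnumbers the other."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Imbalanced toy labels: plain accuracy is 5/6, balanced accuracy penalizes
# the missed minority-class example more heavily.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1]
ba = balanced_accuracy(y_true, y_pred)   # (4/4 + 1/2) / 2 = 0.75
```

Macro F1 is computed analogously by averaging per-class F1 scores.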
We compare our evaluation model with strong baselines, including FactCC and FactCCX (Kryscinski et al., 2020) and FEC.

Implementation details
We finetune our model on the public pre-trained RoBERTa-base model, which has 12 layers, 768 hidden states, and 12 attention heads. The max input length is 512. Adam is used for optimization with an initial learning rate of 1e-5, and the batch size is 16. We train for up to 3 epochs, evaluating on the validation set every 1000 steps. The weight between the two losses ranges from 0 to 1; we empirically set α = 0.5 to adapt to the general situation. The amplitude of the adversarial perturbation is obtained by a heuristic search in the range 2e-3 to 1e-2. Within this range, the influence of the amplitude on model performance is less than 3%, and 6e-3 gives the best performance. Each experiment is run five times under the same setting and we report the average. Training lasts about 2.0 hours per epoch on four Tesla V100-SXM2 (32GB) GPUs. The model has 476M trainable parameters.

Results
Table 2 shows the main results. Our evaluation model gains higher accuracy on both datasets' ground-truth references, significantly outperforming FactCC and FactCCX. Since our model corrupts the reference summary rather than a fragment of the source document, it fits abstractive summarization better. On the factual consistency annotation dataset of CNN/DM, our model significantly outperforms FEC by 2.5% (accuracy), 6.8% (balanced accuracy), and 3.6% (macro-F1), and shows results close to the previous state-of-the-art model FactCCX. On XSUM, our model gains consistent improvements on all metrics. However, every model performs worse on XSUM than on CNN/DM, indicating that higher abstractiveness makes factual consistency evaluation more difficult. Besides, we conduct an ablation experiment on adversarial augmentation. The result shows that implicitly augmenting data through adversarial attack significantly benefits the evaluation, and the improvement on CNN/DM is more pronounced. This confirms that rule-based augmented data can only simulate simple situations. In general, our proposed evaluation model is more suitable for the factual consistency evaluation of abstractive text summarization.

Analysis
Case Study.
Table 3 shows case studies of error tracing. We display some inconsistent samples from the artificial test set constructed on CNN/DM. In the original text, the blue highlighted part represents the original span appearing in the ground truth (if any), and the orange highlighted part represents the pseudo span used to corrupt the ground truth. We normalize the predicted gradient distribution and use varying degrees of red to mark the tokens with the top-5 gradient values; brighter red represents a larger gradient.
We find that the gradients show high values on the corrupted part, which indicates that our method provides instructive error tracing results and is robust to different error types. Further, the analysis of the gradient distribution explicitly explains the factual consistency evaluation result, improving the interpretability of the predictions. Example pairs from Table 3 include:

Source article fragment: (...) Creams such as Arnicare for pain relief or liquids such as Sidda Flower Essences for male virility are part of a $2.9 billion business that has seen "explosive growth," according to the FDA. These drugs do not go through the same level of scrutiny as over-the-counter and prescription drugs. (...)
Model generated claim: Drugs do not go through the same level of scrutiny as over-the-counter and prescription drugs.

Source article fragment: (...) Rabbis Mendel Epstein, 69; Jay Goldstein, 60; and Binyamin Stimler, 39, were found guilty on one count of conspiracy to commit kidnapping in New Jersey federal court. Goldstein and Stimler were also convicted on charges of attempted kidnapping. (...)
Model generated claim: Rabbis mendel epstein, 69, and binyamin stimler, 39, were found guilty on one count of conspiracy to commit kidnapping in new jersey federal court.

Source article fragment: (...) In his remarks at an antivaccination movie screening, he decided to compare "vaccine-induced" autism to the Holocaust. He said, "They get the shot, that night they have a fever of a hundred and three, they go to sleep, and three months later their brain is gone," Kennedy said. (...)
Model generated claim: He decided to compare "vaccine-induced" autism to the holocaust, he says.

Quantitative Analysis.
Table 4 shows the quantitative results of our error tracing method. We collect the gradient distribution on the token embeddings of inconsistent samples in the CNN/DM artificial test set and treat the tokens with the top-k gradient values as predicted factual errors. For a range of k, we compute token recall and span recall, where token recall gives credit for recovering any portion of the error tokens, while span recall requires recovering all tokens of an error span. We treat the model without adversarial augmentation as a baseline.
The results indicate that with adversarial augmentation, error tracing gains consistent improvements at both the token level and the span level. When k = 3, more than 70% of spans and more than 80% of tokens are recalled. When k is smaller, the recall improvement brought by adversarial augmentation is relatively significant. Besides, although span recall is lower due to its stricter requirement, our method achieves a greater relative improvement on it. In summary, we have shown that (1) effective error detection can be carried out through the gradient distribution, and (2) our proposed adversarial augmentation optimizes the gradient distribution. Table 5 shows some inconsistent cases from the CNN/DM factual consistency annotation dataset (Kryscinski et al., 2020) on which our model makes wrong predictions. The blue part represents the content covered by the claim, and the red part denotes the content the claim neglects or expresses incorrectly. We found that these examples all overlap heavily with the source text, with errors occurring only in very small parts. This is consistent with FactCC's data construction strategy, but its universality in abstractive summarization tasks needs further study.
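The token recall and span recall used here can be sketched as follows (function and variable names are our own illustration):

```python
def token_and_span_recall(predicted, gold_spans):
    """Recall metrics for gradient-based error tracing.

    predicted:  set of token indices flagged by the top-k gradient values.
    gold_spans: list of corrupted spans, each a set of token indices.
    Token recall credits every gold error token recovered; span recall
    credits a span only when ALL of its tokens are recovered.
    """
    gold_tokens = set().union(*gold_spans)
    token_recall = len(gold_tokens & predicted) / len(gold_tokens)
    span_recall = sum(span <= predicted for span in gold_spans) / len(gold_spans)
    return token_recall, span_recall

pred = {3, 4, 9}                 # top-k flagged tokens
gold = [{3, 4}, {8, 9}]          # two corrupted spans
tr, sr = token_and_span_recall(pred, gold)   # tr = 3/4, sr = 1/2
```

The example shows why span recall is the stricter metric: span {8, 9} is only partially recovered, so it counts toward token recall but not span recall.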

Human Evaluation.
Table 6 shows the human evaluation results of our models on the CNN/DM artificial dataset. Following Kryscinski et al. (2020), we randomly sample 500 data pairs from the artificial test set and highlight the tokens with the top-5 predicted gradient values. Annotators judge the factual consistency and indicate whether the highlighted content helps their judgment. In addition, we compute the Pearson correlation between the human factual consistency judgments and the labels predicted by the model. The oracle setting uses the ground-truth labels and corrupted spans, which sets an upper bound for the evaluation. The results show that the adversarial mechanism significantly improves both the usefulness of the error tracing information and the evaluation performance.

Conclusion
Abstractive summarization models are susceptible to generating factual inconsistencies. To improve the robustness and interpretability of factual consistency evaluation, we proposed an implicit data augmentation method based on adversarial attack to construct hard factual inconsistent examples, together with a gradient-based factual error tracing strategy that provides auxiliary information. Experiments conducted on public datasets demonstrate the effectiveness of our models, and extensive analysis further reveals the role of the error tracing strategy.

Broader Impact
Abstractive summarization systems have demonstrated remarkable performance across a wide range of applications, with the promise of a significant positive impact on how people work and live. However, due to excessive abstractiveness, current models often produce unfaithful generations, which may mislead human judgment and impair the safety of models in practical applications, thus severely restricting the development of the technology. In domains with the greatest potential for societal impact, such as news, models should recognize factual errors to avoid harmful influence. Our work focuses on the robustness and interpretability of the factual consistency evaluation model as a step towards the ultimate goal of enabling the safe real-world deployment of abstractive summarization systems.