Interpretable Automatic Fine-grained Inconsistency Detection in Text Summarization

Existing factual consistency evaluation approaches for text summarization provide binary predictions and limited insights into the weaknesses of summarization systems. Therefore, we propose the task of fine-grained inconsistency detection, the goal of which is to predict the fine-grained types of factual errors in a summary. Motivated by how humans inspect factual inconsistency in summaries, we propose an interpretable fine-grained inconsistency detection model, FineGrainFact, which explicitly represents the facts in the documents and summaries with semantic frames extracted by semantic role labeling, and highlights the related semantic frames to predict inconsistency. The highlighted semantic frames help verify predicted error types and correct inconsistent summaries. Experimental results demonstrate that our model outperforms strong baselines and provides evidence to support or refute the summary.


Introduction
Prior work (Fabbri et al., 2022b; Goyal and Durrett, 2020; Laban et al., 2022) formulates the problem of factual inconsistency detection as a binary classification task, which predicts whether a summary is consistent with the source document. However, these approaches have two drawbacks. First, they cannot predict the types of factual errors made by a summary and thus provide limited insights into the weaknesses of summarization systems. Although recent studies (Pagnoni et al., 2021; Tang et al., 2022; Goyal and Durrett, 2021a) have manually inspected the types of factual errors in summaries, there is no existing work on automatic detection of fine-grained factual inconsistency.
Second, existing models typically cannot explain which portions of the document are used to detect the inconsistency in the input summary. In order to verify and correct an inconsistent summary, humans still need to read the entire source document to find the supporting evidence. Kryscinski et al. (2020) introduce an auxiliary task to extract the supporting spans in the document for inconsistency detection, which requires expensive ground-truth labels of supporting spans.
To address the first limitation, we propose the fine-grained factual inconsistency detection task. The goal is to predict the types of factual inconsistency in a summary. We show examples of different factual error types in Table 1.
To address the second limitation, we further introduce an interpretable fine-grained inconsistency detection model (FINEGRAINFACT) that does not require any labels of supporting text spans, inspired by how humans verify the consistency of a summary. When humans annotate the factual error types of a summary, they first identify facts in the document that are relevant to the summary and then determine the factual error types in the summary. Following this intuition, our model first extracts facts from the document and summary using Semantic Role Labeling (SRL). We consider each extracted semantic frame as a fact, since a semantic frame captures a predicate and its associated arguments to answer the question of "who did what to whom". After fact extraction, a document fact attention module enables the classifier to focus on the facts in the document that are most related to the facts in the summary. By highlighting the facts in the document with the highest attention scores, our model can explain which facts in the document are most pertinent to inconsistency detection.
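To make the fact-extraction step concrete, the sketch below runs an off-the-shelf BERT-based SRL predictor over a sentence and groups the predicted BIO tags into semantic frames. It is a minimal illustration rather than our exact pipeline: the AllenNLP model path and the frame-grouping details are assumptions.

```python
# Minimal sketch of SRL-based fact extraction (illustrative; the exact
# AllenNLP model path may differ from the one used in our experiments).
from allennlp.predictors.predictor import Predictor

SRL_MODEL = ("https://storage.googleapis.com/allennlp-public-models/"
             "structured-prediction-srl-bert.2020.12.15.tar.gz")
predictor = Predictor.from_path(SRL_MODEL)

def extract_facts(sentence: str):
    """Return one semantic frame (predicate + tagged arguments) per verb."""
    out = predictor.predict(sentence=sentence)
    facts = []
    for verb in out["verbs"]:
        # BIO tags align with the tokenized words; keep predicate and ARG spans.
        frame = [(tag, word) for tag, word in zip(verb["tags"], out["words"])
                 if tag != "O"]
        facts.append({"predicate": verb["verb"], "frame": frame})
    return facts

print(extract_facts("David was using FaceTime and saw the flames."))
```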
Experiment results show that our model outperforms strong baselines in detecting factual error types. Moreover, the document facts highlighted by our model can provide evidence to support or refute the input summary, which can potentially help users verify the predicted error types and correct an inconsistent summary.
Intrinsic noun phrase error: errors that misrepresent object(s), subject(s), or prepositional object(s) from the source article.
Example summary: David was using FaceTime with Marcy Smith and saw the flames.

Extrinsic predicate error: errors that add new main verb(s) or adverb(s) that cannot be inferred from the source article.
Example summary: David was eating and saw the flames.

Intrinsic predicate error: errors that misrepresent main verb(s) or adverb(s) from the source article.
Example summary: David was engulfed and saw the flames.

Table 1: A text document and example summaries with different factual error types, according to the typology defined by Tang et al. (2022). The errors in the sample summaries are in red and italicized. We bold the text spans from the document that refute the sample summaries.

Task Definition
The goal of the fine-grained inconsistency detection task is to predict the types of factual errors in a summary. We frame it as a multi-label classification problem as follows. Given a pre-defined set of l factual error types {e_1, ..., e_l}, a document d, and a summary s, the goal is to predict a binary vector y ∈ {0, 1}^l, where each element y_i indicates the presence of one type of factual error. We follow the typology of factual error types proposed by Tang et al. (2022), which includes intrinsic noun phrase error, extrinsic noun phrase error, intrinsic predicate error, and extrinsic predicate error. The definitions and examples of these error types are presented in Table 1.
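For concreteness, the snippet below shows one way the multi-label target vector y could be constructed from a set of annotated error types; the ordering of the four error types is our own illustrative choice.

```python
ERROR_TYPES = ["intrinsic noun phrase", "extrinsic noun phrase",
               "intrinsic predicate", "extrinsic predicate"]

def encode_labels(annotated_errors: set) -> list:
    """Map a set of annotated error types to a binary target vector y."""
    return [1 if e in annotated_errors else 0 for e in ERROR_TYPES]

# A summary with an intrinsic noun phrase and an extrinsic predicate error:
y = encode_labels({"intrinsic noun phrase", "extrinsic predicate"})
assert y == [1, 0, 0, 1]
```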

Our FINEGRAINFACT Model
The model architecture is illustrated in Figure 1.
Fact extraction. To represent facts from the input document and summary, we extract semantic frames with a BERT-based semantic role labeling (SRL) tool (Shi and Lin, 2019). A semantic frame contains a predicate and its arguments. We use f^doc_i and f^sum_i to denote the i-th fact in the document and summary, respectively.

Fact encoder. We first represent tokens in the concatenated sequence of the input document and summary by fusing hidden states across all layers of Adapter-BERT (Houlsby et al., 2019) with max pooling. To represent facts, we apply attentive pooling to all tokens in a semantic frame, under the assumption that different tokens in a fact should contribute differently to the fact representation. Given the token representations t_j, we calculate attention scores α_j = exp(ϕ(t_j)) / Σ_{j'=1}^{m} exp(ϕ(t_{j'})) and represent each document or summary fact as f = Σ_{j=1}^{m} α_j t_j, where m is the number of tokens in the fact and ϕ is a two-layer fully-connected network.
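A minimal PyTorch sketch of the attentive-pooling fact encoder follows; the hidden size and the exact form of the two-layer scoring network ϕ are assumptions.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Pool the m token vectors of one semantic frame into a fact vector."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        # phi: a two-layer fully-connected network producing a scalar score
        self.phi = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (m, hidden) -> attention scores alpha: (m, 1)
        alpha = torch.softmax(self.phi(tokens), dim=0)
        # fact representation f = sum_j alpha_j * t_j
        return (alpha * tokens).sum(dim=0)

fact_vec = AttentivePooling()(torch.randn(5, 768))  # one 5-token frame
```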
Document fact attention module. This module aims to retrieve the facts in the document that are related to the facts in the summary. We first concatenate the document fact representations into a document fact matrix F^doc. We attend each summary fact f^sum_i to the document fact matrix with multi-head attention to compute a document context vector c_i = MultiHead(f^sum_i, F^doc, F^doc), where f^sum_i acts as the query and F^doc is used as the key and value. The document context vector c_i captures the information of the facts in the document that are related to the summary fact f^sum_i.
For each document fact, we sum up the attention scores it receives from all summary facts as its importance score. Concretely, we use α_{j→i} to denote the sum of attention scores, over all attention heads, assigned by the j-th summary fact to the i-th document fact. The importance score of a document fact f^doc_i is then defined as Σ_{j=1}^{n} α_{j→i}, where n is the total number of facts in the summary. Finally, we return the top k document facts with the highest importance scores as the document fact highlights, where k is a hyper-parameter.
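The sketch below illustrates the document fact attention module with a standard multi-head attention layer, including the summation over heads and summary facts that scores the document facts. It assumes a recent PyTorch (for the average_attn_weights flag) and illustrative batch shapes.

```python
import torch
import torch.nn as nn

class DocFactAttention(nn.Module):
    """Attend summary facts to document facts and score document facts."""
    def __init__(self, hidden: int = 768, heads: int = 16):
        super().__init__()
        self.mha = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, f_sum: torch.Tensor, f_doc: torch.Tensor, k: int = 5):
        # f_sum: (1, n, hidden) summary facts; f_doc: (1, N, hidden) doc facts
        ctx, attn = self.mha(f_sum, f_doc, f_doc,
                             need_weights=True, average_attn_weights=False)
        # attn: (1, heads, n, N); summing over heads and summary facts
        # yields one importance score per document fact, shape (N,)
        importance = attn.sum(dim=(0, 1, 2))
        highlights = importance.topk(min(k, f_doc.size(1))).indices
        return ctx, importance, highlights
```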
Classification module. A linear classifier predicts the probability of each factual error type based on the concatenation of the representations of summary facts and document context vectors. Specifically, we first use mean pooling to fuse all summary fact representation vectors and all document context vectors into two fixed-size vectors, f^sum = (1/n) Σ_{i=1}^{n} f^sum_i and c = (1/n) Σ_{i=1}^{n} c_i. These two vectors contain the information of all facts in the summary and of all document facts that are related to the summary. Next, we feed the concatenation of f^sum and c to a linear classification layer to predict the probability of each factual error type: ŷ = σ(W [f^sum; c] + b).

Training objective. We train our model with a weighted binary cross-entropy (BCE) loss; the technical details are in Appendix A.
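A sketch of the classification module under these definitions (the hidden size and number of error types are illustrative):

```python
import torch
import torch.nn as nn

class ErrorTypeClassifier(nn.Module):
    """Mean-pool summary facts and context vectors, then classify."""
    def __init__(self, hidden: int = 768, num_types: int = 4):
        super().__init__()
        self.linear = nn.Linear(2 * hidden, num_types)

    def forward(self, f_sum: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # f_sum, ctx: (n, hidden) summary fact and document context vectors
        pooled = torch.cat([f_sum.mean(dim=0), ctx.mean(dim=0)])  # (2*hidden,)
        return torch.sigmoid(self.linear(pooled))  # per-type probabilities
```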

Setup
Dataset. We conduct experiments on the Aggrefact-Unified dataset (Tang et al., 2022), which collects samples and unifies factual error types from four manually annotated datasets (Maynez et al., 2020; Pagnoni et al., 2021; Goyal and Durrett, 2021b; Cao and Wang, 2021). We remove duplicated samples (i.e., duplicated document-summary pairs) from Aggrefact-Unified and obtain 4,489 samples. We randomly split the data into train/validation/test sets of size 3,689/300/500. The statistics of the error type labels are in Appendix B.1.
Evaluation metrics. We adopt the macro-averaged F1 score and balanced accuracy (BACC) as the evaluation metrics. BACC is an extension of accuracy for class-imbalanced datasets and is widely adopted by previous literature on inconsistency detection (Kryscinski et al., 2020; Laban et al., 2022). All experiment results are averaged across four random runs.
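One plausible way to compute the two metrics with scikit-learn, assuming 0/1 label matrices of shape (num_samples, num_error_types); the exact implementation is not specified in the paper.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5):
    """Macro-F1 and label-averaged balanced accuracy for multi-label outputs."""
    y_pred = (y_prob >= threshold).astype(int)
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    # BACC is defined per binary label; average it across the error types.
    bacc = np.mean([balanced_accuracy_score(y_true[:, i], y_pred[:, i])
                    for i in range(y_true.shape[1])])
    return macro_f1, bacc
```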
Baselines. We adapt the following baselines for the new task. FACTCC-MULTI: FactCC (Kryscinski et al., 2020) is originally trained on synthetic data for binary inconsistency detection; we replace the binary classifier with a multi-label classifier and fine-tune the model on Aggrefact. FACTGRAPH-MULTI: FactGraph (Ribeiro et al., 2022) parses each sentence into an AMR graph and uses a graph neural network to encode the document; we replace the binary classifier with a multi-label classifier. We also fine-tune BERT (Devlin et al., 2019) and ADAPTERBERT (Houlsby et al., 2019) as additional baselines.

Performance of Error Type Detection
Following Tang et al. (2022), we detect error types in summaries from different models: SOTA includes pre-trained language models published in or after 2020; XFORMER contains Transformer-based models published before 2020; OLD includes earlier RNN- or CNN-based models; REF represents reference summaries. From Table 2, we observe that: (1) Representing facts with semantic frames improves factual error type prediction. In most cases, our model outperforms the baselines that do not use semantic frames to represent facts. (2) The performance of our model drops after we remove the document fact attention module. This shows that the document fact attention module not only improves interpretability but also boosts the performance of factual error type detection.
(3) All detection models perform better on summaries generated by OLD systems. This suggests that the factual errors made by OLD systems are relatively easier to recognize than the errors made by more advanced systems.

Evaluation of Document Fact Highlights
Since ground-truth document fact highlights are not available, we use a fact verification dataset to evaluate whether the predicted document fact highlights provide evidence for inconsistency detection. Specifically, we adopt the FEVER 2.0 dataset (Thorne et al., 2018), which consists of claims written by humans and evidence sentences from Wikipedia that support or refute the claims. We first extract facts from the evidence sentences via SRL and use them as the ground-truth document fact highlights. We then consider each claim as the input summary and the section of the Wikipedia article that contains the evidence sentences as the input document.
We devise the following method to compute document fact highlights for the baseline models. Since all baselines utilize the CLS token to predict the factual error types, we use the attention scores received from the CLS token to compute an importance score for each document fact. We then return the facts that obtain the highest importance scores as the document fact highlights for each baseline. More details are in Appendix B.2. Table 3 presents the recall scores of the document fact highlights predicted by different models. We observe that our model obtains substantially higher recall scores, which demonstrates that our model provides more evidence to support the inconsistency prediction. Thus, compared with the baselines, our model better enables users to verify the predicted error types and correct inconsistent summaries.
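As a rough illustration, per-sample recall could then be computed over normalized fact strings as below; the exact matching criterion between predicted and ground-truth frames is not specified here, so this is only one plausible scheme.

```python
def highlight_recall(predicted_facts: set, gold_facts: set) -> float:
    """Fraction of ground-truth evidence facts recovered by the predicted
    document fact highlights (exact match after string normalization)."""
    if not gold_facts:
        return 0.0
    return len(predicted_facts & gold_facts) / len(gold_facts)
```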

Case Study
Table 4 shows a sample summary generated by an OLD model with an intrinsic noun phrase error, where "a school in northern ireland" in the summary contradicts "Northern Ireland charity" in the document. Our model accurately predicts the error type and provides evidence in the form of document fact highlights, which helps users verify the error and correct the summary.
In Table 5, we present an error analysis on a sample summary generated by a SOTA model. According to the source text, the word "West" in the summary is incorrect and should be removed, since the statement in the summary is made by "Sussex PCC", not "West Sussex PCC". In order to
detect this error, a model needs to understand that the expressions "Sussex PCC Katy Bourne", "Ms Bourne", and "she" in the document refer to the same entity. This sample illustrates that the errors generated by a SOTA model are more subtle and more difficult to detect. Our model fails to predict the correct error type for this sample. Since the top five document fact highlights returned by our model do not contain the entity "Sussex PCC Katy Bourne", we suspect that our model fails to recognize the co-referential relations among "Sussex PCC Katy Bourne", "Ms Bourne", and "she" for this sample. Thus, improving the co-reference resolution ability of fine-grained inconsistency detection models is a potential future direction.

Related Work
Factual consistency metrics. QA-based consistency metrics (Durmus et al., 2020; Scialom et al., 2021; Fabbri et al., 2022b) involve generating questions from the given document and its summary, and then comparing the corresponding answers to compute a factual consistency score. Entailment-based consistency metrics (Laban et al., 2022; Kryscinski et al., 2020; Ribeiro et al., 2022; Goyal and Durrett, 2020) utilize a binary classifier to determine whether the contents in a system summary are entailed by the source article. In contrast, our model is a multi-label classifier that detects the types of factual errors in a summary. Moreover, our model leverages SRL to encode the facts in the input document and summary, enabling users to interpret which facts in the document are most relevant to the inconsistency detection.
Fact-based evaluation methods. To evaluate the informativeness of a summary, the Pyramid human evaluation protocol (Nenkova and Passonneau, 2004) asks annotators to extract semantic content units (SCUs) from the system summary and reference summary, respectively, and then compute their overlap. Each SCU contains a single fact. Xu et al. (2020) approximate the Pyramid method by using SRL to extract facts. They then compute the embedding similarity between the facts extracted from the system summary and those from the reference summary. Fischer et al. (2022) also use SRL to extract facts, but they measure the similarity between the facts extracted from the system summary and those from the source document to compute a faithfulness score. In contrast, our model integrates SRL with a multi-label classifier to predict the factual error types of a summary.

Conclusion
In this paper, we present a new task of fine-grained inconsistency detection, which aims to predict the types of factual inconsistencies in a summary. Compared to the previous binary inconsistency detection task, our new task can provide more insights into the weaknesses of summarization systems. Moreover, we propose an interpretable fine-grained inconsistency detection model, which represents facts from documents and summaries with semantic frames and highlights highly relevant document facts. Experiments on the Aggrefact-Unified dataset show that our model can better identify factual error types than strong baselines. Furthermore, results on the FEVER 2.0 dataset validate that the highlighted document facts provide evidence to support the inconsistency prediction.

Limitations
Although our model allows users to interpret which parts of the input document are most relevant to the model's prediction, it does not allow users to interpret which text spans of the input summary contain errors. Take the summary in Table 4 as an example: if the model could indicate that the text span "a school in northern ireland" contains an error, it would be easier for users to correct the summary, potentially benefiting factual error correction systems (Fabbri et al., 2022a; Huang et al., 2023). Kryscinski et al. (2020) introduced an auxiliary task to extract erroneous text spans in summaries, but their method requires expensive ground-truth text span labels. Locating incorrect text spans in summaries without requiring span-level training labels remains unexplored. Another limitation of our model is that it does not allow users to interpret the uncertainty of the prediction results (Deutsch et al., 2021).

Ethical Considerations
The factual error types and document fact highlights predicted by our model can help users correct factually inconsistent summaries. Since factually inconsistent summaries often convey misinformation, our model can potentially help users combat misinformation. However, the factual error types predicted by our model may be incorrect. For example, it is possible that an input summary contains extrinsic noun phrase errors, but our model predicts the error type of intrinsic predicate error. Hence, users still need to be cautious when using our model to detect and correct inconsistent summaries. The Aggrefact-Unified dataset contains public news articles from CNN, DailyMail, and BBC. Hence, the data that we used does not have privacy issues.

A Details of Training Objective
Since some error types may have an imbalanced distribution of positive and negative samples, we apply sample weighting to the training objective. We first weigh the loss for the positive samples according to their proportion in the training set. Then we sum up the binary cross-entropy loss of each error type as the training objective. The weighted binary cross-entropy (BCE) loss of our model is formally defined as

L = − Σ_{i=1}^{l} ( β_i y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ),

where β_i is the weight for positive samples of the i-th error type and ŷ_i is the predicted probability of the i-th error type. We set β_i to be the ratio of the number of positive samples to the number of negative samples of the i-th error type in the training data.
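A direct PyTorch transcription of this loss follows; the mean reduction over the batch is our assumption.

```python
import torch

def weighted_bce(y_prob: torch.Tensor, y_true: torch.Tensor,
                 beta: torch.Tensor) -> torch.Tensor:
    """Weighted BCE summed over error types; beta_i weights positive samples."""
    eps = 1e-8  # numerical stability for log
    per_type = -(beta * y_true * torch.log(y_prob + eps)
                 + (1 - y_true) * torch.log(1 - y_prob + eps))
    return per_type.sum(dim=-1).mean()  # sum over types, average over batch

# beta_i = (#positive / #negative) for error type i in the training data, e.g.:
# beta = torch.tensor(n_pos / n_neg)
```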

B Experiment Details B.1 Aggrefact-Unified Dataset
This dataset contains news documents from CNN/DM (Nallapati et al., 2016) and XSum (Narayan et al., 2018). In addition to the four factual error types presented in Table 1, the Aggrefact-Unified dataset also provides labels for intrinsic entire-sentence errors, extrinsic entire-sentence errors, and entire-sentence errors.
We map intrinsic (extrinsic) entire-sentence errors to intrinsic (extrinsic) noun phrase and intrinsic (extrinsic) predicate errors. We also map the entire-sentence error to all four types of factual errors. Statistics of the factual error type labels are shown in Table 6, and statistics of summaries generated by different systems are shown in Table 7.

B.2 Computing Document Fact Highlights for Baselines

We use α_{CLS→i} to denote the sum of attention scores, over all attention heads, from the CLS token to the i-th token of a document semantic frame in the last layer of the baseline model. Then we compute the importance score of the fact as (1/m) Σ_{i=1}^{m} α_{CLS→i}, where m is the number of words in the fact. Finally, we return the document facts with the highest importance scores as the document fact highlights.
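The following sketch computes such a CLS-based importance score with Hugging Face Transformers; the positions of a fact's tokens in the encoded sequence (fact_token_ids) are assumed to be precomputed.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

def fact_importance(text: str, fact_token_ids: list) -> float:
    """Average CLS->token attention (last layer, summed over heads) over
    the tokens of one document fact."""
    enc = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        attn = model(**enc).attentions[-1][0]  # (heads, seq, seq), last layer
    cls_to_tokens = attn.sum(dim=0)[0]         # sum heads; row 0 = CLS queries
    return cls_to_tokens[fact_token_ids].mean().item()
```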

B.3 Hyper-parameter Settings
To compute F1 and BACC scores, we set the classification threshold to 0.5. The dimension of the adapter in the Adapter-BERT model is set to 32. The number of attention heads in our document fact attention module is set to 16; we search for the optimal number of attention heads in {1, 4, 8, 16} based on the highest BACC score on the validation set. We train our models for 40 epochs and select the checkpoint that obtains the highest BACC score on the validation set. We set the learning rate to 1e-5. The training batch size is 12 with gradient accumulation over 2 steps. The Adapter-BERT, BERT, and FineGrainFact models receive the same amount of hyper-parameter tuning.

B.4 Hardware and Software Configurations
We run all experiments on a single NVIDIA V100 GPU. It takes around 1 hour and 50 minutes to train our model for 40 epochs. Our model contains 113.1M parameters in total. We only need to train 3.6M of these parameters, since most parameters are frozen in the Adapter-BERT model. We obtain the BERT-base-uncased checkpoint from Huggingface (Wolf et al., 2019). We adopt the implementation of the BERT-based SRL model (Shi and Lin, 2019) provided by AllenNLP (Gardner et al., 2018) to conduct semantic role labeling (Palmer et al., 2005).

C Results on Different Summarization Datasets and Error Types
In Table 8, we separate the F1 scores obtained by our FINEGRAINFACT model according to the summarization dataset and the type of factual errors.
We observe that our model has relatively low performance (< 50%) on detecting intrinsic errors (intrinsic noun phrase and intrinsic predicate errors) on the XSum dataset. We analyze the reason as follows. According to previous studies (Durmus et al., 2020), system summaries on the XSum dataset tend to have high abstractiveness (low textual overlap with the source document). We suspect that our FINEGRAINFACT model learns a spurious correlation suggesting that an inconsistent summary with high abstractiveness contains extrinsic rather than intrinsic errors. Addressing this spurious correlation is a critical future direction.

D Generalization Ability Analysis
To more robustly evaluate the generalization ability of inconsistency detection models, we further construct a challenging data split in which there are no overlapping systems or documents between the test set and the training set. We first gather all the samples that contain a summary generated by the BART model (Lewis et al., 2020) ...

Table 9: Performance of fine-grained inconsistency detection models on the challenging data split (%).

E Scientific Artifacts
We list the licenses of the scientific artifacts used in this paper: AllenNLP (Apache License 2.0), Huggingface Transformers (Apache License 2.0), and FACTCC (BSD-3-Clause License). We apply the above artifacts according to their official documentation. We will release an API of our model for research purposes. Our API can be applied to detect fine-grained factual error types in summaries written in English.


Table 2:
Performance of fine-grained consistency detection models on summaries generated by different systems (%). "− Doc. Fact Attention" indicates that we remove the document fact attention module and use mean pooling to fuse all document semantic representation vectors.
Source text: Children in P6 and P7 will learn how to cope with change under the Healthy Me programme developed by Northern Ireland charity, Action Mental Health ... The charity is now hoping the programme will be rolled out in schools across Northern Ireland ...

Table 4:
Sample outputs of our FINEGRAINFACT model in the Aggrefact-Unified dataset. The error in the sample summary is in red and italicized.
Source text: The move is part of national fire service reforms unveiled by Home Secretary Theresa May last week. Sussex PCC Katy Bourne said emergency services would have an increased duty to collaborate under the new bill. But West Sussex County Council (WSCC) said it already had an excellent model. East Sussex's fire authority said it would co-operate with the PCC but it believed collaboration could be achieved without elaborate structural change. Ms Bourne said she had written to WSCC leader Louise Goldsmith and Phil Howson, East Sussex Fire Authority chairman, to request they begin to look at the feasibility of bringing both fire services under her authority. ...
3. [ARG0 they] [V begin] [ARG1 to look at the feasibility of bringing both fire services under her authority]
4. [ARG0 they] [V look] [ARG1 at the feasibility of bringing both fire services under her authority]
5. [ARG0 she] [V request] [ARG1 they begin to look at the feasibility of bringing both fire services under her authority]

Table 5:
Incorrect output sample of our FINEGRAINFACT model in the Aggrefact-Unified dataset, following the typology of Tang et al. (2022). The error in the sample summary is in red and italicized. We bold the text spans from the document that refute the sample summary.

Table 6:
Statistics of fine-grained error types in the AggreFact-Unified dataset.

Table 7:
Statistics of summaries generated by different systems in the AggreFact-Unified dataset.

Table 8:
The F1 score results of the FINEGRAINFACT model on each summarization dataset and factual error type (%).