Extractive is not Faithful: An Investigation of Broad Unfaithfulness Problems in Extractive Summarization

The problems of unfaithful summaries have been widely discussed in the context of abstractive summarization. Though extractive summarization is less prone to the common unfaithfulness issues of abstractive summaries, does that mean extractive is equal to faithful? It turns out that the answer is no. In this work, we define a typology with five types of broad unfaithfulness problems (including and beyond not-entailment) that can appear in extractive summaries: incorrect coreference, incomplete coreference, incorrect discourse, incomplete discourse, and other misleading information. We ask humans to label these problems out of 1600 English summaries produced by 16 diverse extractive systems. We find that 30% of the summaries have at least one of the five issues. Turning to automatic detection, we find that 5 existing faithfulness evaluation metrics for summarization correlate poorly with human judgment. To remedy this, we propose a new metric, ExtEval, that is designed for detecting unfaithful extractive summaries and is shown to have the best performance. We hope our work can increase the awareness of unfaithfulness problems in extractive summarization and help future work to evaluate and resolve these issues.


Introduction
Text summarization is the process of distilling the most important information from a source to produce an abridged version for a particular user or task (Maybury, 1999). Although there are many types of text summarization tasks, we focus on the task of general purpose single document summarization, for which typical benchmarks are CNN/DM (Hermann et al., 2015), XSum (Narayan et al., 2018a), etc. To produce summaries, usually either extractive summarization methods, i.e., extracting sentences from the source, or abstractive summarization methods, i.e., generating novel text, are applied (Saggion and Poibeau, 2013).
Footnotes: * Equal contribution. 1 Our data and code are publicly available at https://github.com/ZhangShiyue/extractive_is_not_faithful
Extractive summarization is known to be faster, more interpretable, and more reliable (Chen and Bansal, 2018; Li et al., 2021; Dreyer et al., 2021). Moreover, selecting important information is the first skill that humans learn for summarization (Kintsch and van Dijk, 1978; Brown and Day, 1983). Recently, some works have discussed the trade-off between abstractiveness and faithfulness (Ladhak et al., 2022; Dreyer et al., 2021). They find that the more extractive a summary is, the more faithful it is; e.g., Dreyer et al. (2021) describe being extractive as being trivially factual (faithful). This may give the community the impression that content extracted from the source is guaranteed to be faithful. However, is this always true? In this work, we show that, unfortunately, it is not.
The problems of extractive summarization are usually referred to as coherence or out-of-context issues (Nenkova and McKeown, 2012; Saggion and Poibeau, 2013; Dreyer et al., 2021). Though they may sound irrelevant to faithfulness, some early works hint at their unfaithful ingredients. Gupta and Lehal (2010) describe the 'dangling' anaphora problem: sentences often contain pronouns that lose their referents when extracted out of context, and stitching together extracts may lead to a misleading interpretation of anaphora. Barzilay et al. (1999) comment that, in extractive methods for multi-document summarization, extracting some similar sentences could produce a summary biased towards some sources. Cheung (2008) says that sentence extraction produces extremely incoherent text that did not seem to convey the gist of the overall controversiality of the source. These all suggest that even though all information is extracted directly from the source, the summary is not necessarily faithful to the source. However, none of these works has proposed an error typology or quantified how unfaithful model-extracted summaries are, which motivates us to fill in this missing piece.
In this work, we conduct a thorough investigation of the broad unfaithfulness problems in extractive summarization. Although the literature on abstractive summarization usually limits unfaithful summaries to those that are not entailed by the source (Maynez et al., 2020; Kryscinski et al., 2020), we discuss broader unfaithfulness issues including and beyond not-entailment. We first design a typology consisting of five types of unfaithfulness problems that could happen in extractive summaries: incorrect coreference, incomplete coreference, incorrect discourse, incomplete discourse, and other misleading information (see definitions in Figure 2). Among them, incorrect coreference and incorrect discourse are not-entailment based errors. An example of incorrect coreference is shown in Summary 1 of Figure 1, where "that" in the second sentence should refer to the second document sentence, "But they do leave their trash", but it incorrectly refers to the first sentence in the summary. Summaries with incomplete coreferences or discourses are usually entailed by the source, but they can still lead to unfaithful interpretations. Lastly, inspired by misinformation (O'Connor and Weatherall, 2019), our misleading information error type refers to other cases where, despite being entailed by the source, the summary still misleads the audience by selecting biased information, giving the readers wrong impressions, etc. Please refer to Section 2 for detailed definitions and discussions.
We ask humans to label these problems out of 1500 model-extracted summaries that are produced by 15 extractive summarization systems for 100 CNN/DM English articles (Hermann et al., 2015). These 15 systems cover both supervised and unsupervised methods, include both recent neural-based and early graph-based models, and extract sentences or elementary discourse units. Please refer to Section 3 for more human evaluation details. By analyzing human annotations, we find that 32.6% of the 1500 summaries have at least one of the five types of errors. Of all the summaries, 4% and 15.5% contain incorrect and incomplete coreferences respectively, 1.2% and 10.9% have incorrect and incomplete discourses respectively, and another 4.9% still mislead the audience without having coreference or discourse issues. The non-negligible error rate demonstrates that extractive is not necessarily faithful. Among the 15 systems, we find that the two oracle extractive systems (that maximize ROUGE (Lin, 2004) against the gold summary by using extracted discourse units or sentences) surprisingly have the most problems, while the Lead3 model (the first three sentences of the source document) causes the fewest issues.
Next, we examine whether these problems can be automatically detected by 5 widely-used existing metrics, including ROUGE (Lin, 2004) and four faithfulness evaluation metrics for abstractive summarization (FactCC (Kryscinski et al., 2020), DAE (Goyal and Durrett, 2020), QuestEval (Scialom et al., 2021), BERTScore (Zhang et al., 2020b)). We find that they mostly have either no or small correlations with human labels. Nonetheless, BERTScore performs relatively better. We design a new faithfulness metric, EXTEVAL, for extractive summarization. It contains four sub-metrics that are used to detect incorrect coreference, incomplete coreference, incorrect or incomplete discourse, and other misleading information, respectively. We show that EXTEVAL performs best at detecting unfaithful extractive summaries (see Section 4 for more details). Finally, we discuss the potential future solutions to these problems (e.g., decontextualization and near-extractive summarization, that makes minimal edits based on errors found by EXTEVAL) as well as the generalizability of our work in Section 5.
Figure 1 (example summaries):
Summary 1 (incorrect coreference): (CNN) Most climbers who try don't succeed in summiting the 29,035-foot-high Mount Everest, the world's tallest peak. That's why an experienced climbing group from the Indian army plans to trek up the 8,850-meter mountain to pick up at least 4,000 kilograms (more than 8,000 pounds) of waste from the high-altitude camps, according to India Today. The mountain is part of the Himalaya mountain range on the border between Nepal and the Tibet region.
Summary 2 (incomplete coreference & incorrect discourse): That's why an experienced climbing group from the Indian army plans to trek up the 8,850-meter mountain to pick up at least 4,000 kilograms More than 200 climbers have died to clean up the trash left by generations of hikers
Summary 3 (incomplete discourse & incomplete coreference): But they do leave their trash. Thousands of pounds of it. The mountain is part of the Himalaya mountain range on the border between Nepal and the Tibet region. The 34-member team plans to depart for Kathmandu on Saturday and start the ascent in mid-May. The upcoming trip marks the 50th anniversary of the first Indian team to scale Mount Everest.
In summary, our contributions are (1) a taxonomy of broad unfaithfulness problems in extractive summarization, (2) a human-labeled evaluation set with 1500 examples from 15 diverse extractive systems, (3) a meta-evaluation of 5 existing metrics, (4) a new faithfulness metric (EXTEVAL) specifically designed for extractive summarization. Overall, we want to remind the community that even when the content is extracted from the source, there is still a chance to convey unfaithful information. Hence, we should be aware of these problems, be able to detect them, and eventually resolve them to achieve a more faithful and reliable summarization.

Broad Unfaithfulness Problems
In this section, we will describe the five types of broad unfaithfulness problems (Figure 2) we identified for extractive summarization under our typology. In previous works about abstractive summarization, unfaithfulness usually only refers to the summary being not entailed by the source (Maynez et al., 2020; Kryscinski et al., 2020). The formal definition of entailment is: t entails h if, typically, a human reading t would infer that h is most likely true (Dagan et al., 2005). While we also consider being not entailed as one of the unfaithfulness problems, we will show that there is still a chance to be unfaithful despite being entailed by the source.

Figure 2 (error type, definition, rationale):
Incorrect Coreference. Definition: An anaphora in the summary refers to a different entity from what the same anaphora refers to in the document. The anaphora can be a pronoun (they, she, he, it, this, that, those, these, them, her, him, their, her, his, etc.) or a 'determiner (the, this, that, these, those, both, etc.) + noun' phrase. Rationale: Not-entailment.
Incomplete Coreference. Definition: An anaphora in the summary has ambiguous or no antecedent. Rationale: Ambiguous interpretation.
Incorrect Discourse. Definition: A sentence with a discourse linking term (e.g., but, and, also, on one side, meanwhile, etc.) or a discourse unit (usually appears as a sub-sentence) falsely connects to the following or preceding context in the summary, which leads the audience to infer a non-existing fact, relation, etc. Rationale: Not-entailment.
Incomplete Discourse. Definition: A sentence with a discourse linking term or a discourse unit has no necessary following or preceding context to complete the discourse. Rationale: Ambiguous interpretation.
Other Misleading Information. Definition: Other misleading problems include but are not limited to leading the audience to expect a different consequence and conveying a dramatically different sentiment. Rationale: Bias and wrong impression.

Hence, we call the five error types we define here the 'broad' unfaithfulness problems, and we provide a rationale for each error type in Figure 2. The most frequent unfaithfulness problem of abstractive summarization is the presence of incorrect entities or predicates (Gabriel et al., 2021; Pagnoni et al., 2021), which can never happen within extracted sentences (or elementary discourse units). For extractive summarization, the problems can only happen 'across' sentences (or units). Hence, we first define four error types about coreference and discourse. Following SemEval-2010 (Màrquez et al., 2013), we define coreference as the mention of the same textual references to an object in the discourse model, and we focus primarily on anaphoras that require finding the correct antecedent. We ground our discourse analysis for systems that extract sentences primarily in the Penn Discourse Treebank (Prasad et al., 2008), which considers the discourse relation between sentences as "lexically grounded". For example, the relations can be triggered by subordinating conjunctions (because, when, etc.), coordinating conjunctions (and, but, etc.), and discourse adverbials (however, as a result, etc.). We refer to such words as discourse linking terms. For systems that take elementary discourse units as the minimal selection unit, we follow the Rhetorical Structure Theory (Mann and Thompson, 1988) and assume every unit potentially requires another unit to complete the discourse.
Finally, inspired by the concept of misinformation (incorrect or misleading information presented as fact), we define the fifth error type, misleading information, which captures other misleading problems besides the other four errors. The detailed definitions of the five error types are as follows:
Incorrect coreference happens when the same anaphora refers to different entities in the summary and in the document. The anaphora can be a pronoun (they, she, he, it, etc.) or a 'determiner (the, this, that, etc.) + noun' phrase. This error makes the summary not entailed by the source, which is a clear unfaithfulness problem. An example is shown in Summary 1 of Figure 1, where the mention "that" refers to the sentences "But they do leave their trash. Thousands of pounds of it" in the document but incorrectly refers to "Most climbers who try don't succeed in summiting the 29,035-foot-high Mount Everest" in the summary. Therefore, users who only read the summary may think there is some connection between cleaning up trash and the fact that most climbers do not succeed in summiting Mount Everest.
Incomplete coreference happens when an anaphora in the summary has ambiguous or no antecedent. Following the formal definition of entailment, these examples are considered to be entailed by the document. Nonetheless, such a summary can still be unfaithful, as it leads to 'ambiguous interpretation'. For example, given the source "Jack eats an orange. John eats an apple" and the sentence "He eats an apple," the faithfulness of the sentence depends entirely on who "he" is. Figure 1 illustrates an example of incomplete coreference, where Summary 2 starts with "that's why", but readers of that summary do not know the actual reason. Please refer to Figure 4 in the Appendix for another example with a dangling pronoun and ambiguous antecedents.
Incorrect discourse happens when a sentence with a discourse linking term (e.g., but, and, also, on one side, meanwhile, etc.) or a discourse unit (usually appears as a sub-sentence) falsely connects to the following or preceding context in the summary, which leads the audience to infer a non-existing fact, relation, etc. An example of incorrect discourse is shown by Summary 2 in Figure 1, where "More than 200 climbers have died" falsely connects to "to clean up the trash", which makes readers believe 200 climbers have died because of cleaning up the trash. But in fact, they died attempting to climb the peak. This summary is also clearly not entailed by the source.
Incomplete discourse happens when a sentence with a discourse linking term or a discourse unit has no necessary following or preceding context to complete the discourse. Similar to incomplete coreference, summaries with this error are considered entailed, but the broken discourse makes the summary look suspicious and confusing and thus may lead to problematic interpretations. An example is shown in Figure 1. Summary 3 starts with "but", so readers expect to know what this contrast responds to, but it is never mentioned. Please refer to Figure 5 for another example of incomplete discourse that may leave readers with a wrong impression.
Other misleading information refers to other misleading problems besides the other four error types. It includes but is not limited to leading the audience to expect a different consequence and conveying a dramatically different sentiment. This error is also difficult to capture using the entailment-based definition. Summaries always select partial content from the source; sometimes, however, the selection can mislead or bias the audience. Gentzkow et al. (2015) show that filtering and selection can result in 'media bias'. We define this error type so that annotators can freely express whether they think the summary has some bias or leaves them with a wrong impression. The summary in Figure 6 is labeled as misleading by two annotators because it may mislead the audience to believe that the football players and pro wrestlers won the contest and ate 13 pounds of steak.

Human Evaluation
In this section, we describe how we ask humans to find and annotate the unfaithfulness problems.

Data
We randomly select 100 articles from the CNN/DM dataset (Hermann et al., 2015) because it is a widely used benchmark for single-document English summarization, and extractive methods perform decently well on it. We use 15 extractive systems (as introduced below) to produce summaries for each article, resulting in a total of 1500 summaries. For all extracted sentences (or units), we retain their order in the document as the order in the summary.
Supervised systems: We use 9 supervised extractive systems. (1) Oracle: the method maximizes the ROUGE-1-F1 (Lin, 2004) between the extracted summary and the ground-truth summary; (2) Oracle (discourse) (Xu et al., 2020): another oracle system that extracts discourse units instead of sentences to maximize ROUGE while sticking to discourse constraints; (3) RNN Ext RL (Chen and Bansal, 2018); (4) BanditSumm [...] He, 2021). We implement Lead3 and use the released code of PacSum. For Textrank, we use the summa package. For MI-unsup, we directly use the system outputs open-sourced by the authors. It is worth noting that even though among the 15 systems only Oracle (discourse) (Xu et al., 2020) explicitly takes the discourse structure (the Rhetorical Structure Theory graph) into consideration, some of the other systems also implicitly model the discourse of the document, e.g., HeterGraph (Wang et al., 2020b) builds a graph of sentences based on word overlap.

Setup
We ask humans to label unfaithfulness problems out of the 1500 system summaries. The annotation interface (HTML page) is shown in Figure 7 in the Appendix. It first shows the summary and the document. The summary sentences are also underlined in the document. To help with annotation, we run a state-of-the-art coreference resolution model, SpanBERT (Joshi et al., 2020), via AllenNLP (v2.4.0) (Gardner et al., 2018), on the summary and the document respectively. Then, mentions from the same coreference cluster will be shown in the same color. Since the coreference model can make mistakes, we ask annotators to use its predictions with caution.
Annotators are asked to judge whether the summary has each of the five types of unfaithfulness via five yes or no questions and if yes, justify the choice by pointing out the unfaithful parts. Details of the annotation can be found in Appendix C.
Four annotators, two of the authors (PhD students trained in NLP/CL) and two other CS undergraduate students (researchers in NLP/CL), conducted all annotations carefully over about 3 months. Each of the 1500 summaries was first labeled by two annotators independently. Then, they worked together to resolve their differences in annotating incorrect coreference, incomplete coreference, incorrect discourse, and incomplete discourse because these errors have little subjectivity, and thus agreement can be achieved. The judgment of misleading information is more subjective. Hence, each annotator independently double-checked examples that they labeled no while their partner labeled yes, with their partner's answers shown to them. They did not have to change their mind if they did not agree with their partner. This step is meant to make sure nothing is missed by accident. After that, we keep both answers. In total, 140 examples have at least one misleading label, out of which 74 examples have both annotators' misleading labels. In our analysis, we only view a summary as misleading when both annotators labeled yes, regardless of the fact that they may have different reasons.
To avoid one issue in the summary being identified as multiple types of errors, we give different priorities to different error types: incorrect coreference = incorrect discourse > incomplete coreference = incomplete discourse > other misleading information. In other words, if an issue is labeled as one type, it will not be labeled for other equal- or lower-priority types.

Results of Human Evaluation
Finally, we find that 457 out of 1500 (32.6%) summaries contain at least one of the five problems. Out of all the summaries, 60 (4%) summaries contain incorrect coreferences, 232 (15.5%) summaries have incomplete coreferences, 18 (1.2%) summaries have incorrect discourses, 164 (10.9%) have incomplete discourses, and 74 (4.9%) summaries are misleading. The error breakdowns for each system are illustrated in Figure 3. Note that one summary can have multiple problems. That's why the Oracle (discourse) system in Figure 3 has more than 100 errors.
The nature of different models gives them different chances of creating unfaithfulness problems. For example, the Lead3 system has the fewest problems because the first three sentences of the document usually have an intact discourse, except in a few cases where one more sentence is required to complete the discourse. In contrast, the two Oracle systems have the most problems. The Oracle model often extracts sentences from the middle part of the document, which gives it a higher chance of causing dangling anaphora or discourse linking. The Oracle (discourse) model produces the most incorrect discourses because concatenating elementary discourse units together increases the risk of misleading context. Cao et al. (2018) show that about 30% of abstractive summaries generated for CNN/DM (Hermann et al., 2015) are not entailed by the source. Also on CNN/DM, the FRANK benchmark (Pagnoni et al., 2021) finds that about 42% of abstractive summaries are unfaithful, including both entity/predicate errors and coreference/discourse/grammar errors. Compared to these findings, extractive summarization apparently has fewer issues. We do note, however, that the quantity is not negligible, i.e., extractive ≠ faithful. For example, HeterGraph, one of the state-of-the-art systems, produces 24 unfaithfulness problems out of 100 summaries, which is not a small percentage.

Automatic Evaluation
Next, we analyze whether existing automatic faithfulness evaluation metrics can detect unfaithful extractive summaries. We additionally propose a new evaluation approach, EXTEVAL.

Meta-Evaluation Method
To evaluate automatic faithfulness evaluation metrics (i.e., meta-evaluation) for extractive summarization, we follow the faithfulness evaluation literature of abstractive summarization (Durmus et al., 2020;Wang et al., 2020a;Pagnoni et al., 2021) and compute the correlations between metric scores and human judgment on our meta-evaluation dataset (i.e., the 1500 examples). Though one summary can have multiple incomplete coreferences or incomplete discourses (Appendix C), for simplicity, we take the binary (0 or 1) label as the human judgment of each error type. In addition, we introduce an Overall human judgment by taking the summation of the five error types. So, the maximum of Overall score is 5. We use Pearson r or Spearman ρ as the correlation measure.
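The correlation computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the label keys in `ERROR_TYPES`, the toy label rows, and the toy metric scores are all made up for the example, and Pearson r and Spearman ρ are written out in plain Python so the block has no dependencies. Scores of a "higher is more faithful" metric are negated before correlating, so that higher values mean more unfaithful, like the human labels.

```python
import math

# Hypothetical key names for the five binary human labels.
ERROR_TYPES = (
    "incorrect_coref", "incomplete_coref",
    "incorrect_disco", "incomplete_disco", "misleading",
)

def overall_score(labels):
    """Sum the five binary (0/1) labels; the result lies between 0 and 5."""
    return sum(labels[name] for name in ERROR_TYPES)

def pearson(xs, ys):
    """Pearson r of two equal-length sequences (assumes neither is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rho, computed as the Pearson r of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Toy data (made up): four summaries, five binary labels each.
toy_rows = [(0, 0, 0, 0, 0), (0, 1, 0, 1, 0), (1, 1, 0, 0, 0), (0, 0, 0, 0, 1)]
overall = [overall_score(dict(zip(ERROR_TYPES, row))) for row in toy_rows]

# Made-up scores from a metric where higher means more faithful.
toy_metric = [0.95, 0.80, 0.78, 0.90]
negated = [-s for s in toy_metric]

print("Pearson r:", pearson(negated, overall))
print("Spearman rho:", spearman(negated, overall))
```

With real data, the same two functions would be applied once per error type (using the binary label as the human judgment) and once for the Overall score.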
This meta-evaluation method is essentially assessing how well the metric can automatically detect unfaithful summaries, which is practically useful. For example, we can pick out summaries with high unfaithfulness scores and ask human editors to fix these summaries before using them. One underlying assumption is that the metric score is comparable across examples. However, some metrics are example-dependent (i.e., one example's score of 0.5 ≠ another example's score of 0.5). For instance, ROUGE is influenced by summary length (Sun et al., 2019). In practice, we do not observe any significant effect of example dependence on our correlation computation.
To understand the correlation without example-dependence issues, we provide two alternative evaluations: system-level and summary-level correlations, which have been reported in a number of previous works (Peyrard et al., 2017; Bhandari et al., 2020; Deutsch et al., 2021; Zhang and Bansal, 2021). These two correlations assess the effectiveness of the metrics to rank systems. We define the correlations and present the results in Appendix A.

Existing Faithfulness Evaluation Metrics
In faithfulness evaluation literature, a number of metrics have been proposed for abstractive summarization. They can be roughly categorized into two groups: entailment classification and question generation/answering (QGQA). Some of them assume that the extractive method is inherently faithful.
We choose FactCC (Kryscinski et al., 2020) and DAE (Goyal and Durrett, 2020) as representative entailment classification metrics. However, since they are designed to check whether each sentence or dependency arc is entailed by the source, we suspect that they cannot detect discourse-level errors. QuestEval (Scialom et al., 2021) is a representative QGQA metric, which theoretically can detect incorrect coreference because QG considers the long context of the summary and the document. We also explore BERTScore Precision (Zhang et al., 2020b), which has been shown to correlate well with human judgment of faithfulness (Pagnoni et al., 2021; Fischer, 2021), as well as ROUGE-2-F1 (Lin, 2004). Details of these metrics can be found in Appendix D.
Note that for all metrics except for DAE, we negate their scores before computing human-metric correlations because we want them to have higher scores when the summary is more unfaithful, just like our human labels. Table 4 in the Appendix shows their original scores for the 15 systems.

A New Metric: EXTEVAL
Finally, we introduce EXTEVAL, a simple rule-based metric that is specifically designed for detecting unfaithful extractive summaries. Corresponding to the faithfulness problems defined in Section 2, EXTEVAL is composed of four sub-metrics described as follows. We refer the readers to Appendix E for more details.
INCORCOREFEVAL focuses on detecting incorrect coreferences. Taking advantage of the model-predicted coreference clusters by SpanBERT described in Section 3.2, we consider a different cluster mapping of the same mention in the document and in the summary to be an incorrect coreference.
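One way to read this rule is sketched below. It is a simplified illustration, not the authors' implementation (see Appendix E for that): it assumes every mention is identified by a hypothetical key `(document_sentence_index, start_token, end_token)`, that summary mentions have already been mapped back to document coordinates, and that a mention absent from all document clusters counts as its own singleton. The summary is flagged when two mentions share a summary cluster but not a document cluster.

```python
from itertools import combinations

def cluster_index(clusters):
    """Map each mention key to the index of the cluster that contains it."""
    return {mention: i for i, cluster in enumerate(clusters) for mention in cluster}

def incor_coref_eval(doc_clusters, summ_clusters):
    """Return 1 if some summary cluster links two mentions that the
    document-side clusters keep apart, else 0.

    Both arguments are lists of sets of mention keys (an assumed format).
    """
    doc_of = cluster_index(doc_clusters)

    def doc_id(mention):
        # Assumption: unclustered document mentions are treated as singletons.
        return doc_of.get(mention, ("unclustered", mention))

    for cluster in summ_clusters:
        for a, b in combinations(sorted(cluster), 2):
            if doc_id(a) != doc_id(b):
                return 1
    return 0

# Toy example: in the document, the mention in sentence 2 corefers with a
# mention in sentence 1, but in the summary it is linked to sentence 0.
doc_clusters = [{(1, 0, 1), (2, 3, 4)}]
bad_summary = [{(0, 5, 8), (2, 3, 4)}]
good_summary = [{(1, 0, 1), (2, 3, 4)}]
print(incor_coref_eval(doc_clusters, bad_summary), incor_coref_eval(doc_clusters, good_summary))
```

Because the check inherits every mistake of the coreference model, its output is best treated as a hint rather than a verdict.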
INCOMCOREFEVAL can detect incomplete coreferences. We also make use of the model-predicted coreference clusters. If the first mention to appear in a summary cluster is a pronoun or a 'determiner + noun' phrase, and it is not the first mention in the corresponding document cluster, then the summary is considered to have an incomplete coreference.
INCOMDISCOEVAL is primarily designed to detect incomplete discourse. Concretely, we check for sentences with discourse linking terms and incomplete discourse units. We consider the summary to have a problem if a discourse linking term is present but its necessary context (the previous or next sentence) is missing, or if a discourse unit misses its previous unit in the same sentence. It is important to note that the detected errors also include incorrect discourse; the metric cannot distinguish between these two errors.
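The two rules above can be sketched in a few lines. This is a rough illustration under several assumptions that are not taken from the paper: the word lists are short hypothetical samples, a mention is a `(key, text)` pair, `doc_first_mention` is a hypothetical lookup from a mention key to the first mention of its document cluster, and the discourse check only covers sentence-initial linking terms whose previous document sentence was not extracted (discourse units and "next sentence" contexts are left out).

```python
# Hypothetical, deliberately short word lists for illustration only.
PRONOUNS = {"they", "she", "he", "it", "this", "that", "those", "these",
            "them", "her", "him", "their", "his"}
DETERMINERS = {"the", "this", "that", "these", "those", "both"}
LINKING_TERMS = ("but", "and", "also", "however", "meanwhile", "as a result")

def is_anaphor(text):
    """True for a lone pronoun or a phrase that starts with a determiner."""
    tokens = text.lower().split()
    if not tokens:
        return False
    if len(tokens) == 1:
        return tokens[0] in PRONOUNS
    return tokens[0] in DETERMINERS

def incom_coref_eval(summ_clusters, doc_first_mention):
    """Return 1 if a summary cluster opens with an anaphor that is not the
    first mention of its document cluster, else 0.

    summ_clusters: list of clusters, each a list of (key, text) pairs in
    order of appearance in the summary (assumed format).
    """
    for cluster in summ_clusters:
        key, text = cluster[0]
        if is_anaphor(text) and doc_first_mention.get(key, key) != key:
            return 1
    return 0

def starts_with_linking_term(sentence):
    lowered = sentence.lower().lstrip()
    return any(lowered == term or lowered.startswith((term + " ", term + ","))
               for term in LINKING_TERMS)

def incom_disco_eval(doc_sentences, selected):
    """Return 1 if an extracted sentence opens with a linking term while the
    preceding document sentence was not extracted, else 0."""
    chosen = set(selected)
    for i in selected:
        if i > 0 and (i - 1) not in chosen and starts_with_linking_term(doc_sentences[i]):
            return 1
    return 0

doc = ["Most climbers do not summit Everest.",
       "But they do leave their trash.",
       "Thousands of pounds of it."]
print(incom_disco_eval(doc, [1, 2]), incom_disco_eval(doc, [0, 1]))
```

As in the paper's description, a positive discourse flag here does not say whether the problem is an incomplete or an incorrect discourse.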
SENTIBIAS evaluates how different the summary sentiment is from the document sentiment. Sentiment bias is easier to quantify than other misleading problems. We use the RoBERTa-based (Liu et al., 2019) sentiment analysis model from AllenNLP (Gardner et al., 2018) to predict the sentiment of each source sentence. We take the average of sentence sentiments as the overall sentiment of the document and the summary, respectively. Then, sentiment bias is measured by the absolute difference between summary sentiment and document sentiment. We also test sentiment analysis tools from Stanza (Qi et al., 2020) and Google Cloud API, but they do not work better (see Appendix B).
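The arithmetic behind this measure is small enough to sketch directly. The example assumes that some sentiment model has already produced one score per sentence in the range 0 to 1 (for instance, a positive-class probability; that choice is an assumption, as are the toy numbers), and it only shows the averaging and the absolute difference.

```python
def mean(values):
    return sum(values) / len(values)

def senti_bias(doc_sentiments, summ_sentiments):
    """Absolute difference between the mean sentence sentiment of the
    summary and that of the document.

    Both arguments are non-empty lists of per-sentence scores in [0, 1],
    so the result also lies in [0, 1].
    """
    return abs(mean(summ_sentiments) - mean(doc_sentiments))

# Toy scores (made up): a mostly negative document whose single extracted
# sentence happens to be the most positive one.
doc_scores = [0.2, 0.4, 0.9]
summ_scores = [0.9]
print(senti_bias(doc_scores, summ_scores))
```

Because the summary sentences are extracted from the document, the summary scores can simply be the document scores of the selected sentences.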
EXTEVAL is simply the summation of the above sub-metrics, i.e., EXTEVAL = INCORCOREFEVAL + INCOMCOREFEVAL + INCOMDISCOEVAL + SENTIBIAS. Same as human scores, we make INCORCOREFEVAL, INCOMCOREFEVAL, and INCOMDISCOEVAL binary (0 or 1) scores, while SENTIBIAS is a continuous number between 0 and 1. EXTEVAL roughly corresponds to the Overall human judgment introduced in Section 4.1.

Meta-Evaluation Results
Table 1 shows the human-metric correlations. First, out of the five existing metrics, BERTScore in general works best and has small to moderate (Cohen, 1988) correlations with human judgment, while the other metrics have small or no correlations with human labels. Considering the fact that all these five errors can also happen in abstractive summarization, existing faithfulness evaluation metrics apparently leave these errors behind. Second, the four sub-metrics of EXTEVAL (INCORCOREFEVAL, INCOMCOREFEVAL, INCOMDISCOEVAL, and SENTIBIAS) all demonstrate better performance than other metrics at detecting their corresponding problems. Lastly, our EXTEVAL has moderate to large (Cohen, 1988) correlations with the Overall judgment, which is substantially better than all other metrics. Table 2 in Appendix A shows the system-level and summary-level correlations of all metrics with human judgment. Our EXTEVAL still has the best Pearson correlations with the Overall human score on both the system level and the summary level. See Appendix A for more discussions.
In summary, our EXTEVAL is better at identifying unfaithful extractive summaries than the 5 existing metrics we compare to. Its four sub-metrics can be used independently to examine the corresponding unfaithfulness problems. In addition, the problems automatically found by EXTEVAL can serve as useful hints for humans as well as near-extractive systems to fix the summary.
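As a small illustration of how the combined score might be used to surface summaries for human review, the sketch below adds three binary flags and one sentiment-bias value in [0, 1], then sorts summaries by the total. The function names, the summary identifiers, and the toy values are hypothetical; the sub-metric values are assumed to have been computed elsewhere.

```python
def ext_eval(incor_coref, incom_coref, incom_disco, senti_bias):
    """Sum of three binary flags (0 or 1) and a sentiment bias in [0, 1]."""
    for flag in (incor_coref, incom_coref, incom_disco):
        if flag not in (0, 1):
            raise ValueError("binary sub-metrics must be 0 or 1")
    if not 0.0 <= senti_bias <= 1.0:
        raise ValueError("sentiment bias must lie in [0, 1]")
    return incor_coref + incom_coref + incom_disco + senti_bias

def rank_for_review(sub_scores):
    """Order summary ids from most to least suspicious by the combined score.

    sub_scores maps a summary id to a tuple
    (incor_coref, incom_coref, incom_disco, senti_bias).
    """
    totals = {sid: ext_eval(*scores) for sid, scores in sub_scores.items()}
    return sorted(totals, key=totals.get, reverse=True)

# Toy values (made up) for three summaries.
toy = {
    "summary_a": (0, 0, 0, 0.05),
    "summary_b": (1, 0, 1, 0.25),
    "summary_c": (0, 1, 0, 0.10),
}
print(rank_for_review(toy))
```

Each flag can also be inspected on its own, which tells an editor which kind of fix (for example, restoring an antecedent or a preceding sentence) is likely needed.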

Discussion & Limitation
Table 1: Human-metric correlations. The negative sign (-) before metrics means that their scores are negated to retain the feature that the higher the scores are, the more unfaithful the summaries are.
To resolve these unfaithfulness problems for extractive summarization, it is inevitable to make abstractive edits. Incorrect and incomplete references need to be replaced by the actual entities. Unnecessary discourse linking terms should be removed and necessary context needs to be included. To avoid misleading information, the model is supposed to merge similar opinions and include diverse content. However, as discussed in Section 3.3, current abstractive summarization produces considerably more unfaithful information than extractive methods. We believe that a middle-ground approach is decontextualization (Choi et al., 2021). Decontextualization makes extracted sentences stand alone out of their original context, which is achieved by retaining the sentences' semantics while making necessary and minimal edits so that the extracted sentences are interpretable out of context (i.e., near-extractive). It relates to some early summarization works that develop revision systems to revise model extracts (Okumura, 2000; Hasler, 2007). Decontextualization is also conceptually similar to a number of summarization methods that first extract sentences (or words/phrases) from the source and then generate the summary on top of the extracts (Chen and Bansal, 2018; Gehrmann et al., 2018; Li et al., 2021). However, decontextualization requires minimal edits to make the summary stand-alone, while hybrid methods' abstractors usually do not have this constraint.
Last but not the least, it is worth noting that all of the five error types we define in Section 2 can also happen in abstractive summarization, though they are less studied and measured in the literature. To our best knowledge, FRANK (Pagnoni et al., 2021) and SNaC (Goyal et al., 2022) have discussed the coreference and discourse errors in the abstractive summaries. We are not aware of works that have studied the misleading information error in summaries. We thus hope that our taxonomy can shed some light for future works to explore the broad unfaithfulness of all summarization methods.
Our work focuses on extractive summarization. Therefore, the conclusions will be more useful for summarization tasks where extractive summaries perform decently well (e.g., CNN/DM (Hermann et al., 2015)) compared to extremely abstractive summarization tasks (e.g., XSum (Narayan et al., 2018a)). Even though all five types of problems defined in Section 2 can also appear in abstractive summarization, our EXTEVAL is designed for extractive summarization, which is currently not applicable (but can be adapted) for abstractive summaries except SENTIBIAS.

Conclusion
We conducted a systematic analysis of broad unfaithfulness problems in extractive summarization, through which we want to stress that extractive does not equal faithful. We proposed five types of broad unfaithfulness problems and produced a human-labeled evaluation set consisting of 1500 examples. We found that (i) 32.6% of the summaries have at least one of the five issues, (ii) existing metrics correlate poorly with human judgment, and (iii) our proposed faithfulness evaluation metric EXTEVAL, which is specifically designed for this task, performs the best at identifying these problems. We hope this work can raise awareness of additional unfaithfulness issues so that future work can also resolve these issues when developing more faithful summarization models.

Table 2: System-level and summary-level correlations. A negative sign (-) before a metric means that its scores are negated, so that higher scores always indicate more unfaithful summaries.

We assume there are $N$ documents and $S$ systems in the meta-evaluation dataset, and we write $m(s_{ij})$ and $h(s_{ij})$ for the metric score and the human score of the summary that system $j$ produces for document $i$. The system-level correlation is defined as follows:

$$K^{sys}_{m,h} = K\left(\left[\frac{1}{N}\sum_{i=1}^{N} m(s_{i1}), \ldots, \frac{1}{N}\sum_{i=1}^{N} m(s_{iS})\right], \left[\frac{1}{N}\sum_{i=1}^{N} h(s_{i1}), \ldots, \frac{1}{N}\sum_{i=1}^{N} h(s_{iS})\right]\right)$$

In our case, $N = 100$ and $S = 15$. We use Pearson $r$ or Spearman $\rho$ as the correlation measure $K$.
Summary-level correlation evaluates whether the metric can reliably compare summaries generated by different systems for the same document. Using the same notation as above, it is written as:

$$K^{sum}_{m,h} = \frac{1}{N}\sum_{i=1}^{N} K\left(\left[m(s_{i1}), \ldots, m(s_{iS})\right], \left[h(s_{i1}), \ldots, h(s_{iS})\right]\right)$$

System-level correlations have more complicated trends than the results in Table 1. We think this is because both system-level and summary-level correlations are computed between two vectors of length 15 (15 systems), whereas the meta-evaluation method used in the main paper computes correlations between two vectors of length 1500 (1500 examples). A smaller sample size therefore causes a larger variance. This is especially true for system-level correlations because, following the definitions above, the summary-level correlation ($K^{sum}_{m,h}$) averages across $N$ (in our case, $N = 100$), which helps reduce the variance. Nevertheless, our EXTEVAL achieves the best Pearson correlations with the Overall human judgment on the system level.
To summarize, our EXTEVAL has the best Pearson correlations and close-to-the-best Spearman correlations with the Overall human judgment on both the system and summary levels. This means that our new metric can rank extractive systems well based on how unfaithful they are.
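As an illustration of the difference between the two correlation types discussed above, the following is a minimal Python sketch, not the code used in our experiments. It assumes that scores are stored as nested lists indexed by document and then by system, and it uses a hand-written Pearson correlation; the function names and the toy numbers are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson r between two equal-length, non-constant score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def system_level(metric, human, corr=pearson):
    """metric[i][j] and human[i][j] are the scores for document i, system j.

    Scores are averaged over documents first, and the two resulting
    vectors (one entry per system) are then correlated once.
    """
    n_systems = len(metric[0])
    m_sys = [mean(row[j] for row in metric) for j in range(n_systems)]
    h_sys = [mean(row[j] for row in human) for j in range(n_systems)]
    return corr(m_sys, h_sys)


def summary_level(metric, human, corr=pearson):
    """Correlate across systems within each document, then average."""
    return mean(corr(m_row, h_row) for m_row, h_row in zip(metric, human))


# Toy data (made up): 3 documents, 3 systems.
metric = [[0.1, 0.5, 0.9], [0.2, 0.4, 0.6], [0.9, 0.5, 0.1]]
human = [[0, 1, 2], [1, 2, 3], [0, 1, 2]]
print(system_level(metric, human), summary_level(metric, human))
```

In this toy example the metric ranks the systems of the third document in the opposite order to the human scores, which lowers the summary-level correlation while the averaged system-level scores still agree. The sketch does not handle documents whose scores are constant across systems, for which the Pearson correlation is undefined.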

B Alternative Sentiment Analysis Tools
In the main paper, we use the sentiment analysis tool from AllenNLP (v2.4.0) (Gardner et al., 2018) to implement the SENTIBIAS sub-metric of EXTEVAL. Here, we test two other sentiment analysis tools, from Stanza (Qi et al., 2020) and the Google Cloud API,16 respectively. We also try an ensemble method that averages their output scores. Table 3 shows the performance. AllenNLP works better than the other two tools, and the ensemble does not improve the performance either.

C Human Evaluation Details
We did not label the data on Amazon Mechanical Turk because we think that understanding the concepts of coreference and discourse requires some background knowledge of linguistics and NLP. Figure 7 shows the annotation interface and an example annotation. We ask the expert annotators to justify their answer whenever they think an unfaithfulness problem exists. Specifically, if they think the summary has incorrect coreferences, they need to further specify the sentence indices and the mentions. For example, "s2-he" means that "he" in the second summary sentence is problematic. Meanwhile, they need to justify their answer by explaining why "s2-he" is an incorrect coreference. For incomplete coreference, annotators also need to specify the sentence indices plus mentions, but no explanation is required because it can always be "the mention has no clear antecedent." For incorrect discourse, they need to specify sentence indices and justify their choice. For incomplete discourse, they only need to specify sentence indices. We find that many summaries have multiple incomplete coreference or discourse issues. Annotators need to label all of them, separated by ",", e.g., "s2-he, s3-the man". Lastly, besides these four errors, if they think the summary can still mislead the audience, we ask them to provide an explanation to support it.

16 https://cloud.google.com/apis/docs/overview
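The label format described above ("s2-he, s3-the man") is simple enough to parse automatically. The following Python sketch shows one possible way to do so; the function name is hypothetical, and the sketch assumes that mentions never contain a comma, which the description above does not guarantee.

```python
import re


def parse_mentions(annotation):
    """Parse labels such as 's2-he, s3-the man' into (sentence index, mention) pairs."""
    items = []
    for part in annotation.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate trailing commas or empty fields
        match = re.fullmatch(r"s(\d+)-(.+)", part)
        if match is None:
            raise ValueError(f"unrecognized annotation: {part!r}")
        items.append((int(match.group(1)), match.group(2).strip()))
    return items


print(parse_mentions("s2-he, s3-the man"))
```

Sentence indices are kept 1-based, as in the annotation format, so "s2" maps to the second summary sentence.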

D Faithfulness Metric Details
We select the following representative metrics to assess whether they can help to detect unfaithful summaries for extractive summarization. Unless otherwise stated, we use the original code provided by the official repository.
ROUGE (Lin, 2004) is not designed for faithfulness evaluation; instead, it is the most widely used content selection evaluation metric for summarization. Although it has been shown that ROUGE correlates poorly with human judgments of faithfulness (Maynez et al., 2020), we explore whether this still holds in the extractive case. We only report ROUGE-2-F1 because the other variants show similar trends. We use the implementation from the Google Research GitHub repository.17
FactCC (Kryscinski et al., 2020) is an entailment-based metric trained on a synthetic corpus consisting of source sentences as faithful summaries and perturbed source sentences as unfaithful ones. This means that FactCC inherently treats each source sentence as faithful. At evaluation time, the final score is the average of the scores of the individual summary sentences.
DAE (Goyal and Durrett, 2020) is also entailment-based and evaluates whether each dependency arc in the summary is entailed by the document. The final score is the average of the arc-level entailment labels. DAE is similarly trained on a synthetic dataset compiled from paraphrasing. Since dependency arcs lie within sentences, DAE can also hardly detect discourse-level unfaithfulness issues.
QuestEval (Scialom et al., 2021) is an F1-style QGQA metric for both faithfulness and content selection evaluation. It first generates questions from both the document and the summary. Then, it answers the questions derived from the summary using the document (i.e., precision) and answers the questions derived from the document using the summary (i.e., recall). The final score is their harmonic mean (i.e., F1). In theory, QuestEval can detect incorrect coreference because QG considers the long context of the summary and the document. However, it may not be able to capture the other three types of errors.
BERTScore (Zhang et al., 2020b) is a general evaluation metric for text generation. It computes token-level cosine similarities between two texts using BERT (Devlin et al., 2019). Some previous work (Pagnoni et al., 2021; Fischer, 2021) has shown that its precision score between the summary and the source (i.e., how much of the summary information is similar to that in the document) correlates well with the summary's faithfulness. We hypothesize that BERTScore is able to capture more general discourse-level errors because of the contextualized representations from BERT.
Table 4 shows the metric scores as well as the human Overall score of the 15 systems we study in this work. Scores are computed only on the 100 CNN/DM test examples we use, and the system score is the average of the example scores.
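To make the BERTScore precision between the summary and the source more concrete, the following Python sketch computes it with greedy matching: each summary token is matched to its most similar source token, and the best similarities are averaged. The two-dimensional toy vectors are made up and merely stand in for contextual BERT embeddings; optional features of the official implementation, such as idf weighting and baseline rescaling, are left out, and the function names are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def bertscore_precision(summary_vecs, source_vecs):
    """Average, over summary tokens, of the best cosine match among source tokens."""
    best = [max(cosine(s, d) for d in source_vecs) for s in summary_vecs]
    return sum(best) / len(best)


# Toy token embeddings (made up), one vector per token.
summary_vecs = [[1.0, 0.0], [0.0, 1.0]]
source_vecs = [[1.0, 0.0], [1.0, 1.0]]
print(bertscore_precision(summary_vecs, source_vecs))
```

Here the first summary token has an exact match in the source, while the second only has a partial match, so the precision falls between the two similarities.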

E EXTEVAL Details
For INCOMCOREFEVAL, the list of pronouns we use includes they, she, he, it, this, that, those, these, them, her, him, their, her, his, and the list of determiners includes the, that, this, these, those, both. These lists only contain terms that appear frequently in our dataset and are not exhaustive.
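The two word lists above could be used to pick out mentions that may lack an antecedent. The Python sketch below is only an illustration under our own assumptions: it treats a mention as a candidate if it is exactly a listed pronoun or a multi-word phrase that starts with a listed determiner. Deciding whether the antecedent is actually missing from the summary is not covered by this sketch, and the function name is hypothetical.

```python
# Word lists copied from the text above (duplicates collapse in a set).
PRONOUNS = {"they", "she", "he", "it", "this", "that", "those", "these",
            "them", "her", "him", "their", "his"}
DETERMINERS = {"the", "that", "this", "these", "those", "both"}


def is_candidate_mention(mention):
    """True if the mention is a listed pronoun or starts with a listed determiner."""
    tokens = mention.lower().split()
    if not tokens:
        return False
    if len(tokens) == 1:
        return tokens[0] in PRONOUNS
    # Assumption: a multi-word mention counts only if a determiner opens it.
    return tokens[0] in DETERMINERS


for mention in ["he", "the man", "Danielle Busby"]:
    print(mention, is_candidate_mention(mention))
```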
The list of linking terms for INCOMDISCOEVAL includes and, so, still, also, however, but, clearly, meanwhile, not only, not just, on one side, on another, then, moreover. Similarly, the list is not exhaustive, and we only keep frequent terms.
Figure 4 and Figure 5 show two additional examples of incomplete coreference and incomplete discourse, respectively. Figure 6 shows a misleading information example.
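As an illustration of how the linking-term list might be applied, the Python sketch below checks whether a summary sentence opens with one of the listed terms. Matching only at the start of the sentence and ignoring case are our assumptions for this sketch, and the function name is hypothetical; whether the flagged sentence really lacks its necessary context has to be decided separately.

```python
import re

# Linking terms copied from the text above.
LINKING_TERMS = ("and", "so", "still", "also", "however", "but", "clearly",
                 "meanwhile", "not only", "not just", "on one side",
                 "on another", "then", "moreover")


def leading_linking_term(sentence):
    """Return the listed linking term that opens the sentence, or None."""
    text = sentence.strip().lower()
    # Try longer terms first so that multi-word terms are matched as a whole.
    for term in sorted(LINKING_TERMS, key=len, reverse=True):
        # The word boundary keeps "and" from matching "android".
        if re.match(re.escape(term) + r"\b", text):
            return term
    return None


print(leading_linking_term("And sure enough, in the end, it wasn't even close."))
print(leading_linking_term("Most of the penalty amounts to forced spending."))
```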

Document:
(CNN) The California Public Utilities Commission on Thursday said it is ordering Pacific Gas & Electric Co. to pay a record $1.6 billion penalty for unsafe operation of its gas transmission system, including the pipeline rupture that killed eight people in San Bruno in September 2010. Most of the penalty amounts to forced spending on improving pipeline safety. Of the $1.6 billion, $850 million will go to "gas transmission pipeline safety infrastructure improvements," the commission said. Another $50 million will go toward "other remedies to enhance pipeline safety," according to the commission. "PG&E failed to uphold the public's trust," commission President Michael Picker said. "The CPUC failed to keep vigilant. Lives were lost. Numerous people were injured. Homes were destroyed. We must do everything we can to ensure that nothing like this happens again." The company's chief executive officer said in a written statement that PG&E is working to become the safest energy company in the United States. "Since the 2010 explosion of our natural gas transmission pipeline in San Bruno, we have worked hard to do the right thing for the victims, their families and the community of San Bruno," Tony Earley said. "We are deeply sorry for this tragic event, and we have dedicated ourselves to re-earning the trust of our customers and the communities we serve. The lessons of this tragic event will not be forgotten." On September 9, 2010, a section of PG&E pipeline exploded in San Bruno, killing eight people and injuring more than 50 others. The blast destroyed 37 homes. PG&E said it has paid more than $500 million in claims to the victims and victims' families in San Bruno, which is just south of San Francisco. The company also said it has already replaced more than 800 miles of pipe, installed new gas leak technology and implemented nine of 12 recommendations from the National Transportation Safety Board.
According to its website, PG&E has 5.4 million electric customers and 4.3 million natural gas customers. The Los Angeles Times reported the previous record penalty was a $146 million penalty against Southern California Edison Company in 2008 for falsifying customer and worker safety data. CNN's Jason Hanna contributed to this report.

Summary (incomplete coreference):
(CNN) The California Public Utilities Commission on Thursday said it is ordering Pacific Gas & Electric Co. to pay a record $1.6 billion penalty for unsafe operation of its gas transmission system, including the pipeline rupture that killed eight people in San Bruno in September 2010. According to its website, PG&E has 5.4 million electric customers and 4.3 million natural gas customers.

Document:
(CNN) It's been a busy few weeks for multiples. The first set of female quintuplets in the world since 1969 was born in Houston on April 8, and the parents are blogging about their unique experience. Danielle Busby delivered all five girls at the Woman's Hospital of Texas via C-section at 28 weeks and two days, according to CNN affiliate KPRC. Parents Danielle and Adam and big sister Blayke are now a family of eight. The babies are named Ava Lane, Hazel Grace, Olivia Marie, Parker Kate and Riley Paige. "We are so thankful and blessed," said Danielle Busby, who had intrauterine insemination to get pregnant. "I honestly give all the credit to my God. I am so thankful for this wonderful hospital and team of people here. They truly all are amazing." You can learn all about their journey at their blog, "It's a Buzz World." Early news reports said the Busby girls were the first all-female quintuplets born in the U.S. But a user alerted CNN to news clippings that show quintuplet girls were born in 1959 to Charles and Cecilia Hannan in San Antonio. All of the girls died within 24 hours. Like the Busby family, Sharon and Korey Rademacher were hoping for a second child. When they found out what they were having, they decided to keep it a secret from family and friends. That's why they didn't tell their family the gender of baby No. 2 - or that Sharon was actually expecting not one but two girls, according to CNN affiliate WEAR. And when everyone arrived at West Florida Hospital in Pensacola, Florida, after Sharon gave birth March 11, they recorded everyone's reactions to meeting twins Mary Ann Grace and Brianna Faith. The video was uploaded to YouTube on Saturday and has been viewed more than 700,000 times. Could you keep it a secret?

Summary (incomplete discourse):
The first set of female quintuplets in the world since 1969 was born in Houston on April 8, Danielle Busby delivered all five girls at the Woman's Hospital of Texas via C-section at 28 weeks and two days, the Busby girls were the first all-female quintuplets

The summary is generated by the Oracle (disco) (Xu et al., 2020) extractive system. All extracted elementary discourse units are underlined in the document. The last summary sentence omits the "born in the U.S." part, which may make readers think that the Busby girls are the first all-female quintuplets anywhere, not only in the U.S.

Document:
(CNN) It didn't seem like a fair fight. On one side were hulking football players and pro wrestlers, competing as teams of two to eat as many pounds of steak as they could, combined, in one hour. On another was a lone 124-pound mother of four. And sure enough, in the end, Sunday's contest at Big Texan Steak Ranch in Amarillo, Texas, wasn't even close. Molly Schuyler scarfed down three 72-ounce steaks, three baked potatoes, three side salads, three rolls and three shrimp cocktails - far outpacing her heftier rivals. That's more than 13 pounds of steak, not counting the sides. And she did it all in 20 minutes, setting a record in the process. "We've been doing this contest since 1960, and in all that time we've never had anybody come in to actually eat that many steaks at one time," Bobby Lee, who co-owns the Big Texan, told CNN affiliate KVII. "So this is a first for us, and after 55 years of it, it's a big deal." In fairness, Schuyler isn't your typical 124-pound person. The Nebraska native, 35, is a professional on the competitive-eating circuit and once gobbled 363 chicken wings in 30 minutes. Wearing shades and a black hoodie, Schuyler beat four other teams on Sunday, including pairs of football players and pro wrestlers and two married competitive eaters. She also broke her own Big Texan record of two 72-ounce steaks and sides, set last year, when she bested previous record-holder Joey "Jaws" Chestnut. The landmark Big Texan restaurant offers its "72-ounce Challenge" daily to anyone who can eat the massive steak, plus fixings, in under an hour. Those who can't do so must pay $72 for the meal. Schuyler, who now lives in Sacramento, California, won $5,000 for her efforts. Her feat will be submitted to Guinness World Records. But mostly, she just seemed pleased to enjoy a hearty meal on the house. "It's free, so I'm pretty happy about that," she told KVII. "Otherwise it would have cost me about 300 bucks."

Summary (other misleading information):
On one side were hulking football players and pro wrestlers, competing as teams of two to eat as many pounds of steak as they could, combined, in one hour. And sure enough, in the end, Sunday's contest at Big Texan Steak Ranch in Amarillo, Texas, wasn't even close. That's more than 13 pounds of steak, not counting the sides.