BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics

The proliferation of automatic faithfulness metrics for summarization has produced a need for benchmarks to evaluate them. While existing benchmarks measure the correlation with human judgements of faithfulness on model-generated summaries, they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., indicate lower faithfulness as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find BUMP complements existing benchmarks in a number of ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) unlike non-pair-based datasets, BUMP can be used to measure the consistency of metrics, and reveals that the most discriminative metrics tend not to be the most consistent, and 3) unlike datasets containing generated summaries with multiple errors, BUMP enables the measurement of metrics’ performance on individual error types.


Introduction
Although modern abstractive summarization systems have improved drastically in their ability to produce fluent text, their ability to generate text that is factually grounded in the source article remains an issue (Kryscinski et al., 2020). This phenomenon has inspired the NLP community to develop faithfulness evaluation metrics (Fabbri et al., 2022; Honovich et al., 2021; Scialom et al., 2021) that automatically measure the extent to which abstractive summarization systems produce unfaithful summaries, i.e., summaries that contain information that cannot be verified by the source article.
As the number of these automatic faithfulness metrics has increased, there has arisen a corresponding need for benchmarks that evaluate their relative strengths. To satisfy this need, researchers have developed datasets such as FRANK (Pagnoni et al., 2021) and TRUE (Honovich et al., 2022) that are composed of model-generated summaries along with human-annotated faithfulness levels. Although these datasets are useful for evaluating the degree to which faithfulness metrics correlate with human judgements and can discriminate unfaithful summaries, a number of factors limit the conclusions that can be drawn from them. For one, because model summaries can vary in terms of length, content, and number of errors, these benchmarks are ill-suited for drawing conclusions about the consistency (Gabriel et al., 2021) of metrics, i.e., whether metric scores decrease as summaries become increasingly unfaithful, as well as their sensitivity to specific types of errors. Furthermore, because the summaries are machine-generated, these benchmarks also cannot evaluate whether metrics can detect human-written unfaithful summaries.
To enable research on these topics, we present BUMP, a Benchmark of Unfaithful Minimal Pairs: a dataset of 889 minimally different summary pairs where all unfaithful summaries are generated by human annotators. BUMP is constructed from the CNN/DailyMail dataset (Hermann et al., 2015). As illustrated in Figure 1, given an article and its reference summary, we ask a human annotator to edit the reference summary in a minimal way such that the edited summary exhibits one unfaithfulness error. We design two tasks for performance comparisons: 1) taxonomy-based edits, where a specific unfaithfulness error type is required according to our proposed taxonomy, and 2) free-style edits, where no error constraints are imposed.
We use BUMP to study the ability and performance consistency of faithfulness evaluation metrics in differentiating unfaithful summaries from faithful ones. Similar to how minimal pairs are used to diagnose linguistic knowledge of language models (Marvin and Linzen, 2018; Warstadt et al., 2020), the minimal summary pairs in BUMP allow targeted tests of a metric's consistency on different types of errors (Table 1). This setup minimizes the effect of confounding factors that affect similar analyses (e.g., Pagnoni et al. (2021) and Tang et al. (2022)), such as multiple errors occurring in the same summary. We evaluate standard and state-of-the-art faithfulness metrics on BUMP using meta-evaluation metrics that target two phenomena: 1) consistency, i.e., the fraction of unfaithful summaries that receive a lower score than their corresponding faithful summaries, and 2) discriminability, i.e., the metric's ability to classify unfaithful vs. faithful summaries as measured by ROC AUC.

Figure 1: Example from the BUMP dataset. An annotator constructs an unfaithful summary containing an extrinsic entity error (Section 3.2) by replacing the word "homegrown" in the original summary with the word "foreign". The original and edited summaries form a minimal unfaithful summary pair. Faithfulness metrics are evaluated on both the original and edited summaries and compared to measure whether the metric is consistent; e.g., in this example QuestEval and CoCo are consistent, while FactCC is not.
Our results (Section 4) yield a number of useful findings: 1) BUMP differs substantially from existing benchmarks in that its summaries are harder to discriminate (ROC AUC scores between 50-70% vs. 80-90%) and are less probable under SOTA summarization models. 2) Discriminability does not equal consistency; interestingly, the most consistent metrics (BARTScore, CoCo) tend to have poor discriminability. 3) Some error types are harder than others; in particular, metrics seem to uniformly struggle with summaries containing coreference and predicate errors.
In sum, our contributions are three-fold: (i) We build a benchmark of human-generated unfaithful minimal pairs (BUMP) for evaluating faithfulness metrics. (ii) We show that human-generated unfaithful errors are substantially different from, and more challenging than, model-generated ones. (iii) We demonstrate how BUMP provides insights into both the consistency and discriminative ability of faithfulness metrics on different error types, complementing insights from existing benchmarks. BUMP is available at: https://omitted.link.

Related Work
Standard evaluation metrics for text generation tasks, e.g., BLEU and ROUGE, do not correlate well with human judgements of factual alignment in summarization settings (Kryscinski et al., 2019; Maynez et al., 2020). This has motivated the development of automated faithfulness metrics that quantify factual alignment in one of several ways: using NLI to measure the degree of entailment between the source article and summary (Kryscinski et al., 2020; Goyal and Durrett, 2020), using question answering (QA) models to measure whether questions derived from the source can be answered by the summary and vice versa (Durmus et al., 2020; Scialom et al., 2021), or comparing summary probabilities when relevant information is removed from the source (Xie et al., 2021).
Systematic comparison of faithfulness metrics is performed using one of two classes of benchmarks: 1) machine-generated summaries paired with human-annotated faithfulness levels (Pagnoni et al., 2021; Tang et al., 2022), and 2) summary pairs pertaining to the same source article where one summary is faithful and the other is unfaithful (Falke et al., 2019; Gabriel et al., 2021). While both types of benchmarks can evaluate a metric's ability to discriminate unfaithful summaries, the latter additionally allows one to test for consistency, i.e., whether metrics assign higher values to more faithful summaries. The BUMP dataset introduced in this paper belongs to the second class of benchmarks but has a number of unique properties. First, unlike both Falke et al. (2019) and Gabriel et al. (2021), the unfaithful summaries in BUMP are human-written. In addition, the unfaithful summaries in BUMP are minimally different, in the sense that only a single error differentiates the faithful and unfaithful summary. As shown in Section 4, this produces summary pairs that are substantially more challenging for metrics to differentiate. Inspired by the use of minimal pairs to diagnose linguistic knowledge of language models (Marvin and Linzen, 2018; Warstadt et al., 2020), the benefit of this approach is that it allows targeted tests of a metric's consistency on different types of errors (Section 3.2) while minimizing the effect of confounding factors. Thus, unlike other benchmarks with error type annotations (Pagnoni et al., 2021; Tang et al., 2022), results on BUMP are not complicated by issues such as multiple errors appearing in the same summary.

Benchmark of Unfaithful Minimal Pairs (BUMP)
In this section, we describe the construction of the BUMP dataset. We first describe how data sources are selected to build BUMP (3.1). We then describe the two annotation tasks (3.2 and 3.3), where Task 1 is taxonomy-based (a specific error type is required for the edited summary) and Task 2 allows free-style edits (i.e., no error constraints are imposed).

Dataset
For Task 1, we randomly select 100 article-summary pairs from the test set of the CNN/DailyMail dataset (Hermann et al., 2015).1 For Task 2, we select an additional 100 random article-summary pairs. Both tasks are performed via Amazon Mechanical Turk.2

1 We do not annotate samples from the XSum dataset (Narayan et al., 2018) since the reference summaries are frequently unfaithful (Maynez et al., 2020).

Task 1: Taxonomy-based Unfaithful Summaries
For Task 1, we first define a taxonomy, detailed in Table 1. In our taxonomy, the intrinsic/extrinsic distinction applies only to predicate, entity, and circumstance errors, since for a coreference error it is generally ambiguous whether an erroneous pronoun/reference that does not exist in the source article should be regarded as intrinsic or extrinsic.
In total this results in seven different error types.
Given an article-summary pair, for each of the seven error types in this taxonomy, we ask the annotator to introduce an error of the required type through a minimal edit to the reference summary. All <article, summary, error type> assignments in Amazon Mechanical Turk are shuffled and there is no annotation repetition per assignment. This increases the chance that edits of the same summary will be made by different annotators. Additional details regarding qualification tests and annotation instructions are presented in Appendix A.
After data collection, we manually check the validity of each edit. For cases where the edits do not match the required error types, we relabel them with the corrected error types based on our taxonomy. The dataset statistics after correction are shown in Table 2. The incorrect response rate is 16%, suggesting that annotators generally respond with the required error types.
Comparison with Other Taxonomies. Our taxonomy is adapted from the one in FRANK (Pagnoni et al., 2021) by including semantic frame errors (Predicate Error, Entity Error, and Circumstance Error) and Coreference Error, and removing error types (e.g., grammatical error) that might overlap with others and cause confusion in the annotation task. To further categorize each semantic frame error, we adopt the notions of Intrinsic and Extrinsic errors (Maynez et al., 2020; Goyal and Durrett, 2020; Tang et al., 2022). Note that we do not simply categorize errors as Intrinsic or Extrinsic, as semantic frame errors can better instruct annotators to create summaries with diverse unfaithful errors.
2 https://www.mturk.com/; annotation guidelines and interfaces are detailed in Appendices A and B.

Table 1: Taxonomy of unfaithfulness error types.

Error Type | Description
Predicate Error | The predicate in the summary statement is inconsistent with the source article.
Entity Error | The subject/object of a predicate is inconsistent with the source article.
Circumstance Error | Location and/or time of the event of the predicate is wrong.
Coreference Error | A pronoun/reference with wrong or nonexistent antecedent.
Intrinsic Error | Error derived from information within the source article.
Extrinsic Error | Error contains information not present in the source article.

Task 2: Free-style Unfaithful Summaries
In addition to the taxonomy-based Task 1, we also conduct a separate task, Task 2, where annotators can introduce errors by editing reference summaries in any way they want, i.e., free-style editing. Specifically, only annotators who did not participate in the qualification test of Task 1 are qualified to participate in this task; in this way, we ensure the edited summaries in Task 2 are not constrained to any known error types. The goal of Task 2 is to understand how human-generated unfaithful summaries may vary, and how the performance of faithfulness evaluation metrics changes accordingly, when there are no error type constraints.
To post-process the data collected in Task 2, we manually assign an error type to each data point based on our error type taxonomy from Task 1. As in Task 1, we also allow an "Other" error type if the unfaithful summary cannot be described by our taxonomy or contains more than one error. The "Other" label occurs at a rate of only 2.5%, confirming the coverage of our taxonomy. For more details on Task 2, please see Appendix B.
Discussion of Tasks 1 and 2. For both tasks, we ask annotators to introduce only one error (by editing the reference summary in a minimal way). We acknowledge that some reference summaries may be unfaithful in the first place; nevertheless, for both tasks, edited summaries are based on reference summaries, which ensures that the edited summaries are always less faithful than the reference summaries.

Table 2: Dataset statistics (number of summary pairs per error type).

Error Type | Subtype | Task 1 | Task 2
Predicate | Intrinsic | 116 | 17
Predicate | Extrinsic | 76 | 28
Entity | Intrinsic | 128 | 28
Entity | Extrinsic | 115 | 62
Circumstance | Intrinsic | 82 | 22
Circumstance | Extrinsic | 78 | 33
Coreference | - | 98 | 1
Other | - | 0 | 5
Total | | 693 | 196

Evaluations of Faithfulness Evaluation Metrics
In this section, we first describe the faithfulness evaluation metrics benchmarked on our curated dataset (4.1). The meta-evaluation measures and the performance of each faithfulness evaluation metric are then discussed (4.2 and 4.3). 3

Faithfulness Metrics
We investigate the abilities of both standard generic metrics and state-of-the-art faithfulness-specific metrics to distinguish faithful summaries from their minimally edited counterparts.

Standard n-Gram-Based Metrics: We evaluate two n-gram-based metrics: BLEU (Papineni et al., 2002) and ROUGE (specifically ROUGE-2 Precision) (Lin, 2004). Since these metrics were not proposed for evaluating faithfulness, we instead use them to compute the similarity between the input article and a (reference or edited) summary.
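For illustration, the following sketch shows one way such article-summary similarity scores could be computed. The rouge-score and nltk packages, the whitespace tokenization, and the smoothing choice are assumptions made for this sketch rather than the exact configuration used in our experiments.

```python
# Hedged sketch: scoring article-summary similarity with ROUGE-2 precision and BLEU.
# Library choices and tokenization here are illustrative assumptions.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def ngram_faithfulness_scores(article: str, summary: str) -> dict:
    """Treat the article as the 'reference' and the summary as the 'hypothesis'."""
    # ROUGE-2 precision: fraction of summary bigrams that also appear in the article.
    r_scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
    rouge2_precision = r_scorer.score(article, summary)["rouge2"].precision

    # BLEU of the summary against the article, with smoothing for short texts.
    bleu = sentence_bleu(
        [article.split()],
        summary.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    return {"rouge2_precision": rouge2_precision, "bleu": bleu}


if __name__ == "__main__":
    article = "Officers seized more than 2 tons of cocaine aboard a sailboat."
    summary = "Officers seized more than 2 tons of cocaine."
    print(ngram_faithfulness_scores(article, summary))
```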

Meta-Evaluation
Each faithfulness metric takes an input article and a summary and outputs a numerical score, where the summary is either the original faithful reference summary or the unfaithful summary edited by human annotators. Note that we normalize the direction of the metric score depending on the original definition of the metric (e.g., FactCC assigns a higher probability score to a more unfaithful summary), so that a higher score always means more faithful. We quantify the difference between faithfulness metrics on BUMP using two measurements: consistency and ROC AUC. Consistency measures the success rate of a metric assigning a lower score to the more unfaithful summary given two summaries for the same article. This is similar to the consistency measurement introduced by Gabriel et al. (2021), which studies the correlation between the faithfulness metric and the number of errors in a summary. ROC AUC instead measures the overall capability of these metrics to discriminate faithful from unfaithful content in an input summary. This is formulated as a binary classification problem (i.e., faithful and unfaithful summaries are assigned binary labels, as in Honovich et al. (2022)).
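As a concrete sketch, both measurements can be computed directly from per-pair metric scores, assuming scores have already been normalized so that higher always means more faithful. The scikit-learn dependency and the function names below are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of the two meta-evaluation measures over minimal summary pairs.
from typing import Sequence
from sklearn.metrics import roc_auc_score


def consistency(faithful_scores: Sequence[float], unfaithful_scores: Sequence[float]) -> float:
    """Fraction of minimal pairs where the unfaithful summary scores strictly lower."""
    assert len(faithful_scores) == len(unfaithful_scores)
    wins = sum(u < f for f, u in zip(faithful_scores, unfaithful_scores))
    return wins / len(faithful_scores)


def discriminability(faithful_scores: Sequence[float], unfaithful_scores: Sequence[float]) -> float:
    """ROC AUC for classifying faithful (label 1) vs. unfaithful (label 0) summaries."""
    labels = [1] * len(faithful_scores) + [0] * len(unfaithful_scores)
    scores = list(faithful_scores) + list(unfaithful_scores)
    return roc_auc_score(labels, scores)


if __name__ == "__main__":
    faithful = [0.91, 0.72, 0.85]
    unfaithful = [0.64, 0.75, 0.40]
    print(consistency(faithful, unfaithful))       # pairwise ranking success rate
    print(discriminability(faithful, unfaithful))  # overall separability of the scores
```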

Results
Consistency. The consistency results for the two tasks and all metrics are reported in Table 3. In terms of difficulty per error type, (i) for Task 1, extrinsic entity errors are generally the easiest, while all metrics but BARTScore struggle with intrinsic predicate errors; (ii) for Task 2, intrinsic entity errors are the hardest. This implies that when annotators are not presented with any error types, the introduced error styles may differ from those in Task 1 (see the Case Study section), potentially causing inconsistencies for metrics across the two tasks. Nevertheless, we observe that for both tasks, intrinsic errors are more challenging than extrinsic ones for all metrics except FactCC in Task 2. This is likely because intrinsic errors can be derived from the original article, making them more subtle, while extrinsic errors contain words that do not appear in the original article.
In terms of overall performance (all error types considered), BARTScore has the highest consistency in both tasks, even though BARTScore was not proposed specifically for faithfulness evaluation. Other metrics that rank in the top four in both tasks include QAFactEval and CoCo. By comparison, Q² and FactCC have the worst consistency, even worse than the n-gram-based metrics ROUGE and BLEU; nevertheless, they exhibit different rankings in terms of ROC AUC (see the next section for details).
ROC AUC. The ROC AUC results for detecting unfaithful summaries with all metrics are presented in Table 4. We observe some trends similar to those in the consistency study: (i) Content distributions between Tasks 1 and 2 differ, resulting in inconsistent performance per error type, i.e., entity errors are the easiest, while coreference and intrinsic predicate errors are generally the hardest for Task 1; however, there is no obvious difference among error types for Task 2. (ii) Metrics generally show worse performance on intrinsic errors than extrinsic ones.
Additionally, we notice that the overall ROC AUC ranking of metrics differs from their consistency ranking. In particular, BARTScore drops from first to fifth, while Q² improves significantly from second to last to second. QAFactEval consistently exhibits high performance and even ranks first under ROC AUC, while n-gram-based metrics, e.g., ROUGE-2 and BLEU, consistently show the worst performance (as expected). In general, metrics specifically proposed for faithfulness evaluation rank higher w.r.t. ROC AUC than those designed for generic NLG evaluation. For our two meta-evaluation protocols, consistency is suitable for pairwise ranking of two summaries for a given input article, while ROC AUC is more adequate for evaluating the absolute capacity of unfaithful summary detection. If a metric has high consistency but low ROC AUC, it implies that the scores for faithful and unfaithful summaries overlap frequently. This overlap makes it challenging to establish a clear decision boundary for classification. Therefore, to improve the binary classification capability of metrics with high consistency, more calibration is needed to increase the score gap between faithful and unfaithful summaries.
Comparisons with Model-Generated Unfaithful Summaries. We compare the generation probabilities of our edited summaries to those of summaries generated via beam search by a BART-based summarizer for the same set of documents in our dataset. We report the difference between these generation probabilities, normalized by text length, in Figure 2, and find that our edited summaries differ considerably from model generations in terms of model generation probabilities. This highlights that existing metrics might not work well on summaries of various styles, and experiments are needed to verify their effectiveness on human-generated unfaithful summaries.
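For illustration, the sketch below shows one way such length-normalized (per-token) log-probabilities could be obtained from a BART-based summarizer with Hugging Face Transformers; the checkpoint name and normalization details are assumptions and may differ from the exact setup behind Figure 2.

```python
# Hedged sketch: length-normalized log-probability of a summary under a BART summarizer.
# The checkpoint name is an illustrative assumption.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "facebook/bart-large-cnn"  # assumed CNN/DailyMail summarizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def avg_log_prob(article: str, summary: str) -> float:
    """Average per-token log-probability of `summary` conditioned on `article`."""
    inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
    labels = tokenizer(summary, return_tensors="pt", truncation=True).input_ids
    # The seq2seq cross-entropy loss is averaged over summary tokens, so its
    # negation is a length-normalized log-probability.
    loss = model(**inputs, labels=labels).loss
    return -loss.item()


# A more negative value means the summary is less probable under the model;
# reference and edited summaries can then be compared pair by pair.
```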
Furthermore, the ROC AUC scores evaluated on our dataset are generally much smaller (with many values close to the random baseline) than those reported on existing datasets in TRUE (Honovich et al., 2022), which again indicates the difficulty existing metrics have in detecting unfaithful human-generated summaries that differ minimally from faithful ones.
Finally, in TRUE (Honovich et al., 2022), the performance ranking in terms of ROC AUC (the same problem setting as our binary classification task), averaged over FRANK (Pagnoni et al., 2021), SummEval (Fabbri et al., 2021), and QAGS-C (Wang et al., 2020) (i.e., three datasets containing CNN/DailyMail), is SummaC > Q² > BARTScore > BERTScore > FactCC > BLEURT > QuestEval, where the top-ranked metrics differ significantly from the ranking derived from our dataset; e.g., SummaC consistently exhibits much worse ROC AUC than Q² on BUMP. We think BUMP complements existing benchmarks and allows a more comprehensive analysis of faithfulness metrics in future studies.
Case Study. Here we provide a qualitative analysis of the annotations in BUMP and highlight several noteworthy patterns. We pick one representative example of each pattern from Tasks 1 and 2 and list them in Table 5 and Table 6, respectively.
For Task 1, we identify four interesting patterns in the edited summaries. The first pattern corresponds to samples where a perfectly edited hallucination is introduced into the edited summary, as in Example 1. In the second pattern (Example 2), the annotators regard the quantity of a noun object as a circumstance and edit the quantity, thereby mistakenly labeling Entity Errors as Circumstance Errors. Similarly, in some cases (Example 3), the annotators mistake Intrinsic Errors for Extrinsic Errors by failing to recognize that the edits are mentioned in the original article. In the last pattern (Example 4), the edited summaries successfully fool the evaluation metrics by receiving a high faithfulness score (0.896 from QAFactEval in this example).
For Task 2, we identify two major patterns. Similar to Task 1, the first pattern corresponds to samples where a perfectly edited hallucination is introduced, as shown in Example 1 in Table 6. In the second pattern (Example 2), the edits fool the metrics into predicting a high faithfulness score (0.899 from QAFactEval in this example).

Conclusion
In this paper, we presented a benchmark of unfaithful minimal pairs (BUMP) to evaluate faithfulness metrics. Unlike prior work where all unfaithful summaries are model-generated, each unfaithful summary in BUMP is created by a minimal human edit that introduces one unfaithfulness error into a reference summary. We showed that unfaithful summaries in BUMP are substantially different from those in other datasets and more challenging to identify. We analyzed faithfulness metrics according to the seven error types in our proposed taxonomy via the consistency and ROC AUC protocols. Evaluation results show that no metric simultaneously achieves the best performance under both protocols. Nevertheless, intrinsic errors are consistently more difficult than extrinsic ones for all metrics. BUMP therefore provides a valuable resource that complements existing benchmarks for testing faithfulness evaluation metrics.

Table 5: Representative examples of Task 1 edits discussed in the Case Study.

Example 2
Article (excerpt): (CNN) French customs officers say they have seized more than 2 tons of cocaine aboard a sailboat that was falsely flying an American flag in the Caribbean. The drugs, whose value is estimated at more than $105 million, are the biggest cocaine seizure ever carried out by French authorities, said Michael Lachaux, director of customs operations in Martinique. Officers arrested one Venezuelan and two Spanish citizens who were on board the vessel off the coast of Martinique on Wednesday, Lachaux said in an interview with the radio station France Info on Saturday. Martinique is an overseas department of France. In November, French customs officials seized nearly 250 kilograms (550 pounds) of cocaine on a vessel that was also off the coast of Martinique, according to authorities.
Reference summary: The value of the drugs is estimated at more than $105 million. Officers arrested one Venezuelan and two Spanish citizens on board the vessel.
Edited summary (annotated as Intrinsic Circumstance Error; relabeled as Intrinsic Entity Error): The value of the drugs is estimated at more than $250 million. Officers arrested one Venezuelan and two Spanish citizens on board the vessel.

Example 3
Article (excerpt): Lightning, floods and a deluge of hailstones descended on St Louis Tuesday as powerful storms pummeled the mid-United States. Roads around the Missouri city were flooded in the intense downpour, with one town recording more than two inches of rain in half an hour. Several crashes followed the atrocious driving weather, though no injuries were immediately reported. Muds and floods: Roads around St Louis, Missouri, were deluged with rainwater following fierce thunderstorms. ...
Reference summary: St Louis was hit Tuesday by flash floods. A nearby town had more than two inches of rain in less than half an hour.
Edited summary (annotated as Extrinsic Predicate Error; relabeled as Intrinsic Predicate Error): St Louis pummeled Tuesday by flash floods. A nearby town had more than two inches of rain in less than half an hour.

Example 4
Article (excerpt): Nigel Farage will say today that allowing refugees into Europe could lead to half a million Islamic extremists coming to our countries and posing a direct threat to our civilisation. The Ukip leader will say the boatloads of people trying to get to the continent from North Africa could provide a cover for jihadis wanting to do harm. Mr Farage will make his comments in a debate at the European Parliament in Strasbourg in the wake of more than 1,000 migrants drowning in the Mediterranean. Scroll down for video. Warning: The Ukip leader will say the boatloads of people trying to get to the continent from North Africa could provide a cover for jihadis. He will warn that if the EU agrees to give refuge to migrants from North Africa, it could lead to millions being given passports that allow them to move to Britain. A source close to Mr Farage said he was taking time out from the campaign trail in Britain to focus the election debate on immigration. ...
Reference summary: Nigel Farage says refugees into Europe could lead to influx of extremists. Ukip leader says it could provide a cover for jihadis wanting to do harm. He will make comments in a debate at the European Parliament today.
Edited summary (Intrinsic Predicate Error): Nigel Farage says refugees into Europe could lead to influx of extremists. Ukip leader says it could provide a cover for jihadis wanting to do harm. He will accuse in a debate at the European Parliament today.

Limitations
While BUMP contains 889 human-written unfaithful summaries, we ask only one annotator to create one unfaithful summary for each document-summary pair and error type. We acknowledge that having more unfaithful summaries created by different annotators could improve summary diversity for better evaluation of faithfulness metrics, but unfortunately that is beyond our currently available resources.

Ethics Statement
The collection of BUMP involves human annotations. The human annotators are provided with clear task instructions and informed of the conditions under which they would be qualified and disqualified. For compensation, we pay $3.00 per assignment in the qualification task and $0.50-$1.00 per assignment in the full task for both Tasks 1 and 2.

Table 6: Representative example of a Task 2 edit discussed in the Case Study.

Example 2
Article (excerpt): Washington (CNN) It's the mistake that Hillary Clinton won't make again: ignoring her gender. ... Clinton could be helped by an improving climate for women in politics. ... Her new campaign website is plastered with pictures of women, with Clinton, in a blue cloth coat, holding a cup of coffee listening intently to another woman as a man looks on. The emphasis on women - and the progress of women - as a possible underlying campaign theme is a reversal of her 2008 strategy, which stressed experience and competence over history. ...
Reference summary: Hillary Clinton could be helped by an improving climate for women in politics. Republicans hope the gender play backfires and that voters are fatigued by identity politics. The emphasis on women as a possible campaign theme is a reversal of her 2008 strategy.
Edited summary: Hillary Clinton could be helped by an improving climate for women in politics. Republicans hope the gender play backfires and that voters are enraged by identity politics. The emphasis on women as a possible campaign theme is a reversal of her 2008 strategy.

A Details of Task 1: Taxonomy-based Unfaithful Summaries

A.1 Qualification Task

In the qualification task for Task 1, we ask workers to read a news article and seven pairs of summaries. For each pair of summaries, the first summary is the correct reference summary, and the second summary is an unfaithfully edited summary that contains one of the seven error types. We then ask the workers to select one answer from the seven error types to indicate which type of error is introduced in the edited unfaithful summary.

A.2 Full Task
The instructions and the task interface for Task 1 are shown in Figures A5 to A7.
In the full task for Task 1, unlike the qualification task, we ask the workers to read a news article from CNN/DailyMail (Hermann et al., 2015) and one reference summary for the article. We then ask the workers to edit the reference summary to introduce the specified error type. If they cannot introduce the specified error type based on the reference summary, they can write "N/A" to indicate that it is not possible. There are 18 samples in the Task 1 data that were annotated as "N/A", all of which were reviewed by the authors and re-annotated with correct edits as a post-processing step to ensure the completeness of the dataset.
In addition, for Task 1, to reduce worker confusion between Circumstance Error and Entity Error, we explicitly specify that Circumstance Errors should only be erroneous edits concerning the time, duration, or location of an event, and that changing the quantity of a noun is not considered a circumstance error.

B Details of Task 2: Free-style Unfaithful Summaries

B.1 Qualification Task

Figure A8 and Figure A9 show the interface for the Task 2 qualification task. In this task, we show the workers four pairs of news articles and summaries and ask them whether the summaries are accurate based on the associated articles. Among the four pairs, three are inaccurate and one is accurate. Only workers who answered all four pairs correctly passed the qualification task. We launched three batches in total with nine assignments per batch, and eight workers passed the qualification task.

B.2 Full Task
Full task instructions are shown in Figures A10 and A11. Unlike Task 1, we do not list any potential error types, so as to allow free-style editing. Furthermore, we also do the following to ensure the quality of edited summaries:

• For minimal edits, we explicitly ask annotators not to write from scratch, but to introduce only one error on top of the given reference summary.
• In the pilot study, we noticed that some edited summaries simply removed or added sentences or phrases (such data points are removed from the final released data); we therefore added instructions requiring the edited and reference summaries to contain a similar amount of information about the given article (i.e., similar coverage).
• The edited summaries should be grammatically correct.
• The edited summaries should be plausible and adhere to common sense.
• Some examples of edited summaries are given in the task instructions.
For the data post-processing of this task, we (the authors of this paper) manually assign an error type to each unfaithful summary according to our error type taxonomy from Task 1. We acknowledge that there may be cases where the induced unfaithful error belongs to more than one error type or cannot be categorized by any error type. When this happens, we assign it the "Other" type. Our final results show that the "Other" label occurs at a rate of only 2.5%, indicating that our error type taxonomy covers most cases.