CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization

We study generating abstractive summaries that are faithful and factually consistent with the given articles. A novel contrastive learning formulation is presented, which leverages both reference summaries, as positive training data, and automatically generated erroneous summaries, as negative training data, to train summarization systems that are better at distinguishing between them. We further design four types of strategies for creating negative samples, to resemble errors made commonly by two state-of-the-art models, BART and PEGASUS, found in our new human annotations of summary errors. Experiments on XSum and CNN/Daily Mail show that our contrastive learning framework is robust across datasets and models. It consistently produces more factual summaries than strong comparisons with post error correction, entailment-based reranking, and unlikelihood training, according to QA-based factuality evaluation. Human judges echo the observation and find that our model summaries correct more errors.


Introduction
Large pre-trained Transformers have yielded remarkable performance on abstractive summarization (Liu and Lapata, 2019; Lewis et al., 2020; Zhang et al., 2020a) with impeccable fluency, yet their summaries often contain factually inconsistent content (Maynez et al., 2020; Zhang et al., 2020b; Goyal and Durrett, 2020), even for state-of-the-art models. Three types of remedies have been proposed: running a separately learned error correction component (Dong et al., 2020), removing noisy training samples (Nan et al., 2021; Goyal and Durrett, 2021), and modifying the Transformer architecture (Huang et al., 2020; Zhu et al., 2021). Yet they either rely on heuristically created data for error handling, falling short of generalization, or require learning a large number of new parameters, and summary informativeness is often sacrificed.

XSum Article: The Fermanagh MLA Phil Flanagan tweeted after Tom Elliott appeared on a BBC radio programme in May 2014. . . . "I wonder if he will reveal how many people he harassed and shot as a member of the UDR.". . .
Contrastive learning (our method): A Sinn Féin MLA has been ordered to apologise and pay compensation to a former member of the Ulster Defence Regiment (UDR).
Cross-entropy: A Sinn Féin MLA has agreed to pay compensation to a former Ulster Unionist Party (UDR) MP after he tweeted that he had harassed and shot people as a member of the party.
Entailment reranking: A Sinn Féin MLA has agreed to pay compensation to a former Ulster Unionist Party (UDR) councillor for a tweet he sent about him.
Unlikelihood: An MLA has been ordered to apologise and pay compensation to a former loyalist MP for a remark he made about him while serving in the Ministry of Defence.

Figure 1: Sample article and system summaries by different methods. Our contrastive learning model trained on low confidence system outputs correctly generates the full name. Comparisons using cross-entropy loss, beam reranking by entailment scores (Kryscinski et al., 2020), and unlikelihood objective (Welleck et al., 2020) over negative samples all produce unfaithful content.

Our goal is to train abstractive summarization systems that generate both faithful and informative summaries in an end-to-end fashion. We observe that, while the commonly used maximum likelihood training optimizes over references, there is no guarantee for the model to distinguish references from incorrect generations (Holtzman et al., 2020;Welleck et al., 2020). Therefore, potential solutions reside in designing new learning objectives that can effectively inform preferences of factual summaries over incorrect ones.
Concretely, we hypothesize that including factually inconsistent summaries (i.e., negative samples) for training, in addition to references (i.e., positive samples), lets models become better at differentiating these two types of summaries. Although using negative samples has been effective at text representation learning, e.g., word2vec (Mikolov et al., 2013) and BERT (Devlin et al., 2019), there exist two major challenges for it to succeed in concrete language tasks. First, a suitable training objective is critical to avoid performance degradation (Saunshi et al., 2019). Second, it is nontrivial to construct "natural" samples that mimic the diverse errors made by state-of-the-art systems that vary in words and syntax (Goyal and Durrett, 2021).
To address both challenges, we first propose a new framework, CLIFF, that uses contrastive learning for improving faithfulness and factuality of the generated summaries. 1 Contrastive learning (CL) has obtained impressive results on many visual processing tasks, such as image classification (Khosla et al., 2020;Chen et al., 2020) and synthesis (Park et al., 2020;Zhang et al., 2021b). Intuitively, CL improves representation learning by compacting positive samples while contrasting them with negative samples. Here, we design a task-specific CL formulation that teaches a summarizer to expand the margin between factually consistent summaries and their incorrect peers.
Moreover, we design four types of strategies with different variants to construct negative samples by editing reference summaries via rewriting entity-/relation-anchored text, and using system generated summaries that may contain unfaithful errors. Importantly, these strategies are inspired by our new annotation study on errors made by state-of-the-art summarizers, namely models fine-tuned from BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020a), on two benchmarks: XSum (Narayan et al., 2018) and CNN/DailyMail (CNN/DM) (Hermann et al., 2015).
We fine-tune pre-trained large models with our contrastive learning objective on XSum and CNN/DM. Results based on QuestEval (Scialom et al., 2021), a QA-based factuality metric of high correlation with human judgments, show that our models trained with different types of negative samples uniformly outperform strong comparisons, including using a summarizer with post error correction and reranking beams based on entailment scores to the source. Moreover, compared with the unlikelihood training method that penalizes the same negative samples (Welleck et al., 2020), our summaries also obtain consistently better QuestEval scores. Human evaluation further confirms that our models consistently reduce both extrinsic and intrinsic errors over baseline across datasets.

1 Our code and annotated data are available at https://shuyangcao.github.io/projects/cliff_summ.

Related Work
Factuality Improvement and Evaluation. Neural abstractive summaries often contain unfaithful content with regard to the source (Falke et al., 2019). To improve summary factuality, three major types of approaches are proposed. First, a separate correction model is learned to fix errors made by the summarizers (Zhao et al., 2020; Chen et al., 2021), including replacing entities absent from the source (Dong et al., 2020) or revising all possible errors (Cao et al., 2020). The second type aims at modifying the sequence-to-sequence architecture to incorporate relation triplets (Cao et al., 2018), knowledge graphs (Zhu et al., 2021), and topics (Aralikatte et al., 2021) to inform the summarizers of article facts. Yet additional engineering efforts and model retraining are often needed. Finally, discarding noisy samples from model training has also been investigated (Nan et al., 2021; Goyal and Durrett, 2021); however, it often leads to degraded summary informativeness. In comparison, our contrastive learning framework allows the model to be end-to-end trained and does not require model modification, thus providing a general solution for learning summarization systems.
Alongside improving factuality, we have also witnessed growing interest in automated factuality evaluation, since popular word-matching-based metrics, e.g., ROUGE, correlate poorly with human-rated factual consistency levels (Gabriel et al., 2021; Fabbri et al., 2021). Entailment-based scorers are designed at summary level (Kryscinski et al., 2020) and finer-grained dependency relation level (Goyal and Durrett, 2020). QA models are employed to measure content consistency by reading the articles to answer questions generated from the summaries (Wang et al., 2020; Durmus et al., 2020), or considering the summaries for addressing questions derived from the source (Scialom et al., 2019). Though not focusing on evaluation, our work highlights that models can produce a significant amount of world knowledge which should be evaluated differently instead of as extrinsic hallucination (Maynez et al., 2020). We also show that world knowledge can possibly be distinguished from errors via model behavior understanding.
Training with negative samples has been investigated in several classic NLP tasks, such as grammatical error detection (Foster and Andersen, 2009) and dialogue systems (Li et al., 2019). Notably, negative sampling plays a key role in word representation learning (Mikolov et al., 2013) and training large masked language models, such as BERT and ALBERT, to induce better contextual representations (Devlin et al., 2019;Lan et al., 2020). For text generation tasks, unlikelihood training is proposed to penalize the generation of negative tokens (e.g., repeated words) and sentences (e.g., contradictory responses in a dialogue system) (Welleck et al., 2020;Li et al., 2020;He and Glass, 2020). We use contrastive learning that drives enhanced representation learning to better distinguish between factual and incorrect summaries, which encourages more faithful summary generation.
Contrastive Learning (CL) for NLP. CL has been a popular method for representation learning, especially for vision understanding (Hjelm et al., 2019; Chen et al., 2020). Only recently has CL been used for training language models with self-supervision (Fang et al., 2020), learning sentence representations (Gao et al., 2021), and improving document clustering (Zhang et al., 2021a). With a supervised setup, Gunel et al. (2021) adopt the contrastive objective to fine-tune pre-trained models on benchmark language understanding datasets. Using a similar idea, Liu and Liu (2021) enlarge the distances among summaries of different quality as measured by ROUGE scores.

CLIFF: Contrastive Learning Framework for Summarization
We design a contrastive learning (CL)-based training objective that drives the summarization model to learn a preference of faithful summaries over summaries with factual errors. It is then used for fine-tuning BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020a) for training summarization models. Formally, let an article $x$ have a set of reference summaries $P$ (henceforth positive samples) and another set of erroneous summaries $N$ (negative samples). The contrastive learning objective is (Khosla et al., 2020; Gunel et al., 2021):

$$l_{cl}^{x} = -\frac{1}{\binom{|P|}{2}} \sum_{\substack{y_i, y_j \in P \\ y_j \neq y_i}} \log \frac{\exp(\mathrm{sim}(h_i, h_j)/\tau)}{\sum_{\substack{y_k \in P \cup N \\ y_k \neq y_i}} \exp(\mathrm{sim}(h_i, h_k)/\tau)}$$

where $h_i$, $h_j$, and $h_k$ are representations for summaries $y_i$, $y_j$, and $y_k$. $\mathrm{sim}(\cdot, \cdot)$ calculates the cosine similarity between summary representations. $\tau$ is a temperature and is set to 1.0. Importantly, summaries in $P$ and $N$ are included in the same batch during training, so that the model acquires better representations to differentiate correct summaries from those with errors by comparing the two types of samples, thus maximizing the probabilities of the positive samples and minimizing the likelihoods of the corresponding negative samples. The CL objective on the full training set, denoted as $\mathcal{L}_{CL}$, is the sum of losses $l_{cl}^{x}$ over all samples.
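To make the objective concrete, the following pure-Python sketch computes the contrastive loss for one article from toy representation vectors. It is an illustration under stated assumptions rather than the released implementation: the function names and toy vectors are hypothetical, and dividing by the number of unordered positive pairs is an assumed normalization.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(positives, negatives, tau=1.0):
    """Contrastive loss for one article (hypothetical helper).

    positives / negatives: lists of summary representation vectors.
    Normalizing by the number of unordered positive pairs is an assumption.
    """
    if len(positives) < 2:
        raise ValueError("at least two positive samples are required")
    candidates = positives + negatives  # positives keep indices 0..|P|-1
    total = 0.0
    for i, j in permutations(range(len(positives)), 2):  # ordered pairs, i != j
        numerator = math.exp(cosine(positives[i], positives[j]) / tau)
        denominator = sum(
            math.exp(cosine(positives[i], h_k) / tau)
            for k, h_k in enumerate(candidates)
            if k != i  # exclude the anchor itself
        )
        total -= math.log(numerator / denominator)
    return total / math.comb(len(positives), 2)


# Toy 2-d representations: two positives and one negative per setting.
positives = [[1.0, 0.0], [1.0, 0.1]]
far_negative = [[-1.0, 0.0]]    # easy to tell apart from the positives
near_negative = [[1.0, 0.05]]   # almost identical to the positives
loss_far = contrastive_loss(positives, far_negative)
loss_near = contrastive_loss(positives, near_negative)
```

In this toy setting the loss is larger when the negative sits close to the positives, which is the pressure that pushes erroneous summaries away from faithful ones in representation space.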
To effectively employ CL in summarization, we need to address two challenges: (1) how to automatically construct both positive and negative samples, which are critical for CL efficacy (Chen et al., 2020), and (2) how to represent the summaries (i.e., $h_*$). Below we describe positive sample generation and options for $h_*$, leaving the strategies for negative samples to § 5.
Positive Sample Construction ($P$). Summarization datasets often contain a single reference for each article. To create multiple positive samples, in our pilot study, we experiment with paraphrasing with synonym substitution (Ren et al., 2019), randomly replacing words based on the prediction of masked language models (Kobayashi, 2018), and back-translation (Mallinson et al., 2017). We find back-translation to be best at preserving meaning and offering language variation, and thus use NLPAug to translate each reference to German and back to English. Together with the reference, the best translation is kept and added to $P$, if no new named entity is introduced.
Summary Representation ($h_*$). We use the outputs of the decoder's last layer, and investigate three options that average over all tokens, named entity tokens, and the last token of the decoded summary. Entities and other parsing results are obtained by spaCy (Honnibal et al., 2020). We further consider adding a multi-layer perceptron (MLP) with one hidden layer to calculate the final $h_*$.
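The three pooling options can be sketched as follows. This is a minimal illustration on plain Python lists; the function name, the `entity_mask` argument, and the fallback when no entity token exists are assumptions for the example, and the optional MLP projection is omitted.

```python
def pool_summary(hidden_states, mode="all", entity_mask=None):
    """Pool per-token decoder outputs into one summary vector (hypothetical helper).

    hidden_states: list of per-token vectors from the decoder's last layer.
    mode: "all" (average over all tokens), "entity" (average over named
          entity tokens), or "last" (last token of the decoded summary).
    entity_mask: list of 0/1 flags marking entity tokens, used by "entity".
    """
    if mode == "all":
        chosen = hidden_states
    elif mode == "entity":
        chosen = [h for h, flag in zip(hidden_states, entity_mask) if flag]
        if not chosen:
            # Assumption: fall back to all tokens when no entity is present.
            chosen = hidden_states
    elif mode == "last":
        chosen = [hidden_states[-1]]
    else:
        raise ValueError(f"unknown pooling mode: {mode}")
    dim = len(hidden_states[0])
    return [sum(h[d] for h in chosen) / len(chosen) for d in range(dim)]


# Three tokens with 2-d hidden states; the last two tokens form an entity.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
h_all = pool_summary(states, mode="all")
h_ent = pool_summary(states, mode="entity", entity_mask=[0, 1, 1])
h_last = pool_summary(states, mode="last")
```

Averaging over all tokens gives every part of the summary a say in the representation, while the other two options concentrate on the tokens most likely to carry a factual error.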
The final training objective combines the typical cross-entropy loss $\mathcal{L}_{CE}$ and our contrastive learning objective: $\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{CL}$, where $\lambda$ is a scalar and set to 1.0 for all experiments.

Summary Error Annotation and Model Behavior Analysis
We first describe annotating unfaithfulness errors by state-of-the-art models, i.e., models fine-tuned from BART and PEGASUS on XSum and CNN/DM. We then probe into the model generation behavior that is indicative of errors, which guides the design of negative sample construction strategies. In total, 600 (150 × 2 × 2) summaries are annotated to demonstrate how often the models "hallucinate", i.e., generate content not grounded by the source. To characterize errors, we annotate text spans in summaries with (i) intrinsic errors caused by misconstructing phrases or clauses from the source; and (ii) extrinsic errors which include words not in the source that are either unverifiable or cannot be verified by Wikipedia. Content that is not covered by the article but can be validated by Wikipedia is annotated as world knowledge, and the models' behavior pattern when generating it differs from when they generate errors.
Two fluent English speakers with extensive experience in summary evaluation and error labeling are hired. For each sample, they are shown the article and two system summaries, and instructed to annotate text spans with the aforementioned errors and world knowledge. After labeling every 50 samples, the annotators discuss and resolve any disagreement. The Fleiss's Kappas on XSum and CNN/DM are 0.35 and 0.45.
Error statistics are displayed in Fig. 2. Extrinsic errors dominate both datasets, especially on XSum. 58.7% of summaries by BART (and 44.0% by PEGASUS) contain at least one extrinsic error. Noticeably, PEGASUS is a newer model pre-trained with a larger amount of data, and thus produces fewer errors than BART and other older models studied for error annotations by Maynez et al. (2020). This observation also highlights the usage of our annotations for future development and evaluation of summarization models.
Low confidence generation is indicative of extrinsic errors. Inspired by recent work that studies model prediction confidence (Liu et al., 2021), we examine generation probabilities for tokens of different part-of-speech (POS) tags. Fig. 3 shows salient results on the generation probabilities of the first token of a proper noun or a number (with additional analysis provided in Appendix A). As observed, model confidence tends to be lower for the first tokens of proper nouns and numbers if they are part of spans with extrinsic errors. Also note that world knowledge, which cannot be inferred from the source either, often has higher generation probability than extrinsic errors. Take this snippet generated by a fine-tuned BART as an example: "Manchester United captain Wayne Rooney's testimonial game against Manchester City. . .". "Manchester City" is an extrinsic error and "Wayne" is produced as world knowledge. The model assigns a low probability of 0.10 to the first token of "Manchester City" and a high probability of 0.92 to token "Wayne". This implies that model confidence can be a useful indicator for negative sample collection.

Negative Sample Construction
Here we describe four strategies for constructing negative samples that modify the references ( § 5.1-5.3) or use system generated summaries (5.4).

Entity Swap
Entity swap imitates intrinsic errors, as over 55% of intrinsic errors in our annotations are found to contain named entities. We construct negative samples by swapping named entities in the references with other randomly selected entities of the same entity type in the source (SWAPENT). Examples of samples constructed by each strategy are shown in Table 1. SWAPENT has the advantage of not depending on any trained model. Yet it only introduces intrinsic errors and lacks coverage of extrinsic errors, which is addressed by the following generation-based methods.

REFERENCE: A "rare" short-eared owl found emaciated in Flintshire is now recuperating well, the RSPCA have said.
SWAPENT: Flintshire → Bettisfield ⇒ A "rare" short-eared owl found emaciated in Bettisfield is now recuperating well, the RSPCA have said.
MASKENT: A "rare" short-eared owl found emaciated in [MASK] is now recuperating well, the RSPCA have said. ⇒ A "rare" short-eared owl found emaciated in a field in South Yorkshire is now recuperating well, the RSPCA have said.
MASKREL: A "rare" short-eared owl found [MASK] in [MASK] is now recuperating well, the RSPCA have said. ⇒ A "rare" short-eared owl found dead in London is now recuperating well, the RSPCA have said.
REGENENT: A "rare" short-eared owl found emaciated in ⇒ A "rare" short-eared owl found emaciated in Nottinghamshire is now at a wildlife centre to recover.
REGENREL: A "rare" short-eared owl found ⇒ A "rare" short-eared owl found in the grounds of a former coal mine is being cared for by the RSPCA in Somerset.
SYSLOWCON: An injured golden owl found in a former coal mine in Lancashire is being cared for by the RSPCA.
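The entity swap can be sketched in a few lines. The example below assumes that entities and their types have already been extracted (for instance by an NER tagger) and that one negative sample is produced per reference entity; both the function name and these choices are assumptions for illustration.

```python
import random


def swap_entities(reference, ref_entities, source_entities, rng=None):
    """Build SWAPENT-style negatives (hypothetical helper).

    ref_entities / source_entities: lists of (text, entity_type) pairs.
    Assumption: one negative sample is created per reference entity.
    """
    rng = rng if rng is not None else random.Random()
    negatives = []
    for text, entity_type in ref_entities:
        # Candidates: source entities of the same type that differ from the original.
        candidates = sorted(
            {s for s, t in source_entities if t == entity_type and s != text}
        )
        if not candidates:
            continue  # nothing of the same type to swap in
        replacement = rng.choice(candidates)
        negatives.append(reference.replace(text, replacement, 1))
    return negatives


reference = ('A "rare" short-eared owl found emaciated in Flintshire '
             'is now recuperating well, the RSPCA have said.')
ref_entities = [("Flintshire", "GPE"), ("RSPCA", "ORG")]
source_entities = [("Flintshire", "GPE"), ("Bettisfield", "GPE"), ("RSPCA", "ORG")]
negatives = swap_entities(reference, ref_entities, source_entities, random.Random(0))
```

With these toy inputs the only available swap is Flintshire to Bettisfield, and no sample is created for RSPCA because the source offers no other organization.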

Mask-and-fill with BART
To simulate extrinsic errors, we leverage large unconditional language models' capability of converting a sequence with masked tokens into a fluent and appropriate sequence. Specifically, we replace each named entity in a reference with a [MASK] token and encode it with BART (without any fine-tuning). BART then fills this partially masked summary with newly generated entities (MASKENT). BART is chosen since it can fill [MASK] with a varying number of tokens. For each entity in the reference, we sample three summaries and only retain the ones containing at least one entity that is absent from both the source and the reference.
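The masking and filtering steps around the BART infilling call can be sketched as below. The infilling itself is not shown; the helper names are hypothetical, entities are assumed to be pre-extracted strings, and masking one entity per input is an assumption that matches the MASKENT example in Table 1.

```python
def mask_each_entity(reference, ref_entities, mask_token="[MASK]"):
    """Return one masked input per reference entity (assumed setup)."""
    return [reference.replace(entity, mask_token, 1) for entity in ref_entities]


def keep_filled_sample(sample_entities, source_entities, ref_entities):
    """Keep a filled summary only if it contains at least one entity that is
    absent from both the source and the reference."""
    known = set(source_entities) | set(ref_entities)
    return any(entity not in known for entity in sample_entities)


reference = ('A "rare" short-eared owl found emaciated in Flintshire '
             'is now recuperating well, the RSPCA have said.')
masked_inputs = mask_each_entity(reference, ["Flintshire", "RSPCA"])

source_entities = ["Flintshire", "RSPCA"]
ref_entities = ["Flintshire", "RSPCA"]
# Entities found in two hypothetical infilled summaries.
kept = keep_filled_sample(["South Yorkshire", "RSPCA"], source_entities, ref_entities)
dropped = keep_filled_sample(["Flintshire", "RSPCA"], source_entities, ref_entities)
```

The filter discards infilled summaries that merely reproduce known entities, since those would not be negative samples at all.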
Up to now, the two introduced strategies both focus on incorrect named entities. To cover more diverse extrinsic and intrinsic errors (Goyal and Durrett, 2020), we extend MASKENT to contain relations (MASKREL). We first obtain dependency relations using Stanza (Qi et al., 2020), with each relation denoted as <gov, rel, dep>. To incorporate more context, we consider noun phrase spans enclosing the token of gov or dep if it is a content word and the noun phrase contains a named entity. Similar to MASKENT, three negative samples are generated by BART based on the input with both gov and dep spans masked in the reference. Only the samples that introduce a new dependency relation contained in neither the source nor the reference are kept. Specifically, we consider a dependency relation as matched if the same form or synonyms of its gov and dep are found in the source or the reference with the same relation.
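The relation-level filter can be illustrated with relations given as (gov, rel, dep) triples. This is a sketch under assumptions: the function names are hypothetical, word forms are compared after lowercasing, and synonyms come from a small hand-written dictionary rather than a lexical resource.

```python
def same_word(a, b, synonyms):
    """True if two words share a form or are listed as synonyms."""
    a, b = a.lower(), b.lower()
    return a == b or b in synonyms.get(a, set()) or a in synonyms.get(b, set())


def relations_match(rel, other, synonyms):
    """Two <gov, rel, dep> triples match if the relation label is identical
    and both gov and dep match by form or synonym."""
    return (
        rel[1] == other[1]
        and same_word(rel[0], other[0], synonyms)
        and same_word(rel[2], other[2], synonyms)
    )


def introduces_new_relation(sample_rels, source_rels, ref_rels, synonyms=None):
    """Keep a sample if any of its relations matches nothing in the source
    or the reference."""
    synonyms = synonyms or {}
    known = list(source_rels) + list(ref_rels)
    return any(
        not any(relations_match(rel, k, synonyms) for k in known)
        for rel in sample_rels
    )


source_rels = [("found", "obl", "Flintshire")]
ref_rels = [("found", "obl", "Flintshire")]
toy_synonyms = {"found": {"discovered"}}  # hypothetical synonym table

new_rel = introduces_new_relation([("found", "obl", "London")], source_rels, ref_rels)
old_rel = introduces_new_relation(
    [("discovered", "obl", "Flintshire")], source_rels, ref_rels, toy_synonyms
)
```

Matching through synonyms prevents harmless paraphrases from being counted as new relations, so only samples that change the content are kept.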
Both MASKENT and MASKREL can create more extrinsic errors compared to other strategies introduced in this section, since negative samples are generated without being grounded on the source articles. However, their constructed negative samples may contain drifted topics that can be easily detected by a summarization model, resulting in less efficient training signals.

Source-conditioned Regeneration
To ground negative sample generation with the article, we further design a regeneration strategy based on conditional generation. For each named entity in the reference, we treat the text before it as a prompt. A summarizer, e.g., fine-tuned from BART or PEGASUS, first reads in the source using the encoder, then receives the prompt as the first part of the decoder output, and finally decodes the rest of the content based on nucleus sampling (Holtzman et al., 2020) with a cumulative probability threshold of 0.7. The prompt and the regenerated text comprise the final negative sample. This method is denoted as REGENENT.
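The sampling step can be illustrated for a single decoding position. The sketch below applies nucleus sampling with a threshold of 0.7 to a toy next-token distribution; the function names and the probabilities are hypothetical, and a real summarizer would repeat this step token by token after the prompt.

```python
import random


def nucleus(probs, top_p=0.7):
    """Smallest set of most probable tokens whose cumulative probability
    reaches top_p. probs maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


def sample_next_token(probs, top_p=0.7, rng=None):
    """Sample one token from the nucleus, in proportion to its probability."""
    rng = rng if rng is not None else random.Random()
    tokens, weights = zip(*nucleus(probs, top_p))
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy distribution for the token following the prompt "... found emaciated in".
next_token_probs = {
    "Nottinghamshire": 0.5,
    "Flintshire": 0.3,
    "London": 0.15,
    "Paris": 0.05,
}
kept_tokens = [token for token, _ in nucleus(next_token_probs)]
sampled = sample_next_token(next_token_probs, rng=random.Random(0))
```

Truncating the distribution keeps the regenerated text plausible for the article while still allowing it to diverge from the reference.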
We also extend entities to relations with expanded governor and dependent spans (REGENREL). Here, we consider a prompt as the text before the gov or dep span, whichever occurs first. For both REGENENT and REGENREL, three negative samples are generated for each prompt, and a sample is kept if it introduces any new entity (for REGENENT) or dependency relation (for REGENREL) with regard to the source and the reference.
Negative samples generated by both methods are more relevant to the article than the mask-and-fill strategy, yet they may still miss certain types of errors and differ from real model outputs, since they are modified from the reference summaries.

System Generation
Motivated by the model confidence analysis in § 4, we explore using system generated summaries as negative samples. We first run fine-tuned BART or PEGASUS on the same training set to decode summaries. For each summary, we check the model confidence on the first token of each proper noun and number span. If the probability is below a threshold, we keep it as a negative sample (SYSLOWCON). The threshold is tuned by maximizing F1 based on our error annotations.
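The selection rule and the threshold tuning can be sketched as follows. The helper names, the toy probabilities, and the candidate thresholds are hypothetical; treating a summary as low confidence when any checked span falls below the threshold is also an assumption about how the rule is applied.

```python
def f1_score(predictions, labels):
    """F1 for binary predictions against binary labels."""
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(first_token_probs, is_error, candidates):
    """Pick the threshold whose 'probability below threshold => error' rule
    maximizes F1 on annotated spans."""
    return max(
        candidates,
        key=lambda t: f1_score([p < t for p in first_token_probs], is_error),
    )


def is_low_confidence(summary_span_probs, threshold):
    """Assumption: a summary is kept as a negative sample if the first token
    of any proper noun or number span falls below the threshold."""
    return any(p < threshold for p in summary_span_probs)


# Toy annotated spans: first-token probability and whether the span is an error.
span_probs = [0.10, 0.92, 0.30, 0.85]
span_is_error = [True, False, True, False]
threshold = tune_threshold(span_probs, span_is_error, [0.05, 0.2, 0.5, 0.95])
negative = is_low_confidence([0.10, 0.92], threshold)
```

On these toy annotations the tuned threshold separates the two error spans from the two correct ones, so a summary containing the 0.10-probability span is selected as a negative sample.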
We consider all beams at the last decoding step as candidates. We use beam sizes of 6 and 4 for XSum and CNN/DM. Statistics of negative samples constructed by different strategies are in Appendix B.

Experiment Setup
Evaluation Metrics. QuestEval (Scialom et al., 2021) is used as the main metric to evaluate summaries' factual consistency. Given an article and a summary, QuestEval first generates natural language questions for entities and nouns from both. A QA model then consumes the article to answer questions derived from the summary, producing a score. Another score is obtained from a QA model addressing article-based questions after reading the summary. The final QuestEval score is the harmonic mean of the two. We use the version with learned weights for questions, which has shown high correlation with human judged consistency and relevance.
We further use FactCC (Kryscinski et al., 2020), trained based on their negative sample construction method, to measure if the summary can be entailed by the source. We also report ROUGE-L (Lin, 2004). Both FactCC and ROUGE-L reasonably correlate with summary factuality as judged by humans (Pagnoni et al., 2021).
Based on our error annotations, we report the correlations between each metric and both the error rate (the percentage of tokens being part of an error span) and the raw number of errors (Table 2). QuestEval correlates better on both aspects than other metrics.
Comparisons. In addition to the models fine-tuned with cross-entropy loss (CRSENTROPY), we consider reranking beams based on FactCC [...]

[Table 2: per-metric correlations with % of Err and # of Err on XSum and CNN/DM.]

[...] (Li et al., 2020). Given a negative sample $y'$, the unlikelihood loss is defined as $-\sum_{t=1}^{|y'|} \log\left(1 - p(y'_t \mid y'_{1:t-1}, x)\right)$, where $p(y'_t \mid y'_{1:t-1}, x)$ is the output probability at the $t$-th step. We combine the unlikelihood training objective with cross-entropy loss with equal weights for fine-tuning.
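As a small worked example of the unlikelihood comparison, the sketch below evaluates the loss on per-token probabilities and adds it to cross-entropy with equal weights. The function names and the clipping constant `eps` are assumptions introduced for numerical safety in this illustration.

```python
import math


def cross_entropy(token_probs):
    """Negative log-likelihood of the reference tokens."""
    return -sum(math.log(p) for p in token_probs)


def unlikelihood(token_probs, eps=1e-8):
    """-sum_t log(1 - p_t) over the tokens of a negative sample.
    eps (an assumption) keeps the logarithm finite when p_t approaches 1."""
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs)


def combined_loss(reference_probs, negative_probs):
    """Equal-weight combination of cross-entropy and unlikelihood."""
    return cross_entropy(reference_probs) + unlikelihood(negative_probs)


# Probabilities the model assigns to each token of a negative sample.
confident_negative = unlikelihood([0.9, 0.9])   # model likes the wrong summary
unlikely_negative = unlikelihood([0.1, 0.1])    # model already rejects it
total = combined_loss([0.5], [0.5])
```

The penalty grows as the model assigns higher probability to the tokens of a negative sample, which is the only signal this objective provides; unlike the contrastive objective, it does not compare representations of positive and negative summaries.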
Lastly, we compare our negative sample strategies with negative samples constructed for training the FactCC scorer, denoted as FCSAMPLE. For CL only, we compare with using other samples in the same batch as negative samples (BATCH), a common practice for CL-based representation learning (Gao et al., 2021;Zhang et al., 2021a).

Automatic Evaluation
We report results by models fine-tuned from BART and PEGASUS with different objectives and negative samples on XSum and CNN/DM in Tables 3  and 4. CLIFF models use a summary representation of averaging over all tokens with MLP projection, with other variants discussed in § 7.3. Unless explicitly stated, comparison models are fine-tuned from the same large model used by CLIFF.
First, comparing with other factuality improvement models (top of the tables), almost all CLIFF models trained with different negative samples uniformly produce higher QuestEval scores across datasets with both large models, with the improvements more pronounced on XSum. Importantly, ROUGE scores for CLIFF models are comparable to or better than baselines trained with cross-entropy, e.g., on CNN/DM as in Table 3. A similar trend is observed with the FactCC metric, especially when using PEGASUS as the base model (Table 4). Note that ENTAILRANK tends to yield significantly higher FactCC scores, though it obtains lower QuestEval scores than the cross-entropy baseline. Human inspection finds that ENTAILRANK can pick up beams with peculiar words of high FactCC scores, without improving factuality. Moreover, other comparisons based on post CORRECTION and model engineering (FASUM) only offer incremental gains. The sample selection-based method, SUBSETFT, sacrifices ROUGE scores significantly. Overall, CLIFF demonstrates stronger generalizability.

Table 3: Results of models fine-tuned from BART on XSum and CNN/DM. QEval: QuestEval; FC: FactCC; R-L: ROUGE-L. The best result per metric per dataset is bolded. For models of unlikelihood training and CLIFF that use the same negative samples, the better of the two is highlighted with green. * : our model is significantly better than CRSENTROPY (approximation randomization test, p < 0.005).

Second, CLIFF is more effective and robust than unlikelihood training with the same negative samples. According to Table 3, using 7 negative sample construction strategies on two datasets, CLIFF obtains higher QuestEval scores than unlikelihood training in 12 out of the 14 comparisons. Using PEGASUS, CLIFF also outperforms in 11 setups as listed in Table 4. Similar trends are found on FactCC and ROUGE-L. Another noteworthy piece is that CLIFF's improvements over the cross-entropy baseline are more consistent, whereas unlikelihood training occasionally hurts factuality or ROUGE scores significantly. We believe the key advantage of CLIFF resides in its measure of representation similarities between positive and negative samples in the same batch, allowing models to better differentiate between correct and erroneous summaries.

Table 4: Results of models fine-tuned from PEGASUS on XSum and CNN/DM. We report results on 5,000 randomly selected samples on CNN/DM, due to long running time of QuestEval. For models of unlikelihood training and CLIFF that use the same negative samples, the better of the two is highlighted with green. * : our model is significantly better than CRSENTROPY (approximation randomization test, p < 0.005).

Finally, among all variants, CLIFF trained with low confidence summaries as negative samples obtains the best QuestEval scores on the more abstractive dataset. As seen in Table 3, using low confidence summaries also improves FactCC scores on both datasets, and enhances ROUGE-L on the more  extractive dataset CNN/DM. This indicates that system generated summaries contribute more diverse errors made by existing models organically, which are particularly suitable for our CL framework. As we use summaries generated by the same model for CLIFF training, one future direction is to use outputs by different models. For our mask-and-fill and source-conditioned regeneration strategies, we find that relation-anchored construction often beats their entity-anchored counterparts. This calls for efforts that steer the entity-driven methods to a more relation-focused direction.
Combining Strategies. We further show results by fine-tuning BART using samples based on combined negative sample construction strategies in Table 5. As can be seen, combining SYSLOWCON and other strategies yields better QuestEval scores than models trained with negative samples by any single strategy, except for MASKENT and REGENENT on XSum. This signifies the importance of covering diverse types of errors in negative samples.

Human Evaluation
Pairwise Comparison with Cross-entropy. We recruit the two human annotators from our summary error study, as well as another experienced annotator, to evaluate summary informativeness and factual consistency. For each article, the judges are shown summaries generated by the CRSENTROPY model and four other systems. They then rate each system summary against the CRSENTROPY summary. All four summaries generated by different factuality-improved models are shown in random order without system names, ensuring a fair comparison among them. We randomly pick 100 articles from each dataset used in our error analysis study in § 4, and evaluate summaries generated by ENTAILRANK, unlikelihood training (ULL) with negative samples constructed by MASKENT, and CLIFF models trained with BATCH and SYSLOWCON negative samples. All are fine-tuned from BART. Detailed evaluation guidelines are in Appendix D.

Table 6: Percentages of summaries that are better than, tied with, or worse than CRSENTROPY, in informativeness (Inform.) and factual consistency (Factual.). The Krippendorff's αs are 0.33 and 0.62 for the two aspects on XSum, and 0.34 and 0.89 on CNN/DM. Our CL method using low confidence summaries is more frequently rated as better for informativeness and factuality on the more abstractive dataset XSum.

Table 6 shows that on the more abstractive XSum data, summaries by CL trained with low confidence samples are more frequently rated as being more informative and more factual than CRSENTROPY summaries. This echoes our automatic evaluations with QuestEval in § 7.1. On CNN/DM, all models trained with negative samples produce summaries with better informativeness and faithfulness. In contrast, ENTAILRANK summaries are less distinguishable from outputs by CRSENTROPY on both datasets, as more ties are found. We show sample outputs in [...] extrinsic errors as done in § 4. Fig. 4 shows that CL is more effective at reducing extrinsic errors than unlikelihood training on both datasets. We also observe slight decreases of world knowledge in the summaries (figure attached in Appendix D).
Error Correction Operations. Finally, with reference to CRSENTROPY summaries, human judges are instructed to label each system summary as whether it corrects any error by CRSENTROPY using deletion of the incorrect content, substitution with factual information, or both. As seen in Fig. 5, CL-based models restore factually consistent information, e.g., by replacing erroneous names and numbers with correct ones, more frequently than unlikelihood training or entailment reranking.

Variants of Summary Representation
Sample representation is critical for CL to be effective. Here we investigate summary representation variants as discussed in § 3. There are two major considerations: (1) Should we consider all tokens in a summary or only representative ones (e.g., entities or last token)? (2) Should additional transformation, i.e., an MLP, be used?
Experiments on XSum using three negative sample construction strategies demonstrate that averaging the decoder outputs of all tokens and adding an MLP projection yield the best overall performance, as shown in Table 7. The implications are at least two-fold. First, even for entity- or relation-triggered sample modifications, using more global context helps with CL training. Second, additional transformation can help avoid model degeneration. For instance, more nonsensical and repetitive content is produced by variants without MLP.

Conclusion
We present CLIFF, a contrastive learning-based framework to promote faithfulness and factuality of abstractive summaries. CLIFF uses both references and summaries that are factually inconsistent with the articles to train systems to be better at discriminating errors from factual and salient content. We further study strategies that automatically create erroneous summaries by editing references or leveraging system outputs, inspired by our new summary error analysis on state-of-the-art models. Both automatic evaluation and human ratings show that CLIFF achieves consistent improvements over competitive comparison methods, and is generalizable across datasets with systems fine-tuned from different large models.

A Additional Analysis for Summary Error Annotation
We hire two fluent English speakers to annotate summary errors on XSum and CNN/DailyMail (CNN/DM). They annotate a common batch of 100 summaries generated by summarizers fine-tuned from BART and PEGASUS, with 50 articles in each batch. The two annotators are shown 50 HTML pages in a batch, each of which contains an article and two summaries generated by the two models. The detailed annotation guideline is given in Fig. 9. For our analysis on token generation probabilities, we additionally show the distributions of the first token's probability for nouns and verbs in Fig. 6. We also report the distributions of the non-first token's probability for proper nouns, numbers, nouns, and verbs in Fig. 7. As can be seen, tokens within extrinsic and intrinsic errors have high generation probabilities when they are non-first tokens.

B Statistics for Datasets and Training Samples
Summarization Datasets. We follow the official data splits for the two datasets, with the number of samples in each split listed in Table 8. Table 9. SYSLOWCON constructs the fewest negative samples in total, yet it achieves the best results as reported in our main paper (§ 7.1), indicating that its negative samples are more effective for training.

D Human Evaluation
In § 7.2, we demonstrate the percentages of samples containing intrinsic errors and extrinsic errors for each model evaluated by human judges. Here, we report the percentages of samples containing world knowledge in Fig. 8. On XSum, all models produce less world knowledge compared to the model trained with cross-entropy loss, while gen-

E Sample Outputs
We include more sample outputs in Fig. 11.
In this study, you will first read article-summary pairs and then identify three types of text spans in the summaries. These spans include content that is contradicted by or cannot be implied from the article. Each type is described below:
• Intrinsic errors: Text spans that misconstruct phrases or clauses from the article.
• Extrinsic errors: Text spans that include words that are not in the article and cannot be verified by Wikipedia.
• World knowledge: Text spans that contain information that is not covered by the article but can be validated by Wikipedia.
When selecting spans, you should always make sure the spans are complete words.
In practice, you should follow these steps carefully: (1) read the article and summaries carefully; (2) figure out if there is content contradicted by or not presented in the article; (3) label the span as an intrinsic error if it misconstructs phrases or clauses from the article; (4) if the span does not belong to intrinsic errors, search within Wikipedia and determine whether the content in the span can be verified; (5) label it as world knowledge if it can be verified by Wikipedia; otherwise, label it as an extrinsic error.
Example annotations 1
Article: Isis Academy in Oxford said it had rebranded as "Iffley Academy" to protect its "reputation, integrity and image". The name 'Isis' was originally chosen as the school is near to the section of the River Thames of the same name. Formerly Iffley Mead School, it became Isis Academy in 2013. A statement issued by the school said it had changed name following "the unforeseen rise of ISIS (also known as ISIL and the Islamic State) and related global media coverage of the activities of the group". "Our priority is to remove the detrimental impact which the name 'Isis' had on pupils, their families and our staff." Last year a language school in the city removed Isis from its name for the same reason. The Isis is the name given to the part of the River Thames above Iffley Lock in Oxford. It is also the name of the goddess wife of the god Osiris in Egyptian beliefs.
Summary: A school that was named after the Islamic State (IS) militant group has changed its name.
Explanation: "was named after" is an intrinsic error contradicted by the article sentence in bold.
Example annotations 2
Article: Khalil Dale, 60, was abducted in Quetta in January 2012 and was found dead on a roadside a few months later. He had been beheaded. A note next to his body said he was killed because a ransom had not been paid. Mr Dale was born in York but lived in Dumfries. He spent 30 years working in countries including Somalia, Afghanistan and Iraq. An inquest into his death was held at Chesterfield Coroners Court because he is buried in Derbyshire. The court heard that the Muslim convert, who was formerly known as Kenneth, worked as a humanitarian assistance relief worker. Following his abduction, negotiations were undertaken by the International Committee of the Red Cross with the help of the UK government. His body was found on 29 April 2012. The inquest was told that he died as a result of decapitation. Senior coroner Dr Robert Hunter concluded that Mr Dale was unlawfully killed while providing international humanitarian assistance.
Summary: A British aid worker was unlawfully killed by Islamist militants in Pakistan, an inquest has heard.
Explanation: "Islamist militants" is an extrinsic error as it cannot be found in or inferred from the article. The information is also not verifiable by Wikipedia. "Pakistan" is world knowledge as Quetta in the article is a city in Pakistan according to Wikipedia.
Figure 9: Guideline for our summary error annotation (§ 4).
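The decision procedure in steps (3) to (5) of the guideline in Fig. 9 can be sketched as a small function. This is only an illustrative sketch: the names `SpanLabel` and `label_span` are hypothetical, and the two boolean inputs stand in for judgments that annotators make manually (reading the article and searching Wikipedia), not for any automatic check.

```python
from enum import Enum


class SpanLabel(Enum):
    INTRINSIC = "intrinsic error"
    EXTRINSIC = "extrinsic error"
    WORLD_KNOWLEDGE = "world knowledge"


def label_span(misconstructs_article: bool, verifiable_by_wikipedia: bool) -> SpanLabel:
    """Label a span already judged as contradicted by or absent from the article.

    Both arguments are human judgments (hypothetical parameter names).
    """
    # Step (3): spans that misconstruct article phrases or clauses are intrinsic errors.
    if misconstructs_article:
        return SpanLabel.INTRINSIC
    # Steps (4)-(5): otherwise, the Wikipedia check decides between the remaining labels.
    if verifiable_by_wikipedia:
        return SpanLabel.WORLD_KNOWLEDGE
    return SpanLabel.EXTRINSIC


# The spans from the second example annotation:
pakistan = label_span(misconstructs_article=False, verifiable_by_wikipedia=True)
militants = label_span(misconstructs_article=False, verifiable_by_wikipedia=False)
```

The intrinsic check comes first, so the Wikipedia lookup is only consulted for spans that do not misconstruct the article.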
In this study, you will evaluate 100 sets of summaries produced by four systems. For each set, its corresponding article and a baseline summary are shown before the four system summaries. The errors in the baseline summary are highlighted. Please first read the article and the baseline summary and then compare each system summary against the baseline summary based on informativeness and factual consistency. In addition, please decide the operations made by the system to achieve better factual consistency. For informativeness and factual consistency, you need to label whether the system summary is better or worse than the baseline summary. You can also label the system summary as tying with the baseline summary. You need to consider two types of operations: deletions and substitutions. Please label the system summary as making deletions, substitutions, or both operations. Examples for the aspects and the operations are as follows.
Article: Alexys Brown, also known as Lexi, died at her home in Emmadale Close, Weymouth, on Thursday. An investigation is under way to discover how she became trapped. A post-mortem examination is due to be carried out this week. It was originally hoped the appeal would raise £2,000. Alison Record, who started the Just Giving appeal, said she was "heart broken" over the death. "Everybody by now has heard of the terrible tragedy the Brown family have suffered with the loss of their beautiful and beloved little girl Lexi," the appeal page reads. Many other comments have been posted on the appeal page. Steph Harris said: "Thinking of you all at this devastating time, fly high beautiful princess. Love Steph and family xxx" Lesley Andrews added: "No amount of money will take away the pain, but so much love comes with every penny. Take care. xx" Aster Group, the housing association responsible for managing the home, is assisting with the police investigation. The Health and Safety Executive (HSE) is also investigating. Dorset County Council said it had not installed the disabled lift at the property.
Baseline Summary: An appeal to raise 10,000 pounds for the family of a three-year-old girl who died after becoming trapped in a lift has raised more than 20,000 pounds.
Informativeness: Whether the summary captures salient content from the input article. Note that incorrect content should be considered as invalid.
Win. An appeal to raise money for the family of a three-year-old girl who died after getting stuck in a lift was originally hoped for raising £2,000. The target amount of the appeal is salient information.
Tie. An appeal to raise money for the family of a girl who died after getting stuck in a lift has raised more than £20,000. Compared to the baseline, missing incorrect information does not affect the informativeness.
Lose. An appeal to raise money for the family of a three-year-old girl has raised more than £20,000. This system summary does not mention the death of the girl, which is a salient content of the article.
Factual Consistency: Whether the summary is factually correctly based on the article and knowledge from Wikipedia.
Win. An appeal has been set up for the family of an eight-year-old girl who died after becoming trapped in a lift at her Dorset home. This system summary does not generate the incorrect numbers of money.
Tie. An appeal to raise 5,000 pounds for the family of a seven-year-old girl who died after becoming trapped in a lift has raised more than 20,000 pounds. This system summary makes similar errors to the baseline.
Lose. The family of an eight-year-old girl who died after becoming trapped in a lift at her Dorset home have set a fundraising target of 10,000 pounds. This system summary fabricates an event, "The family have set a fundraising target", which is more severe than errors in modifiers.
Deletion: The incorrect content in the baseline summary is deleted.
- An appeal for the family of a three-year-old girl who died after becoming trapped in a lift has raised more than 20,000 pounds. The error "10,000 pounds" is deleted.
Substitution: The incorrect content in the baseline summary is replaced with the correct one.
- An appeal to raise 2,000 pounds for the family of a three-year-old girl who died after becoming trapped in a lift has raised more than 20,000 pounds. The error "10,000 pounds" is substituted with "2,000 pounds", which is the correct information.
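Labels collected under this guideline can be aggregated into per-aspect percentages of wins, ties, and losses against the baseline. The sketch below shows one simple way to do this. The helper name `label_rates` and the toy judgments are hypothetical and do not correspond to any reported numbers.

```python
from collections import Counter


def label_rates(labels, categories):
    """Return the percentage of judgments that fall into each category."""
    if not labels:
        raise ValueError("no labels to aggregate")
    counts = Counter(labels)
    unknown = set(counts) - set(categories)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return {c: 100.0 * counts[c] / len(labels) for c in categories}


# Toy judgments for one system on the factual consistency aspect (made-up data).
factuality = ["win", "tie", "win", "lose"]
rates = label_rates(factuality, ("win", "tie", "lose"))

# The same helper applies to the error correction operations.
operations = ["deletion", "substitution", "both", "substitution"]
operation_rates = label_rates(operations, ("deletion", "substitution", "both"))
```

Rejecting unexpected labels makes annotation typos visible instead of silently dropping them from the totals.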