Improving Factuality of Abstractive Summarization via Contrastive Reward Learning

Modern abstractive summarization models often generate summaries that contain hallucinated or contradictory information. In this paper, we propose a simple but effective contrastive learning framework that incorporates recent developments in reward learning and factuality metrics. Empirical studies demonstrate that the proposed framework enables summarization models to learn from the feedback of factuality metrics through contrastive reward learning, yielding summaries that human evaluations judge to be more factual. This suggests that further advances in learning and evaluation algorithms can feed directly into producing more factual summaries. Code and human evaluation results will be publicly available at \url{https://github.com/EthanC111/factuality_summarization}.


Introduction
One major challenge in current abstractive summarization models is how to generate more factual summaries that are consistent with the source text (Li et al., 2022). Various approaches have been proposed to address this challenge, including augmenting the model input (Dou et al., 2021), performing post-processing (Dong et al., 2020; Cao et al., 2020), and modifying the learning algorithms (Cao and Wang, 2021; Liu et al., 2021). In particular, learning-based methods possess the advantage of not requiring modification to the existing model architecture or the addition of new modules.
In the meantime, with growing interest in aligning learning objectives with evaluation criteria, utilizing feedback from automatic evaluation metrics (Liu et al., 2022) or human preferences (Stiennon et al., 2020) as rewards for fine-tuning abstractive summarization models has gained substantial attention. These methods learn to optimize rewards using techniques such as reinforcement learning (RL) (Stiennon et al., 2020) and minimum risk training.
Given the benefits of learning-based methods in improving the factuality of abstractive summarization, and recent advances in factuality metrics for detecting factual inconsistencies in generated summaries, it is natural to apply reward learning so that models learn from the feedback of factuality metrics. We investigate the following questions in this paper. Q1: Can contrastive reward learning effectively utilize existing factuality metrics to improve the factuality of abstractive summarization models? Q2: Can the improvement in factuality be reflected in human evaluation studies?
In this paper, we propose a contrastive reward learning framework that enables abstractive summarization models to directly learn from the feedback of factuality metrics in a sample-efficient manner. In contrast to other contrastive learning frameworks (Cao and Wang, 2021; Liu et al., 2021), our proposed framework does not rely on the complex construction of negative samples. Instead, similar to Liu et al. (2022), all candidate summaries used for contrastive learning are generated from pre-trained sequence-to-sequence models (Lewis et al., 2020; Zhang et al., 2020) using diverse beam search (Vijayakumar et al., 2018). Our framework also incorporates quality metrics to provide more fine-grained information on the ranking (positive / negative) of candidate summaries. Specifically, we investigate learning from the rewards of two factuality metrics: BARTScore (Yuan et al., 2021) and DAE (Goyal and Durrett, 2021). Through automatic and human evaluation studies, we demonstrate that our framework enables summarization models to generate significantly more factual summaries.
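As a concrete illustration of the candidate-generation step, the following sketch uses the Hugging Face transformers API with diverse beam search. The checkpoint name and decoding hyperparameters are illustrative assumptions, not the exact settings used in our experiments.

```python
# Sketch: candidate summary generation via diverse beam search.
# Checkpoint and decoding hyperparameters are illustrative assumptions.
from transformers import BartForConditionalGeneration, BartTokenizer

MODEL_NAME = "facebook/bart-large-cnn"  # assumed pre-trained checkpoint
tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_candidates(document: str, num_candidates: int = 16) -> list:
    """Generate a pool of diverse candidate summaries for one source document."""
    inputs = tokenizer(document, truncation=True, max_length=1024, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_candidates,
        num_beam_groups=num_candidates,  # one beam per group -> diverse beam search
        diversity_penalty=1.0,           # assumed penalty strength
        num_return_sequences=num_candidates,
        max_length=142,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```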
2 Contrastive Learning from Factuality Rewards

Contrastive Learning for Abstractive Summarization
Abstractive Summarization Given a source document $D$, the summarization model learns a generative model $g_\theta$ that converts the source document $D$ into a summary $S$:
$$S = g_\theta(D) \tag{1}$$
MLE Loss Given a training sample pair $\{D, S^r\}$ consisting of a source document $D$ and a reference summary $S^r$ (where $S^r$ consists of $L$ tokens, $S^r = \{s^r_1, \cdots, s^r_L\}$), the MLE loss $\mathcal{L}_{\text{mle}}$ aims to maximize the likelihood of the reference summary $S^r$ given the source document $D$:
$$\mathcal{L}_{\text{mle}} = -\sum_{j=1}^{L} \log p_{g_\theta}\left(s^r_j \mid D, s^r_{<j}\right) \tag{2}$$
where $s^r_{<j} = \{s^r_0, \cdots, s^r_{j-1}\}$ and $s^r_0$ is a pre-defined start token.
Despite its effectiveness in enforcing generated summaries to align with the reference summaries, the MLE loss is not aware of the quality (evaluated by some quality metric $M$) of the generated summaries. To address this issue, we introduce a contrastive loss (Liu et al., 2022).
Contrastive Loss Given a training sample pair $\{D, S^r\}$, let $S_i, S_j$ be candidate summaries generated from a pre-trained model given $D$, with $M(S_i) > M(S_j)$ for all $i, j$ such that $i < j$. The contrastive loss is defined as:
$$\mathcal{L}_{\text{ctr}} = \sum_{i} \sum_{j > i} \max\left(0,\; f(S_j) - f(S_i) + \lambda_{ij}\right) \tag{3}$$
where $\lambda_{ij} = (j - i) \times \lambda$ is the rank difference between the two candidates times a constant $\lambda$ (usually set to 1), and $f(S)$ is the length-normalized estimated log-probability:
$$f(S) = \frac{\sum_{t=1}^{|S|} \log p_{g_\theta}(s_t \mid D, S_{<t})}{|S|^{\alpha}} \tag{4}$$
where $\alpha$ is a constant. Intuitively, the contrastive loss penalizes any discoordination between the length-normalized estimated log-probability and the quality metric evaluation (i.e., when $f(S_j) > f(S_i)$ but $M(S_i) > M(S_j)$). The quality metric $M$ can be any evaluation criterion, including automatic evaluation metrics (Lin, 2004; Yuan et al., 2021; Goyal and Durrett, 2021) or human preferences (Ouyang et al., 2022).
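A minimal PyTorch sketch of this pairwise margin loss is given below. It assumes the candidates have already been sorted in decreasing order of the quality metric $M$; the tensor shapes and helper names are illustrative, not the released implementation.

```python
import torch

def length_normalized_logprob(token_logprobs: torch.Tensor,
                              lengths: torch.Tensor,
                              alpha: float = 1.0) -> torch.Tensor:
    """f(S): summed token log-probabilities divided by |S|**alpha.

    token_logprobs: (num_candidates, max_len), with padded positions set to 0.
    lengths: (num_candidates,) true candidate lengths.
    """
    return token_logprobs.sum(dim=-1) / lengths.float().pow(alpha)

def contrastive_loss(f_scores: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Pairwise margin loss over candidates sorted by the quality metric M,
    with index 0 being the best candidate. f_scores: (num_candidates,) values f(S_i)."""
    loss = f_scores.new_zeros(())
    n = f_scores.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            margin = (j - i) * lam  # lambda_ij = (j - i) * lambda
            loss = loss + torch.clamp(f_scores[j] - f_scores[i] + margin, min=0.0)
    return loss
```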

Combined Loss
The combined loss used for fine-tuning is given by:
$$\mathcal{L}_{\text{com}} = \mathcal{L}_{\text{mle}} + \gamma \mathcal{L}_{\text{ctr}} \tag{5}$$
where $\mathcal{L}_{\text{mle}}$ is the MLE loss in Equation 2, $\mathcal{L}_{\text{ctr}}$ is the contrastive loss in Equation 3, and $\gamma$ is the weight of the contrastive loss. Summarization models fine-tuned with $\mathcal{L}_{\text{com}}$ are referred to as CRL-COM.
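In code, the combined objective is a simple weighted sum. The sketch below assumes `mle_loss` and `ctr_loss` have already been computed as above; the default weight matches the value used in our fine-tuning setup.

```python
import torch

def combined_loss(mle_loss: torch.Tensor,
                  ctr_loss: torch.Tensor,
                  gamma: float = 100.0) -> torch.Tensor:
    """L_com = L_mle + gamma * L_ctr; gamma = 100 in our fine-tuning setup."""
    return mle_loss + gamma * ctr_loss
```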

Reward from Factuality Metrics
We use two factuality metrics as quality metrics M for use in the contrastive loss described in Equation 3.
BARTScore (Yuan et al., 2021): the factuality score is calculated as the log-likelihood of the summary given the source, using the reference-free version of BARTScore.
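To illustrate, a reference-free BARTScore-style reward can be approximated as the average token log-likelihood of the summary conditioned on the source. The sketch below is not the official BARTScore implementation, and the checkpoint name is an assumption.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

MODEL_NAME = "facebook/bart-large-cnn"  # assumed checkpoint; BARTScore releases its own
tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
scorer = BartForConditionalGeneration.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def faithfulness_reward(source: str, summary: str) -> float:
    """Average log p(summary token | source, previous summary tokens),
    in the spirit of the reference-free (source -> summary) BARTScore."""
    src = tokenizer(source, truncation=True, max_length=1024, return_tensors="pt")
    labels = tokenizer(summary, truncation=True, max_length=142,
                       return_tensors="pt")["input_ids"]
    out = scorer(input_ids=src["input_ids"],
                 attention_mask=src["attention_mask"],
                 labels=labels)
    # out.loss is the mean token-level negative log-likelihood of the summary
    return -out.loss.item()
```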
DAE (Goyal and Durrett, 2021): the score is calculated from the softmax output of the least-factual dependency arc within the sentences of the summary.
These two metrics were chosen for their relative computational efficiency, since the quality metric is evaluated many times during training; other factuality metrics can be computationally inefficient (Laban et al., 2022) and are therefore less feasible for use in reward-learning settings.

3 Experiments

Experimental Setup
Driven by the two research questions presented in the introduction, we train two factuality-driven summarization models, CRL-COM (B) and CRL-COM (D), using contrastive reward learning with BARTScore and DAE as quality metrics, respectively. A baseline summarization model, CRL-COM (R), is also trained with contrastive reward learning using ROUGE as the quality metric. Note that commonly used n-gram based metrics, including ROUGE (Lin, 2004), have been shown to correlate poorly with human evaluations, particularly with respect to factuality (Falke et al., 2019; Durmus et al., 2020). Thus, we focus on evaluating the factuality of CRL-COM (B) and CRL-COM (D) against CRL-COM (R), with the hypothesis that CRL-COM (B) and CRL-COM (D) should generate more factual summaries than CRL-COM (R).
Datasets: We use two abstractive summarization datasets: the CNN/Daily Mail (CNNDM) dataset (Hermann et al., 2015; Nallapati et al., 2016) and the XSUM dataset (Narayan et al., 2018). CNNDM summaries tend to be more extractive and consist of multiple sentences, while XSUM summaries are more abstractive and consist of a single sentence.
Models: Following the setting outlined in Liu et al. (2022), we fine-tuned a pre-trained BART model (Lewis et al., 2020) on the CNNDM dataset and a pre-trained PEGASUS model (Zhang et al., 2020) on the XSUM dataset.
Implementation and Fine-tuning Details: The combined loss (with contrastive-loss weight $\gamma = 100$) described in Equation 5 is used to fine-tune the pre-trained models. Following the few-shot fine-tuning paradigm of Liu et al. (2022), we sampled 1000 training examples from each dataset. Constant learning rates of $10^{-5}$ and $10^{-4}$ were applied to the fine-tuning process for the CNNDM and XSUM datasets, respectively, in order to facilitate fast convergence. For each dataset, we fine-tuned three models using three different quality metrics: ROUGE (R), BARTScore (B), and DAE (D), designated as CRL-COM (R), CRL-COM (B), and CRL-COM (D), respectively. During validation, we employed the same quality metric used for fine-tuning for early stopping.
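For illustration, early stopping on the validation quality metric can be handled with a small helper such as the following; this is a sketch, and the patience value is an assumption.

```python
class QualityMetricEarlyStopping:
    """Stop fine-tuning when the validation quality metric (the same metric
    used as the contrastive reward) stops improving. Patience is an assumed value."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best_score = float("-inf")
        self.bad_rounds = 0

    def should_stop(self, val_metric_score: float) -> bool:
        """Return True once the metric has failed to improve for `patience` rounds."""
        if val_metric_score > self.best_score:
            self.best_score = val_metric_score
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience
```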
Human Evaluation: To evaluate the factual consistency of the generated summaries from each model, we randomly sampled 100 examples from CNNDM and 200 examples from XSUM for human evaluation. We assess each summary from three perspectives: Factuality (FAC), Coherence (COH), and Relevance (REL), with a particular emphasis on factuality. The assessment follows guidelines similar to those in Liang et al. (2022) and Fabbri et al. (2021). The evaluation guidelines provided to the annotators are listed in Table 1. An expert annotator carried out the human evaluation studies.

Results and Analysis
Contrastive reward learning can enforce models to learn from the feedback of factuality metrics. Driven by Q1, we observe from the automatic evaluation results in Table 2 that contrastive reward learning enables abstractive summarization models to improve in a direction that aligns with the factuality metrics they learn from.
Learning from factuality metrics improves the factuality of abstractive summarization. Driven by Q2, the human evaluation results in Table 2 indicate that on both datasets, CRL-COM (B) and CRL-COM (D) exhibit superior factuality compared to CRL-COM (R). This suggests that while learning from factuality metrics such as BARTScore and DAE may sacrifice some performance in terms of ROUGE scores, the resulting models generate more factually consistent summaries. In other words, summaries with higher BARTScore or DAE scores but lower ROUGE scores tend to be more factually consistent with the source article than those with lower BARTScore or DAE scores but higher ROUGE scores. This further supports the assertion that BARTScore and DAE are effective at capturing factual information.
Learning from factuality metrics did not sacrifice coherence and relevance. According to human evaluations, the summaries generated by CRL-COM (B) and CRL-COM (D) showed coherence and relevance comparable to those generated by CRL-COM (R). This suggests that BARTScore and DAE have abilities comparable to ROUGE in measuring coherence and relevance.

Table 1: Guidelines for human evaluation studies

Factuality (FAC): If all the information and claims inside the summary are included in the source article, assign a binary score of 1; otherwise, assign a binary score of 0.

Coherence (COH): On a Likert scale of 1 (worst) to 5 (best), assign a score based on how well the relevant information is coordinated and organized into a well-structured summary.

Relevance (REL): On a Likert scale of 1 (worst) to 5 (best), assign a score based on the extent to which the summary includes only important information from the source article.
4 Related Work

Factuality Metrics for Abstractive Summarization
Various factuality metrics assess the factual consistency between a summary and its corresponding source document. QA-based factuality metrics leverage question generation (QG) models to generate questions from the summary and question answering (QA) models to answer those questions, given both the source and summary (Wang et al., 2020; Durmus et al., 2020; Scialom et al., 2021; Fabbri et al., 2022). Factuality is then evaluated based on the alignment between the answers from the source and summary. Another class of metrics, entailment-based factuality metrics (Kryscinski et al., 2020; Goyal and Durrett, 2021; Laban et al., 2022), evaluates whether all the information in the summary is entailed by the source document.
Recent studies on leveraging pre-trained language models as evaluators (Yuan et al., 2021) also achieve competitive performance in evaluating factuality.

Improving Factuality of Abstractive Summarization via Contrastive Learning
Several contrastive learning frameworks have been proposed to enable models to learn factuality from positive samples (such as reference summaries) and negative samples (such as edited reference summaries and system-generated summaries). Examples include CLIFF (Cao and Wang, 2021) and CO2Sum (Liu et al., 2021), which are similar in nature, though CO2Sum employs more sophisticated methods for negative sample construction.

Conclusion
In this work, we present a simple contrastive reward learning framework that enforces abstractive summarization models to learn from the feedback of existing factuality metrics. Empirical studies demonstrate the effectiveness of this approach, showing that abstractive summarization models that learn from factuality metric feedback through contrastive reward learning can generate more factual summaries without sacrificing coherence or relevance. This suggests that further advances in the reward learning paradigm and in factuality metrics can facilitate the development of more factually consistent abstractive summarization models.

Limitations
While we have included two distinct datasets (CNNDM and XSUM) in our experiments, more non-news datasets could be included in future studies. Other possibilities for future work include comparing the capability of RL-based reward learning and contrastive reward learning in improving the factuality of abstractive summarization models.

Ethics Statement
Even though some of the investigated systems may achieve a high level of factuality on the CNNDM dataset, this does not guarantee that they can be used as off-the-shelf factually consistent summarization models. Thorough evaluation should be conducted before deploying these models in high-stakes settings to ensure their reliability.

Figure 1: An illustration of our learning framework.


Table 2: Results of each system on the CNNDM and XSUM datasets. Note that R stands for ROUGE, B for BARTScore, and D for DAE.