SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization

In this paper, we present a conceptually simple while empirically powerful framework for abstractive summarization, SimCLS, which can bridge the gap between the learning objective and evaluation metrics resulting from the currently dominated sequence-to-sequence learning framework by formulating text generation as a reference-free evaluation problem (i.e., quality estimation) assisted by contrastive learning. Experimental results show that, with minor modification over existing top-scoring systems, SimCLS can improve the performance of existing top-performing models by a large margin. Particularly, 2.51 absolute improvement against BART and 2.50 over PEGASUS w.r.t ROUGE-1 on the CNN/DailyMail dataset, driving the state-of-the-art performance to a new level. We have open-sourced our codes and results: https://github.com/yixinL7/SimCLS. Results of our proposed models have been deployed into ExplainaBoard platform, which allows researchers to understand our systems in a more fine-grained way.


Introduction
Sequence-to-sequence (Seq2Seq) neural models (Sutskever et al., 2014) have been widely used for language generation tasks, such as abstractive summarization (Nallapati et al., 2016) and neural machine translation . While abstractive models (Lewis et al., 2020;Zhang et al., 2020a) have shown promising potentials in the summarization task, they share the widely acknowledged challenges of Seq2Seq model training. Specifically, Seq2Seq models are usually trained under the framework of Maximum Likelihood Estimation (MLE) and in practice they are commonly trained with the teacher-forcing (Williams and * Corresponding author.  Zipser, 1989) algorithm. This introduces a gap between the objective function and the evaluation metrics, as the objective function is based on local, token-level predictions while the evaluation metrics (e.g. ROUGE (Lin, 2004)) would compare the holistic similarity between the gold references and system outputs. Furthermore, during the test stage the model needs to generate outputs autoregressivelly, which means the errors made in the previous steps will accumulate. This gap between the training and test has been referred to as the exposure bias in the previous work (Bengio et al., 2015;Ranzato et al., 2016).
A main line of approaches (Paulus et al., 2018;Li et al., 2019) proposes to use the paradigm of Reinforcement Learning (RL) to mitigate the aforementioned gaps. While RL training makes it possible to train the model with rewards based on global predictions and closely related to the evaluation metrics, it introduces the common challenges of deep RL. Specifically, RL-based training suffers from the noise gradient estimation (Greensmith et al., 2004) problem, which often makes the training un-stable and sensitive to hyper-parameters. Minimum risk training, as an alternative, has also been used in the language generation tasks (Shen et al., 2016;Wieting et al., 2019). However, the accuracy of the estimated loss is restricted by the number of sampled outputs. Other methods (Wiseman and Rush, 2016;Edunov et al., 2018) aim to extend the framework of MLE to incorporate sentence-level scores into the objective functions. While these methods can mitigate the limitations of MLE training, the relation between the evaluation metrics and the objective functions used in their methods can be indirect and implicit.
Among this background, in this work we generalize the paradigm of contrastive learning (Chopra et al., 2005) to introduce an approach for abstractive summarization which achieves the goal of directly optimizing the model with the corresponding evaluation metrics, thereby mitigating the gaps between training and test stages in MLE training. While some related work (Lee et al., 2021;Pan et al., 2021) have proposed to introduce a contrastive loss as an augmentation of MLE training for conditional text generation tasks, we instead choose to disentangle the functions of contrastive loss and MLE loss by introducing them at different stages in our proposed framework.
Specifically, inspired by the recent work of Zhong et al. (2020); Liu et al. (2021b) on text summarization, we propose to use a two-stage model for abstractive summarization, where a Seq2Seq model is first trained to generate candidate summaries with MLE loss, and then a parameterized evaluation model is trained to rank the generated candidates with contrastive learning. By optimizing the generation model and evaluation model at separate stages, we are able to train these two modules with supervised learning, bypassing the challenging and intricate optimization process of the RL-based methods.
Our main contribution in this work is to approach metric-oriented training for abstractive summarization by proposing a generate-then-evaluate twostage framework with contrastive learning, which not only put the state-of-the-art performance on CNN/DailyMail to a new level (2.2 ROUGE-1 improvement against the baseline model), also demonstrates the great potentials of this two-stage framework, calling for future efforts on optimizing Seq2Seq models using methods beyond maximum likelihood estimation.

Contrastive Learning Framework for Abstractive Summarization
Given a source document D and a reference sum-maryŜ, the goal of an abstractive summarization model f is to generate the candidate summary S = f (D) such that it receives the highest score m = M (S,Ŝ) assigned by an evaluation metric M . In this work, we break down the holistic generation process into two stages which consist of a generation model g for generating candidate summaries and a evaluation model h for scoring and selecting the best candidate. Fig 1 illustrates the general framework.
Stage I: Candidate Generation The generation model g(·) is a Seq2Seq model trained to maximize the likelihood of reference summaryŜ given the source document D. The pre-trained g(·) is then used to produce multiple candidate summaries S 1 , · · · , S n with a sampling strategy such as Beam Search, where n is the number of sampled candidates.
Stage II: Reference-free Evaluation The highlevel idea is that a better candidate summary S i should obtain a higher quality score w.r.t the source document D. We approach the above idea by contrastive learning and define an evaluation function h(·) that aims to assign different scores r 1 , · · · , r n to the generated candidates solely based on the similarity between the source document and the candidate S i , i.e., r i = h(S i , D). The final output summary S is the candidate with the highest score: Here, we instantiate h(·) as a large pre-trained selfattention model, RoBERTa (Liu et al., 2019). It is used to encode S i and D separately, and the cosine similarity between the encoding of the first tokens is used as the similarity score r i .
Contrastive Training Instead of explicitly constructing a positive or negative example as most existing work with contrastive learning have adopted (Chen et al., 2020;, here the "contrastiveness" is reflect in the diverse qualities of naturally generated summaries evaluated by a parameterized model h(·). Specifically, we introduce a ranking loss to h(·): whereS 1 , · · · ,S n is descendingly sorted by M (S i ,Ŝ). Here, λ ij = (j −i) * λ is the corresponding margin that we defined following Zhong et al. (2020), and λ is a hyper-parameter. 1 M can be any automated evaluation metrics or human judgments and here we use ROUGE (Lin, 2004

Evaluation Metrics
We use ROUGE-1/2/L (R-1/2/L) as the main evaluation metrics for our experiments. We also evaluate our model on the recently developed semantic similarity metrics, namely, BERTScore (Zhang et al., 2020b) and MoverScore (Zhao et al., 2019).

Base Systems
As the generation model and the evaluation model in our two-stage framework are trained separately, we use pre-trained state-of-the-art abstractive summarization systems as our generation model. Specifically, we use BART (Lewis et al., 2020) and Pegasus (Zhang et al., 2020a) as they are popular and have been comprehensively evaluated.

Training Details
For baseline systems, we use the checkpoints provided by the

Results on CNNDM dataset
The results on CNNDM dataset are shown in Tab. 1. We use the pretrained BART 5 as the base generation model (Origin). We use BART, Pegasus, GSum (Dou et al., 2021) and ProphetNet (Qi et al., 2020) for comparison. Notably, the Max oracle which always selects the best candidate has much better performance than the original outputs, suggesting that using a diverse sampling strategy can further exploit the potential power of the pretrained abstractive system. Apart from ROUGE, we also present the evaluation results on semantic similarity metrics. Our method is able to outperform the baseline model on all metrics, demonstrating its improvement is beyond exploiting the potential artifacts of ROUGE. While the scale of improvement is harder to interpret with these metrics, we note that the improvement is able to pass the significance test.

System Summary Article
Ref.
chris ramsey says he has no problem shaking hands with john terry . queens park rangers host chelsea in the premier league on sunday . terry was once banned and fined for racist comments at loftus road . rio ferdinand , brother of anton , will not be fit to play against chelsea .
queens park rangers manager chris ramsey has revealed he will have no problem shaking john terry's hand in light of the racist comments the former england captain directed at former rs defender anton ferdinand four years ago . terry , who will line up against ramsey's side , was banned for four games and fined # 220,000 for the remarks made in october 2011 during chelsea's 1-0 defeat at loftus road . but ramsey , the premier league's only black manager , thinks the issue has been dealt with . ... ' i don't know what his feelings are towards me . as long as there wasn't anything on the field that was unprofessional by him , i would shake his hand . . queens park rangers manager chris ramsey speaks to the media on friday ahead of the chelsea match . chelsea captain john terry controls the ball during last weekend's premier league match against stoke . ramsey arrives for friday's pre-match press conference as qpr prepare to host chelsea at loftus road . ' the whole episode for british society sat uncomfortably . it's not something we want to highlight in football . it happened and it's being dealt with . we have to move on . and hopefully everyone has learned something from it . ' . ramsey revealed that rio ferdinand , who labelled terry an idiot for the abuse aimed at his brother , won't be fit in time for a reunion with the chelsea skipper this weekend . but the 52-year-old suspects his player's one-time england colleague will be on the receiving end of a hostile welcome from the home fans on his return the scene of the unsavoury incident . ... ferdinand and terry argue during qpr's 1-0 victory against chelsea at loftus road in october 2011 . rio ferdinand , brother of anton , will not be fit for sunday's match against chelsea .
SimCLS queens park rangers host chelsea in the premier league on sunday . qpr boss chris ramsey says he will have no problem shaking john terry's hand . terry was banned for four games and fined # 220,000 for racist comments . rio ferdinand , brother of anton , will not be fit for the match at loftus road .
Origin. john terry was banned for four games and fined # 220,000 for the remarks made in october 2011 during chelsea's 1-0 defeat at loftus road . terry will line up against chris ramsey's side on sunday . rio ferdinand , who labelled terry an idiot for the abuse aimed at his brother , won't be fit in time for a reunion with the chelsea skipper this weekend .  With the constraints of computation power, we try to use as many candidates as possible for the evaluation model training. However, we also notice that our method is robust to the specific number of candidates, as during test we found that our model is still able to outperform the baseline model with fewer candidates, which is illustrated in Fig. 2.

Fine-grained Analysis
To demonstrate that our method is able to make meaningful improvement w.r.t the summary quality, here we compare our method with the baseline model at different semantic levels on CNNDM. , we compare the model performance w.r.t the salient entities, which are entities in source documents that appear in the reference summaries. Specifically, (1) we extract the entities from the source documents, 6 (2) select the salient entities based on the entities in reference summaries,

Entity-level
(3) compare the salient entities with entities in candidate summaries. Results in Tab. 3 demonstrate that our method can better capture the important semantic information of the source documents.

Sentence-level
Sentence Alignments Here we investigate if our method makes sentence-level differences compared to the baseline model. Specifically, (1) we match each sentence in the summaries to a sentence in the source documents based on their similarity (indicated by ROUGE scores), 7 (2) compute the sentence-level similarity between the reference and system-generated summaries based on the overlaps of their matched sentences in the source documents. The results in Tab. 3 demonstrate that the generated summaries of our method is more similar to the reference summaries at the sentence level.
Positional Bias In Tab. 2, we present a case study of the sentence alignment. We use the same matching approach to map the summary sentences to the sentences in source articles. In this example, the output of our method focuses on the same sentences as the reference summary does, while the baseline summary focuses on some different sentences.
Interestingly, the reference summary focuses on the very last sentence in the article, and our method can follow this pattern. Upon examining this pattern, we notice a positional bias of abstractive models when handling long source articles (more than   Fig. 3 shows that the baseline summaries are more likely to focus on the head sentences compared to the references, which may result from the autoregressive generation process of the Seq2Seq models. Our method is able to mitigate this bias, as the candidate sampling process (diverse beam search) generates candidates different from the original outputs, and our evaluation model can assess the holistic quality of the candidates.

Results on XSum dataset
To evaluate our method's performance beyond CNNDM dataset, we also test our method on XSum dataset, and the results are shown in Tab. 4. Here, we use Pegasus 8 as the base system since it achieves better performance than BART on XSum. We follow the same sampling strategy to generate the training data. However, as this strategy generally results in lower ROUGE-2 score on XSum dataset, we use a different strategy to generate the validation and test data (4 candidates generated by 4 diverse groups). Our method is still able to outperform the baseline, but with a smaller margin compared to CNNDM. Summaries in XSum are shorter (one-sentence) and more abstractive, which restricts the semantic diversity of candidates and makes it harder to make meaningful improvement.

Conclusion
In this work, we present a contrastive summarization framework that aims to optimize the quality of generated summaries at summary-level, which mitigates the discrepancy between the training and test 8 'google/pegasus-xsum'  stages in the MLE framework. Apart from the significant improvement over the baseline model on CNNDM dataset, we present a comprehensive evaluation at different semantic levels, explaining the sources of the improvement made by our method. Notably, our experimental results also indicate that the existing abstractive systems have the potential of generating candidate summaries much better than the original outputs. Therefore, our work opens up the possibility for future directions including (1) extending this two-stage strategy to other datasets for abstractive models; (2) improving the training algorithms for abstractive models towards a more holistic optimization process.  The source documents and reference summaries are lower-cased. Due to the input length limitation, some source documents are truncated during training.

B Experiment Details
Candidate Generation We use diverse beam search to generate the candidate summaries. We use the same beam search configuration as the original work except those related to diverse beam search. In particular, the diversity penalty is set to 1, and we use 16 diversity groups with 16 beams, which results in 16 candidates. Model We use the pretrained RoBERTa with 'roberta-base' version provided by the Transformers library as our evaluation model, which contains 125M parameters. Optimizer We use Adam optimizer with learning rate scheduling: lr = 0.002 · min(step num −0.5 , step num · warmup steps −1.5 ), where the warmup steps is 10000.
Training details The batch size in our experiments is 32. We evaluate the model performance on the validation set at every 1000 steps, using the averaged ROUGE-1/2/L score as the selecting criteria. The training is converged in 5 epochs, which takes around 40 hours on 4 GTX-1080-Ti GPUs on CNN/DailyMail dataset and 20 hours on XSum dataset.