Alleviating Exposure Bias via Multi-level Contrastive Learning and Deviation Simulation in Abstractive Summarization



Introduction
Automatic text summarization (El-Kassas et al., 2021) is the task of condensing a piece of text into a shorter version while preserving its most salient information and overall meaning. There are two main research directions for text summarization: extractive summarization and abstractive summarization (Nenkova et al., 2011). Extractive summarization involves selecting salient text spans from source documents, while abstractive summarization generates concise summaries in a sequence-to-sequence manner (Liu and Lapata, 2019; Raffel et al., 2020). Recently, large pre-trained neural models, typically based on the encoder-decoder Transformer (Liu and Lapata, 2019; Bao et al., 2020), have shown promising performance in abstractive summarization. These models are generally optimized using maximum likelihood estimation (MLE) in teacher forcing form (Bengio et al., 2015; Lamb et al., 2016) to maximize the predictive probability of the reference output given its prior gold sub-sequence. However, during inference, the models produce output based on the generated sub-sequence, which may contain errors. This mismatch between training and inference can negatively impact model performance and is known as exposure bias (Bengio et al., 2015; Ranzato et al., 2015).
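To make the mismatch concrete, the following minimal PyTorch-style sketch contrasts a teacher-forcing training step with free-running (auto-regressive) inference; the model interface, token-id tensors, and function names are hypothetical placeholders, not the implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_step(model, src_ids, ref_ids, pad_id=0):
    """Training: the decoder is conditioned on the GOLD prefix at every step."""
    decoder_input = ref_ids[:, :-1]          # gold tokens shifted right
    labels = ref_ids[:, 1:]                  # next-token targets
    logits = model(src_ids, decoder_input)   # (batch, len, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), ignore_index=pad_id)

@torch.no_grad()
def free_running_decode(model, src_ids, bos_id, eos_id, max_len=128):
    """Inference: the decoder is conditioned on its OWN, possibly erroneous, prefix."""
    generated = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model(src_ids, generated)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy for brevity
        generated = torch.cat([generated, next_token], dim=-1)
        if (next_token == eos_id).all():
            break
    return generated
```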
To alleviate this ubiquitous issue while maintaining reasonable performance, a variety of methods from various perspectives have been proposed. Among them, sentence-level training (Shao et al., 2018; Paulus et al., 2018; Stiennon et al., 2020) aims to train the model using sentence-level evaluation metrics (e.g., ROUGE). Another line of attempts involves forging rational schemes to inject noise or perturb the reference output during the decoding stage of training (Venkatraman et al., 2015; Ning et al., 2023); a model trained in such a noisy environment can perceive more of the states encountered at inference. Notably, recent contrastive learning methods in abstractive summarization combine the strengths of these approaches: their contrastive objectives guide the model to distinguish between summaries with different metric scores at the sentence level.
While the strong contrastive method of Liu et al. (2022) achieves good results by correlating the relative order of probabilities of system-generated summaries with their evaluation metrics, these summaries have considerably high relevance to their gold summary, so low probabilities assigned to high-quality summaries can result in a low probability being assigned to the gold summary itself. Since the probability of the reference summary is connected with the generative objective that preserves the generation ability of the model, the lack of constraint on the absolute position of the probabilities of high-quality summaries could deteriorate model performance. To this end, we introduce a simple multi-level contrastive framework composed of fine and coarse contrastive components to address the exposure bias problem and further explore the potential of large abstractive models in distinguishing semantic discrepancies among diverse summaries. Unlike existing margin ranking objectives (Zhong et al., 2020; Liu et al., 2021), which obtain a suitable margin with an expensive grid search, we propose a bootstrap re-sampling process to acquire an adaptive margin in the fine contrastive stage, which is more compatible with our whole design. As for the coarse contrastive learning part, we design it mainly for two purposes: ensuring that all probabilities of high-quality summaries stay at a properly high level, and further discriminating between the semantics of high- and low-quality summaries.

In addition, we introduce SDSA, a tailored sparse decoder self-attention pattern, to bridge the gap between training and inference. Unlike existing methods that perturb the reference output in discrete space, SDSA operates in latent space. Specifically, we use this pattern during training while using standard attention during inference. Although the pattern may corrupt some salient token information of the reference, we argue that the resulting deviation is acceptable, since the existing knowledge, including position information in the prior sub-sequence, is sufficient to predict the next token (Ethayarajh, 2019; Klafka and Ettinger, 2020).
Experiments on the CNN/DailyMail and XSum datasets show that SimMCS consistently outperforms the prior state of the art. Furthermore, incorporating SDSA into SimMCS further improves model performance by a large margin. In-depth analysis indicates that our system retains the advantages of previous contrastive methods and has strong few-shot learning ability.

Related Work
Transformer-based Pre-trained Model Recently, large Seq2Seq Transformers (Vaswani et al., 2017), which contain encoder self-attention, decoder self-attention, and decoder cross-attention, have achieved promising performance in the NLP domain, including text summarization (Song et al., 2019; Raffel et al., 2020). These models are pre-trained with a variety of self-supervised objectives and fine-tuned with structured losses on downstream tasks. For example, BART (Lewis et al., 2020), a denoising autoencoder, is pre-trained to reconstruct original text spans corrupted with an arbitrary noising function such as text infilling. PEGASUS (Zhang et al., 2020) is distinguished by a self-supervised pre-training objective specifically tailored to summarization: salient text spans are removed or masked from the original document, and the model aims to restore the removed spans. We use these models as backbones in our work.

Mitigating Exposure Bias for Abstractive Summarization In the NLG domain, exposure bias is widespread and has received much attention from researchers (Daumé et al., 2009; Ross et al., 2011; Bengio et al., 2015; Wiseman and Rush, 2016; Zhang et al., 2019b; Ziegler et al., 2019). In abstractive summarization, Kryściński et al. (2018) introduces a reinforcement learning method with the ROUGE metric as a reward to encourage the generation of novel phrases. Inspired by Generative Adversarial Networks, Scialom et al. (2020) proposes a novel approach for sequence generation in which the discriminator is integrated into beam search (Tillmann and Ney, 2003; Li et al., 2016; Wiseman et al., 2017; Chen et al., 2018).

Contrastive Learning Contrastive learning has been widely confirmed to effectively boost model performance by allowing the model to distinguish between the quality of diverse samples (Chuang et al., 2020). Recently the method has shown promising performance in natural language generation tasks such as text summarization (Cao and Wang, 2021) and machine translation (Yang et al., 2019; Pan et al., 2021). Contrastive examples can be constructed using rule- or model-based methods, with the latter able to produce text examples closer to human-generated ones and forge more natural contrastive schemes. On the other hand, contrastive learning can be performed in latent or discrete space. For instance, Gao et al. (2021) introduces a contrastive learning framework for sentence embedding representations and greatly advances the state of the art. Liu et al. (2022) adopts discriminative re-ranking over generated summaries in discrete space, in line with other works (Shen et al., 2004; Och et al., 2004; Mizumoto and Matsumoto, 2016; Lee et al., 2021).

Bootstrap Re-sampling The bootstrap approach is a collection of sample reuse techniques designed to estimate sampling variances, confidence intervals, and other properties of statistics (Stine, 1989; Efron, 1992; Diciccio and Efron, 1992). Compared to traditional approaches, these techniques have fewer requirements and assumptions while achieving better performance and providing insight into many problems.

Methodology
Our complete design consists of a generative objective, the multi-level contrastive learning framework SimMCS, and a sparse decoder self-attention pattern SDSA. During training, we train the abstractive model according to the pipeline shown in Fig. 1. The reference summary is only used in the generative objective, while the other types of summaries are used for contrastive learning. At the inference stage, the optimized model generates summaries in the conventional manner, using only the source document as input to its encoder.

Generative Objective
The training objective for summary generation consists of a sequence of token decisions made in an auto-regressive manner. This is formulated as a product of decision probabilities corresponding to the specified tokens. Given a document D and a summary S, we estimate the following conditional probability:

$$p(S \mid D; \theta) = \prod_{t=1}^{|S|} p(s_t \mid D, s_{<t}; \theta) \quad (1)$$

where |S| stands for the number of tokens in summary S, θ represents the model parameters, and s_{<t} denotes all tokens prior to position t.
In fact, most works based on Seq2Seq Transformers minimize the negative log-likelihood (NLL) of reference summaries. Following prior works, given the current model parameters θ and a set of N document-reference pairs $\{(D^{(i)}, S^{*(i)})\}_{i=1}^{N}$, our generative objective is as follows:

$$\mathcal{L}_{gen} = -\frac{1}{N}\sum_{i=1}^{N} \log p\big(S^{*(i)} \mid D^{(i)}; \theta\big) \quad (2)$$

During practical fine-tuning, following previous works, we transform the generative objective in Eq. 2 into a label-smoothed cross-entropy loss (Szegedy et al., 2016; Pereyra et al., 2017) with the smoothing parameter set to 0.1.
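As a concrete illustration, the following sketch computes the token-level label-smoothed NLL over a reference summary given decoder logits. The smoothing value of 0.1 follows the paper; the tensor names and the use of PyTorch's built-in label_smoothing argument are assumptions rather than the authors' implementation.

```python
import torch.nn.functional as F

def generative_loss(logits, ref_ids, pad_id, smoothing=0.1):
    """Label-smoothed cross-entropy over reference tokens (Eq. 2 plus smoothing).

    logits:  (batch, seq_len, vocab) decoder outputs under teacher forcing
    ref_ids: (batch, seq_len) gold next-token ids
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        ref_ids.reshape(-1),
        ignore_index=pad_id,
        label_smoothing=smoothing,  # available in PyTorch >= 1.10
    )
```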

Multi-level Contrastive Learning
Our multi-level contrastive learning framework, SimMCS, is designed for abstractive summarization and consists of fine and coarse contrastive components. Compared to recent contrastive methods that operate at a single level, SimMCS combines different contrastive signals in a natural way to further distinguish the semantic quality of summaries.
Each data point contains a source document, a reference summary, n system-generated summaries (positive), and m randomly selected summaries weakly correlated with the reference (negative), namely $\big(D, S^*, \{S_i^+\}_{i=1}^{n}, \{S_j^-\}_{j=1}^{m}\big)$. We divide the n positive summaries and the m negative summaries into a positive group and a negative group, respectively.
Similar to Eq. 1, we calculate the probability mass $P_S$ corresponding to summary S as follows:

$$P_S = \frac{1}{|S|^{\beta}} \sum_{t=1}^{|S|} \log p(s_t \mid D, s_{<t}; \theta) \quad (3)$$

where the hyper-parameter β controls the degree of the length penalty (Wu et al., 2016). Accordingly, the probability mass $P_S$ ranges from −∞ to 0. Using Eq. 3, we obtain the probability mass of the summaries in the positive and negative groups, which then participate in the following contrastive objectives.
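A minimal sketch of this length-penalized probability mass, assuming the token-level log-probabilities of a candidate summary have already been gathered from the decoder; the function and tensor names are illustrative placeholders.

```python
def probability_mass(token_logprobs, summary_mask, beta=1.0):
    """Length-penalized sum of token log-probabilities (Eq. 3).

    token_logprobs: (batch, seq_len) log p(s_t | D, s_<t) for each candidate token
    summary_mask:   (batch, seq_len) float mask, 1 for real tokens and 0 for padding
    beta:           degree of the length penalty
    """
    logprob_sum = (token_logprobs * summary_mask).sum(dim=-1)
    lengths = summary_mask.sum(dim=-1).clamp(min=1.0)
    return logprob_sum / lengths.pow(beta)  # always <= 0
```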

Fine Contrastive Learning
At the fine level, we consider the coordination between model-predicted probabilities and the quality of in-group summaries. Since measuring the quality of summaries with evaluation metrics such as ROUGE (Lin, 2004), BERTScore (Zhang et al., 2019a), and BARTScore (Yuan et al., 2021) involves non-trivial overhead, we only inject the fine contrastive signal from the positive group into our training procedure. While the lack of a fine contrastive signal from the negative group may weaken the ability of the model to rank low-quality summaries, we speculate that this trade-off is acceptable, as the generation process during inference mainly requires comparison among relatively high-quality candidates for a strong summarizer.
To simplify the following exposition, we sort the positive summaries in descending order of metric score. That is, given a specific metric M, $M(S^*, S_i^+) > M(S^*, S_j^+),\ \forall i, j,\ i < j$, and the model-predicted probabilities correspond to the sorted summaries. To encourage the model to assign higher probabilities to summaries with higher metric scores, we formulate the following objective:

$$\mathcal{L}_{fine} = \sum_{i} \sum_{j>i} \max\big(0,\ P_{S_j^+} - P_{S_i^+} + \lambda_{ij}\big) \quad (4)$$

$$\lambda_{ij} = (j - i)\,\lambda \quad (5)$$

where λ represents the unit margin and λ_{ij} is the threshold that determines whether the difference between $P_{S_j^+}$ and $P_{S_i^+}$ engages in backpropagation.

Margin Estimation with Bootstrap Re-sampling The unit margin λ in Eq. 5 sets the scale of the threshold. Since our ultimate goal is to attend to both the relative order and the absolute position of the probabilities, their valid range probably changes as training progresses. Therefore, instead of framing λ as a hyper-parameter, we estimate it with bootstrap re-sampling to obtain a more representative margin.
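A sketch of the ranking loss in Eq. 4, assuming candidates are already sorted by decreasing metric score and the (j − i)·λ margin scaling of Eq. 5; here λ is simply passed in, whereas the paper estimates it with the bootstrap procedure described next.

```python
import torch

def fine_contrastive_loss(pos_mass, unit_margin):
    """Pairwise hinge ranking loss over sorted positive candidates.

    pos_mass:    (batch, n) probability masses P_{S_i^+}; column 0 is the best candidate
    unit_margin: scalar unit margin (lambda)
    """
    n = pos_mass.size(1)
    loss = pos_mass.new_zeros(())
    for i in range(n - 1):
        for j in range(i + 1, n):
            margin = (j - i) * unit_margin
            # penalize when the worse candidate is not at least `margin` below the better one
            loss = loss + torch.clamp(pos_mass[:, j] - pos_mass[:, i] + margin, min=0).mean()
    return loss
```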
Algorithm 1 shows the bootstrap re-sampling procedure.
The probability masses $\{P_{S_i^+}\}_{i=1}^{n}$ of the positive summaries are regarded as sample points following a certain population distribution F, and $\hat{F}_n$ stands for the empirical distribution consisting of these points. The function g(·) computes a statistic representing the unit margin given the sample points. Accordingly, we perform bootstrap re-sampling to acquire B bootstrap statistics, and then calculate a 1 − α confidence interval for λ from these statistics. In general, there are three main ways to estimate a bootstrap confidence interval (Efron, 1979): the normal interval, the percentile interval, and the pivotal interval; we choose the last. Note that the algorithm does not involve backpropagation. More details are shown in Appendix A.2.
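The sketch below mirrors the described procedure under stated assumptions: g(·) is taken as the average adjacent spacing of the sample points (consistent with Appendix A.2), and the pivotal interval follows the textbook definition rather than the authors' exact code.

```python
import numpy as np

def average_adjacent_spacing(points):
    """g(.): average gap between neighboring sample points, (max - min) / (num - 1)."""
    points = np.asarray(points, dtype=float)
    return (points.max() - points.min()) / max(len(points) - 1, 1)

def bootstrap_margin_interval(pos_mass, B=1000, alpha=0.05, rng=None):
    """Estimate a 1 - alpha pivotal confidence interval for the unit margin.

    pos_mass: 1-D array of probability masses of positive candidates for one example
    """
    rng = np.random.default_rng() if rng is None else rng
    pos_mass = np.asarray(pos_mass, dtype=float)
    theta_hat = average_adjacent_spacing(pos_mass)   # statistic on the full sample
    boot = np.empty(B)
    for b in range(B):
        resample = rng.choice(pos_mass, size=len(pos_mass), replace=True)
        boot[b] = average_adjacent_spacing(resample)
    q_lo, q_hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    # pivotal (basic) interval: [2*theta_hat - q_hi, 2*theta_hat - q_lo]
    return 2 * theta_hat - q_hi, 2 * theta_hat - q_lo
```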

Coarse Contrastive Learning
In addition to re-ranking the summaries in the positive group, we argue that it is also important to assign properly high probability mass to high-quality summaries. To this end, we introduce a new contrastive objective with a dual purpose: first, as a constraint ensuring that the probability mass of high-quality summaries stays at a relatively high level; and second, as a signal enabling the system to further distinguish the semantic discrepancy between high- and low-quality summaries.
where $P_{pos} = \sum_{i=1}^{n} w_i P_{S_i^+}$ denotes the weighted average of the probability masses of positive summaries, with higher-quality summaries receiving greater weight $w_i$; $P_{neg} = \frac{1}{m}\sum_{j=1}^{m} P_{S_j^-}$ is the average probability mass assigned to negative summaries; and ξ indicates the strength of the constraint on $P_{pos}$. Through the combination of fine and coarse contrastive signals, our contrastive loss is expected to keep the model-predicted probability mass of high-quality summaries at a properly high level and thereby prevent degradation of model performance. Meanwhile, the fine contrastive signal allows the model to retain its ability to perceive the quality discrepancy among positive summaries (see Fig. 2).
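Since the coarse objective's equation is not reproduced in this copy, the sketch below is only one plausible hinge-style realization of the two stated purposes (keep P_pos above a floor controlled by ξ, and keep it above P_neg); the functional form and the weighting scheme are assumptions, not the paper's exact loss.

```python
import torch

def coarse_contrastive_loss(pos_mass, neg_mass, weights, xi):
    """One possible coarse objective over positive and negative probability masses.

    pos_mass: (batch, n) masses of positive candidates
    neg_mass: (batch, m) masses of negative candidates
    weights:  (n,) non-negative weights summing to 1, larger for higher-quality candidates
    xi:       strength of the constraint on P_pos
    """
    p_pos = (pos_mass * weights).sum(dim=-1)            # weighted average over positives
    p_neg = neg_mass.mean(dim=-1)                       # plain average over negatives
    floor_term = torch.clamp(-xi - p_pos, min=0)        # masses are <= 0, so -xi acts as a floor
    gap_term = torch.clamp(p_neg - p_pos + xi, min=0)   # keep positives above negatives by xi
    return (floor_term + gap_term).mean()
```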
To preserve both the generation and evaluation abilities of the model, we combine the objectives above into a universal loss function (Edunov et al., 2018):

$$\mathcal{L} = \mathcal{L}_{gen} + \gamma_1 \mathcal{L}_{fine} + \gamma_2 \mathcal{L}_{coarse}$$

where γ_1 and γ_2 are the weights of the fine contrastive loss and the coarse contrastive loss, respectively.

Sparse Decoder Self Attention
A Seq2Seq Transformer comprises three core attention modules: encoder self-attention, decoder self-attention, and decoder cross-attention. During training, the conventional decoder self-attention depicted in Fig. 3 (a) is a crucial factor in the occurrence of the aforementioned mismatch (Arora et al., 2022).
To alleviate this ubiquitous issue in abstractive summarization, we tailor a simple decoder self-attention pattern, shown in Fig. 3 (b), to simulate the potential deviation that may arise during the inference phase. Specifically, during training, we mask out attention weights corresponding to arbitrarily selected token positions on the "Key" side (excluding the start token), while using the standard attention mechanism during inference. In this way, the system learns to maximize the predictive probability of the reference output based on a previous suboptimal sub-sequence. It is important to note that the start token should not be masked, since it is always present during inference.
Mathematically, we formulate a straightforward masking strategy for the decoder self-attention pattern that mimics the accumulation of errors during inference (Zhang et al., 2019b; Bonavita and Laloyaux, 2020). The mask ratio r for each token within a sequence is computed as a function of its position, where i represents the token position (starting from 0), len denotes the sequence length, and MR determines the degree of sparsity of our tailored decoder self-attention during training.
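As an illustration, a sparse key-side mask for training could be built as follows, assuming the mask ratio grows linearly with the key position (so the start token at position 0 is never dropped) and that the random mask is shared across query positions; these details are assumptions, not the paper's exact implementation.

```python
import torch

def sdsa_key_mask(seq_len, max_ratio, device=None):
    """Boolean (query, key) mask for sparse decoder self-attention during training.

    True entries are masked out. Key position 0 (start token) is never masked; later
    key positions are dropped with probability proportional to their position,
    mimicking the accumulation of errors during inference.
    """
    positions = torch.arange(seq_len, device=device)
    drop_prob = positions.float() / max(seq_len, 1) * max_ratio    # grows with position
    dropped_keys = torch.rand(seq_len, device=device) < drop_prob  # position 0: prob 0
    causal = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=1)
    # combine the standard causal mask with the random key-side mask
    return causal | dropped_keys.unsqueeze(0)
```

At inference time the standard causal mask alone would be used, matching the paper's use of conventional decoder self-attention at that stage.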

Experiments
Here we provide a brief overview of the datasets, baselines, implementation, and evaluation.More experimental details are provided in Appendix A.

Datasets
In this work, we evaluate our approach on two widely used news summarization datasets, CNN/DailyMail (CNNDM) and XSum (see Tab. 4 for statistics).

Baselines
We compare our experimental results with prior related works that exhibit outstanding performance. In particular, BART (Lewis et al., 2020)  mean representation and maximizes the similarities between them during training. ConSum (Sun and Li, 2021) remedies the exposure bias problem by decreasing the likelihood of low-quality summaries while increasing the likelihood of reference summaries. SummaReranker (Ravaut et al., 2022) learns to select a high-quality summary from a collection of candidates by applying re-ranking with a second-stage model. BRIO (Liu et al., 2022) introduces a new paradigm that assumes a non-deterministic distribution over candidate summaries instead of a deterministic distribution concentrated on the gold summary.

Implementation Details
Our implementation is mainly based on PyTorch and the Transformers library (Wolf et al., 2020), and we train on 4 NVIDIA RTX 3090 GPUs.

Backbone Settings In accordance with prior works, we use BART-Large, with 12 layers each for the encoder and decoder, and PEGASUS-Large, with 16 encoder layers and 16 decoder layers, as our backbones. In particular, the hidden size of each layer is 1024, which is split into 16 attention heads with a per-head hidden size of 64 for multi-head attention.
Training and Inference For each document, we obtain 16 summaries generated by a summarizer as positive summaries and 2 different human-generated summaries weakly correlated with the document as negative summaries. All PLMs are trained using the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9 and β2 = 0.999, along with a learning rate schedule. Notably, at the training stage, we leverage our proposed SDSA pattern as an alternative to the standard decoder self-attention in all decoder layers.
During inference, following common practice, summaries are generated with beam search in an auto-regressive manner (Wiseman and Rush, 2016) given the source documents. In addition, note that we employ the standard decoder self-attention rather than SDSA at this stage.
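For reference, a minimal Hugging Face generation call of this kind might look as follows; the checkpoint name, beam width, and length penalty shown here are illustrative placeholders rather than the paper's settings.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

document_text = "Some long news article text ..."
inputs = tokenizer(document_text, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(
    **inputs,
    num_beams=4,          # beam width; the paper also explores 10/20/50/100
    length_penalty=2.0,   # illustrative value
    max_length=142,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```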

Evaluations
In practice, we measure the quality of generated summaries using the popular ROUGE metric. On the test sets of CNNDM and XSum, we report full-length F1-based ROUGE-1, ROUGE-2, and ROUGE-L scores computed with the standard ROUGE-1.5.5.pl script. Furthermore, we also use two popular model-based metrics, BERTScore and BARTScore, to demonstrate the superiority of our approaches more comprehensively.
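Outside the official Perl script, ROUGE can be approximated for quick checks with the rouge-score Python package; this is an illustrative alternative, not the evaluation setup behind the reported numbers.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "the reference summary text",          # target
    "the system generated summary text",   # prediction
)
print({name: round(score.fmeasure, 4) for name, score in scores.items()})
```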

Discussion
We evaluate two variants of our contrastive framework SimMCS: (1) SimMCS-Std uses the standard attention modules, and (2) SimMCS-SDSA instead leverages our SDSA pattern during training. Specifically, we select BART as the base model on CNNDM and PEGASUS on XSum.

Comparison Results
Tab. 1 compares the results of the baseline, previous work from the literature, and our proposed approaches on the CNNDM test set. We first note that even SimMCS-Std outperforms the previous state-of-the-art model, which is built on an extra single-level contrastive signal. To demonstrate the effectiveness of our method beyond the CNNDM dataset, we also conduct experiments on another news dataset, XSum. As shown in Tab. 2, our method notably surpasses previous baselines and achieves new state-of-the-art results. The trend is similar to that on CNNDM and shows the strong generalization of our methods.

Analysis
We further analyze the properties of our state-of-the-art method and compare it with other strong baselines on the CNNDM dataset to gain more insights.

Ablation Studies We verify the contributions of various components in SimMCS-SDSA on the CNNDM test set and show the ablation study results in Tab. 3. Specifically, we consider removing SDSA, coarse contrastive learning (Coarse-Ctr), fine contrastive learning (Fine-Ctr), and margin estimation with bootstrap re-sampling (Boot), respectively. Based on these results, we conclude that 1) removing SDSA, Coarse-Ctr, or Fine-Ctr substantially hurts performance, and 2) an adaptive margin under our SimMCS framework improves model performance on most metrics. Considering that the proper margin in previous work is obtained with an expensive grid search, our adaptive margin requires lower search overhead.

Low-Resource Evaluation Considering that our approach can be applied to any Seq2Seq Transformer, it is also imperative to assess its performance in low-resource settings. The results in Fig. 4 show that our approach has a fairly strong few-shot learning capability.

Increasing Beam Width Since our multi-level contrastive framework includes a fine contrastive signal to rerank the quality of summaries, it preserves the ability to coordinate candidate summaries and therefore improves the upper bound of performance with an increasing beam width (i.e., the number of beams). We test our model with beam widths of 4, 10, 20, 50, and 100. As shown in Fig. 5, the larger the beam width, the better the performance on the CNNDM test set.
Abstractiveness We analyze the abstractiveness of generated summaries by calculating the percentage of novel n-grams, defined as those that appear in the summary but not in the associated source document (See et al., 2017; Liu and Lapata, 2019). As shown in Fig. 6, our state-of-the-art model generates more abstractive summaries than the base model BART for all n-gram sizes. Additionally, Fig. 7 demonstrates that the summaries generated by our model effectively convey salient information and are closer to the reference summaries.
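A minimal sketch of the novel n-gram percentage computation; whitespace tokenization and set-based counting are assumed simplifications.

```python
def novel_ngram_ratio(summary, document, n):
    """Fraction of distinct summary n-grams that do not appear in the source document."""
    def ngrams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - ngrams(document)) / len(summary_ngrams)
```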
Case Study on CNNDM Fig. 7 displays several examples of summaries generated by SimMCS-SDSA and the base model BART, along with their corresponding reference summaries. SimMCS-SDSA is capable of identifying salient text spans that are overlooked by the base model. Furthermore, compared to the base summaries, the summaries produced by our model exhibit fewer syntactical errors and align more closely with the reference summaries.

Conclusion and Future Work
We introduce SimMCS, a simple multi-level contrastive learning framework for abstractive summarization that simultaneously considers the relative order and absolute position of the probabilities assigned to high-quality summaries and further discriminates the semantic discrepancy between summaries of different quality levels. Furthermore, we propose a simple yet empirically effective decoder self-attention pattern that alleviates exposure bias and improves model performance by a large margin. Our methods are not restricted to specific tasks or models and demonstrate strong generalization in conditional text generation tasks.
There are several directions for further exploiting the potential of our approaches. First, we believe the margin estimation with bootstrap re-sampling could be more accurate and robust if more sample points (i.e., probability masses) were available. Second, it is also feasible to explore sparse decoder self-attention mechanisms with more diverse strategies. Finally, our methods could be extended to other text generation tasks such as machine translation.

Limitations
Since our contrastive framework requires the probability mass of various summaries given a source document, GPU memory consumption is extremely large even when the batch size is small, which limits the scale of the contrastive data and suppresses the potential of our method. Meanwhile, due to the limited number of sample points (i.e., probability masses), our bootstrap re-sampling procedure is susceptible to outliers and cannot take full advantage of the algorithm. In addition, like most abstractive summarization systems, our model does not emphasize controllable text generation (Hu et al., 2017; Prabhumoye et al., 2020; He et al., 2020), which means the generated text might contain redundant or incorrect information.

Ethical Considerations
While there is limited risk associated with our work, as with existing abstractive summarization systems, there is no guarantee that the generated summaries are factually consistent and free from hallucination (Maynez et al., 2020; Kang and Hashimoto, 2020). Therefore, caution is imperative when our system is applied in practical projects.

A.2 Settings
In this paper, we follow recent works (Liu and Liu, 2021; Liu et al., 2022) and generate 16 candidate summaries as the positive group for each data sample using diverse beam search (Vijayakumar et al., 2018). We also randomly select 2 summaries that have low correlation with the gold summary. On CNNDM, we use the pre-trained BART-Large to generate candidate summaries, while for XSum we produce candidate summaries via PEGASUS-Large. The generated candidate summaries are ordered by their ROUGE-1 score. For model training, we employ the Adam optimizer with a dynamic learning rate:

$$lr = 2 \times 10^{-3} \cdot \min\big(step^{-0.5},\ step \times warmup^{-1.5}\big)$$

where warmup denotes the number of warmup steps (set to 10000), step is the number of update steps, and lr is the learning rate.
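The schedule above can be wired into PyTorch with a LambdaLR multiplier; folding the 2e-3 factor into the lambda and setting the optimizer's base learning rate to 1.0 is an implementation convenience assumed here, not necessarily the authors' setup.

```python
import torch

def make_scheduler(optimizer, warmup=10000):
    """lr = 2e-3 * min(step^-0.5, step * warmup^-1.5): linear warmup then inverse-sqrt decay."""
    def lr_lambda(step):
        step = max(step, 1)  # avoid division by zero at step 0
        return 2e-3 * min(step ** -0.5, step * warmup ** -1.5)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# usage: the base lr is 1.0 so the lambda yields the absolute learning rate
model = torch.nn.Linear(4, 4)  # stand-in for the summarization model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.999))
scheduler = make_scheduler(optimizer)
```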
The length penalty factor β in Eq. 3 is assigned the same value as in the original beam search. For the dynamic margin estimation with bootstrap re-sampling described in Algorithm 1, the function g(·) is defined as follows:

$$g(\mathbf{x}) = \frac{\max(\mathbf{x}) - \min(\mathbf{x})}{num(\mathbf{x}) - 1}$$

where num(·) represents the number of sample points, and max(·) and min(·) are the maximum and minimum values in the sample group, respectively. Since there are only a limited number of sample points (i.e., the probability masses of candidate summaries), we additionally impose a boundary constraint to mitigate the impact of outliers (see Tab. 5).

Figure 2 :
Figure 2: Comparison of reranking high-quality summaries. The blue is only trained with MLE loss; the purple represents the model that additionally considers the relative order of probabilities assigned to these summaries; the brown (ours) takes both the relative order and absolute position of these probabilities into account.

Figure 3 :
Figure 3: Comparison of two kinds of decoder self-attention patterns. The text span $[x_1, x_2, \dots, x_9]$ is prepended with the start token [S] as input and appended with the end token [E] as output. Grey areas are masked out with the causal mask (ignoring the padding mask). Compared to (a), in (b) we mask out attention weights in the red areas, corresponding to randomly selected token positions on the "Key" side. Note that the blue areas should not be involved, since the start token is always present at inference time.

Figure 4 :
Figure 4: The AVG ROUGE scores (the average of R-1, R-2, and R-L F1 scores) of various systems fine-tuned with 10, 100, 1000, and 10000 training examples.

Figure 5 :
Figure 5: The AVG ROUGE scores (the average of R-1, R-2, and R-L F1 scores) on the CNNDM test set with different beam widths.

Figure 7 :
Figure 7: Examples of summaries generated by SimMCS-SDSA trained on CNNDM. The sentence in green is included in the SimMCS-SDSA summary, while the one in red is discarded.
Figure 1: The overall training pipeline of SimMCS with SDSA. The positive group contains high-quality summaries generated by a strong summarizer given a document, while the negative group includes low-quality summaries weakly related to the document. We apply SDSA to the decoder self-attention module during training. We conduct B rounds of bootstrap re-sampling in the positive group and estimate M, which represents the average adjacent spacing of the positive samples. The ideal effect of our approach is illustrated in the right box.

Table 1 :
Average results on the CNNDM test set. "*" denotes results from our own evaluation script. R-1/2/L are ROUGE-1/2/L F1 scores (p < 0.01). BS and BaS refer to the neural model-based metrics BERTScore and BARTScore, respectively. The best results are bolded.

Table 2 :
Average results on the XSum test set. "*" denotes results from our own evaluation script. R-1/2/L are ROUGE-1/2/L F1 scores (p < 0.01). BS and BaS refer to the neural model-based metrics BERTScore and BARTScore, respectively. The best results are bolded.

Table 3 :
Ablation study results. Performance changes compared with the full model are reported. Larger decreases in metrics are shaded with darker red and larger increases with darker green.

Table 4 :
Statistics of used datasets.