Enhancing Abstractiveness of Summarization Models through Calibrated Distillation

Sequence-level knowledge distillation reduces the size of Seq2Seq models for more efficient abstractive summarization. However, it often leads to a loss of abstractiveness in summarization. In this paper, we propose a novel approach named DisCal to enhance the level of abstractiveness (measured by n-gram overlap) without sacrificing the informativeness (measured by ROUGE) of generated summaries. DisCal exposes diverse pseudo summaries with two types of supervision to the student model. First, the best pseudo summary is identified in terms of abstractiveness and informativeness and used for sequence-level distillation. Second, their ranks are used to ensure that the student model assigns higher prediction scores to summaries with higher ranks. Our experiments show that DisCal outperforms prior methods in abstractive summarization distillation, producing highly abstractive and informative summaries.


Introduction
Text summarization is the task of generating a concise and condensed summary of a source document while preserving its most important information (Gupta and Gupta, 2019). Unlike extractive summarization, which involves selecting and concatenating sentences from the original document (Nallapati et al., 2017), abstractive summarization is a sequence-to-sequence (Seq2Seq) problem that can generate novel phrases and sentences that were not present in the original document (Nallapati et al., 2016; Paulus et al.; Fan et al., 2018; Gupta and Gupta, 2019). Recent advances in large pre-trained language models have greatly accelerated summarization modeling progress (Lewis et al., 2020; Zhong et al., 2022), but there remain significant concerns about deploying large models in real use cases due to their slow inference speed in production environments.
Knowledge distillation is a widely used technique for compressing a large model into a smaller one for faster inference with minimal performance loss (Ba and Caruana, 2014; Hinton et al., 2015; Chen et al., 2020). A prominent direction for abstractive summarization is known as sequence-level distillation (Kim and Rush, 2016; Zhang et al., 2022a). This method involves generating a pseudo summary for each training document using a teacher model, and training a student model on pairs of training documents and their corresponding pseudo summaries. Compared to methods that rely on word-level information, such as minimizing the cross-entropy loss between teacher and student prediction distributions (Gou et al., 2021), this approach enables the student to better mimic the teacher model's generation at the sequence level.

Input: Jose Mourinho has lauded Chelsea's consistency, with a hint of caution, as his side bid to wrap up a wire-to-wire Premier League victory. (...) Chelsea have topped the Premier League since the opening day but Jose Mourinho (left) will remain focused. Blues captain John Terry has been a pivotal (...)

Seq-level Distil: Chelsea have topped the Premier League since the opening day. Jose Mourinho has praised Chelsea's consistency, with a hint of caution, as his side bid to wrap up a wire-to-wire Premier League victory. The Blues have led or shared the lead since the opening round of fixtures and entered this weekend's matches seven points clear with eight matches remaining.

Figure 1: Summaries generated from the models without using knowledge distillation, "w.o. Distil"; using sequence-level distillation, "Seq-level Distil" (Zhang et al., 2022a); and using Calibrated Distillation ("DisCal", ours) on CNNDM data. Fragments from the input are color-coded to indicate overlap: green, yellow, and red for over three, five, and ten tokens, respectively.
Despite the high ROUGE (Lin, 2004) score achieved through sequence-level distillation, we argue that the pseudo summary generated by the teacher model exacerbates the student model's tendency to copy continuous text segments from the source documents, thus intensifying the problem

Figure 2: Overview of DisCal: The teacher efficiently transfers its knowledge through two approaches: firstly, employing sequence-level distillation, utilizing the best pseudo summary in terms of abstractiveness and informativeness, and secondly, applying output calibration, making higher-ranked summaries receive correspondingly higher predicted scores.
of copy bias during summary generation. As seen in Figure 1, relying solely on the teacher's pseudo summaries for distilling knowledge, without utilizing gold summaries, compels the student model to generate extractive-like summaries due to the inherent copy bias (see "Seq-level Distil"). Thus, the level of abstractiveness remains limited, hindering the student model's capacity to produce truly informative and coherent abstractive summaries. This trade-off between informativeness and abstractiveness is a significant challenge in abstractive summarization (Zhang et al., 2018; Lin and Ng, 2019), yet it has not been addressed within the context of knowledge distillation.
In this paper, we present the notion of calibrated distillation, which entails distilling knowledge by precisely calibrating the pseudo summaries provided by the teacher. Our proposed method, referred to as DisCal and illustrated in Figure 2, leverages the teacher model as a dynamic summary generator that produces diverse pseudo summaries for each input text. To enhance this diversity, we dynamically manipulate the attention temperature of the teacher model throughout the distillation process, mitigating copy bias by exposing numerous summaries to the student model.
To evaluate the quality of the pseudo summaries, we employ a ranking system based on two factors: informativeness, which is assessed using the ROUGE score, and abstractiveness, which is measured by the ratio of novel n-grams in a summary that are not present in the input text. For knowledge distillation, we select the best pseudo summary in terms of the two factors to supervise the student model through sequence-level distillation. Additionally, the ranking of the summaries is used to calibrate the student model, ensuring that it assigns higher prediction scores to summaries with higher ranks. By doing so, DisCal enhances the level of abstractiveness and improves the ROUGE score, showing promising potential even when the gold summaries in training data are less abstractive.

Related Work
Large pre-trained Seq2Seq models have emerged as the de facto standard for abstractive summarization due to their exceptional performance and versatility. They excel in capturing the salient information from documents through various techniques. For instance, T5 (Raffel et al., 2020) predicts corrupted text spans, BART (Lewis et al., 2020) employs denoising auto-encoding, PEGASUS (Zhang et al., 2020) identifies the most summary-worthy sentences, and DialogLED (Zhong et al., 2022) employs window-based denoising.
Due to the high computational cost associated with large models, there has been a surge of research focused on compressing these models (Gou et al., 2021; Frantar et al., 2023). One prominent approach in this field is known as knowledge distillation, which involves training a smaller student model to mimic the predictions of a larger teacher model by minimizing the difference between the teacher and student predictions (Ba and Caruana, 2014; Hinton et al., 2015; Chen et al., 2020). Particularly, in the context of abstractive summarization distillation with Seq2Seq models, Kim and Rush (2016) proposed the sequence-level knowledge distillation approach, which involves training a student model using pseudo summaries generated by the teacher model with beam search decoding. On the other hand, Shleifer and Rush (2020) proposed the shrink and fine-tune (SFT) framework. This approach involves removing certain layers from the teacher model to create a smaller student model, which is then fine-tuned using gold summaries. In a recent study, Zhang et al. (2022a) introduced the method PLATE, which aims to smooth attention distributions of teacher models during pseudo summary generation and then fine-tune the shrunken student model with them.
In addition, there is an interesting line of work called model calibration (Liu and Liu, 2021; Zhang et al., 2022b; Liu et al., 2022), which leverages different candidate summaries to calibrate the model's predictions to overcome the problem of exposure bias (Bengio et al., 2015). In contrast to prior research, our work focuses on the previously overlooked problem of decreased abstractiveness when distilling summarization models. We propose a solution called Calibrated Distillation, which achieves a high level of informativeness and abstractiveness using a smaller model.

Seq2Seq Abstractive Summarization
Abstractive summarization aims at generating a concise summary of a given document or text using novel phrases and sentences. The objective is to learn a neural Transformer model Θ¹ that receives a source document X = {x_1, x_2, ..., x_|X|} and generates its appropriate summary Y = {y_1, y_2, ..., y_|Y|}, where x_t and y_t are the word tokens in the document and its summary at time t, respectively.
For this objective, the Seq2Seq Transformer can be trained to maximize the conditional probability:

P(Y | X; Θ) = ∏_{t=1}^{|Y|} p(y_t | Y_{<t}, X; Θ), (1)

where the notation Y_{<t} represents all word tokens preceding the position t. Consequently, the model is updated to minimize the negative log-likelihood loss (NLL) for each pair of input document X and its gold summary Y* in the training data:

ℓ_NLL(X, Y*) = − Σ_{t=1}^{|Y*|} log p(y*_t | Y*_{<t}, X; Θ). (2)
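The NLL objective above can be illustrated with a minimal pure-Python sketch; the helper name `nll_loss` and the toy probability lists are ours for illustration, standing in for the model's per-token probabilities along the gold sequence:

```python
import math

def nll_loss(token_probs):
    """Negative log-likelihood of a summary, given the model's probability
    p(y_t | Y_<t, X) assigned to each gold token in order."""
    return -sum(math.log(p) for p in token_probs)

# A model that is confident on each gold token incurs a low loss.
confident = nll_loss([0.9, 0.8, 0.95])
uncertain = nll_loss([0.3, 0.2, 0.25])
print(confident < uncertain)  # True
```

In practice this loss is computed over vocabulary-sized softmax distributions and averaged over a mini-batch, but the quantity being minimized is exactly this sum of per-token negative log-probabilities.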

Sequence-level Knowledge Distillation
Let Θ_t and Θ_s be the teacher and student models, where the student must be smaller in size compared to the teacher. Given the teacher model Θ*_t trained by Eq. (2), the teacher's output distribution for the document X is approximated by the pseudo summary Ỹ = {ỹ_1, ỹ_2, ..., ỹ_|Ỹ|}, which is the output from running beam search with the teacher model (Kim and Rush, 2016), where the summary with the highest beam score is selected. Therefore, the student model is trained to mimic the teacher's summary generation process by minimizing the NLL loss on the teacher-generated summary Ỹ, i.e., ℓ_NLL(X, Ỹ). The gold summary Y* in the training data is no longer used in sequence-level knowledge distillation.

¹ We mainly focus on Seq2Seq Transformer models (Vaswani et al., 2017) for abstractive summarization.
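The data flow of sequence-level distillation can be sketched as follows; `teacher_generate` is a hypothetical stand-in for beam-search decoding with the trained teacher, and the toy "first three words" teacher is ours, not an actual summarizer:

```python
def build_distillation_pairs(documents, teacher_generate):
    """Sequence-level KD: the student's training targets are teacher outputs,
    not the gold summaries, which are discarded entirely."""
    return [(doc, teacher_generate(doc)) for doc in documents]

# Toy stand-in for beam search: "summarize" by keeping the first 3 words.
toy_teacher = lambda doc: " ".join(doc.split()[:3])

pairs = build_distillation_pairs(["a b c d e", "f g h i"], toy_teacher)
# The student would now minimize the NLL loss on these (document, pseudo
# summary) pairs.
print(pairs)  # [('a b c d e', 'a b c'), ('f g h i', 'f g h')]
```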

Methodology
We introduce a new distillation method DisCal (Abstractive Summarization Distillation with Calibration) in this section. Briefly speaking, the dynamic summary generator (Section 4.1) provides a list of feasible pseudo summaries for each document, and the calibrated distillation (Section 4.2) tunes the student model to output summaries with high informativeness and abstractiveness.

Dynamic Summary Generator
In sequence-level knowledge distillation, using a single deterministic pseudo summary generated by the trained teacher model is sub-optimal. This approach limits the exposure of the model to diverse valid summaries for a given input document (Liu et al., 2021a), leading to reduced abstractiveness. Additionally, it easily propagates incorrect predictions from the teacher model to the student model due to overly confident predictions (Guo et al., 2020; Liang et al., 2022). Consequently, this can lead to poor performance in generating accurate abstractive summaries.
To address these issues, we utilize the teacher model as a dynamic summary generator (see Figure 2), enabling it to generate diverse pseudo summaries in real-time during the distillation process. This is achieved by employing the diverse beam search technique (Vijayakumar et al., 2018) and randomly re-scaling its attention temperature within a predefined range. Manipulating attention weights in all attention modules is recognized for its effectiveness in mitigating the copy bias in the generated summary (Zhang et al., 2022a). Therefore, we randomly re-scale the attention temperature of the Transformer model as:

Attention(Q, K, V) = softmax(QKᵀ / (k·√d)) V, (3)

where Q, K, V are linear projections of hidden states of each Transformer layer, d is the hidden dimension, k is a re-scaling factor randomly drawn from the uniform distribution U(1, γ), and γ is the maximum value for re-scaling. Thus, the teacher model generates a list of n pseudo summaries via diverse beam search, Ỹ = {Ỹ_1, Ỹ_2, ..., Ỹ_n}, and these summaries differ even for the same document according to which re-scaling factor is chosen. Examples of pseudo summaries can be found in Appendix A.
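The effect of the re-scaling factor k can be sketched in pure Python; the helper names and the toy score vector are ours, and a real implementation would apply the scaling inside every attention module of the Transformer:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def scaled_attention_weights(scores, d, k=1.0):
    """Attention weights softmax(q.K^T / (k * sqrt(d))): a factor k > 1
    flattens the distribution, discouraging copying from one source token."""
    return softmax([s / (k * math.sqrt(d)) for s in scores])

def entropy(p):
    return -sum(x * math.log(x) for x in p)

scores = [4.0, 1.0, 0.5]  # toy query-key dot products over 3 source tokens
sharp = scaled_attention_weights(scores, d=16, k=1.0)
flat = scaled_attention_weights(scores, d=16, k=2.0)  # k ~ U(1, γ) in DisCal
print(entropy(flat) > entropy(sharp))  # True: higher temperature, flatter attention
```

Because k is re-drawn per generation, the same document yields different attention sharpness and hence different pseudo summaries across distillation steps.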

Calibrated Distillation
We introduce a new concept of distillation named calibrated distillation, which is built on the standard sequence-level distillation pipeline but differs in considering two more aspects: (1) we utilize the gold summary to identify the most reliable pseudo summary from the summary list Ỹ, and (2) we calibrate the student model's output such that it can generate summaries with high informativeness and abstractiveness. Specifically, given the list of pseudo summaries Ỹ, DisCal evaluates and ranks the n summaries in the list in terms of informativeness and abstractiveness. We define the calibration score to evaluate the ranks by employing ROUGE and novel n-gram scores, which are respectively for informativeness and abstractiveness², as in Definition 4.1.
Definition 4.1. Let Ỹ = {Ỹ_1, Ỹ_2, ..., Ỹ_n} be the list of n pseudo summaries for an input document. Then, the informativeness score s_info(Ỹ_i) for the i-th pseudo summary is the average of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores on the gold summary Y*, and the abstractiveness score s_abs(Ỹ_i) for the i-th pseudo summary is the average of novel 1-gram, 3-gram, and 5-gram scores with respect to the input document X. Hence, the calibration score is formulated by the weighted sum of the two scores normalized over the n pseudo summaries in the list Ỹ as:

s_calib(Ỹ_i) = (1 − λ) · s̄_info(Ỹ_i) + λ · s̄_abs(Ỹ_i), (4)

where s̄ denotes the score normalized over the n pseudo summaries, and λ is the balancing term of s_info and s_abs, adjusting the importance of the two factors.

² The novel n-gram score is the ratio of n-grams in the summary that do not appear in the input document, which is widely used to measure abstractiveness in the literature (Liu and Lapata, 2019; Zhang et al., 2022a; Dreyer et al., 2023).
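A rough pure-Python sketch of the scoring is below. The function names, the toy score lists, and the choice of min-max normalization are ours (the paper only states that the scores are normalized over the n candidates); real s_info values would come from a ROUGE implementation:

```python
def novel_ngram_ratio(summary, document, n):
    """Fraction of the summary's n-grams absent from the document
    (one component of the abstractiveness score)."""
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    s, d = grams(summary.split()), grams(document.split())
    return len(s - d) / max(len(s), 1)

def minmax(xs):
    """Normalize scores over the candidate list (min-max, our assumption)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

def calibration_scores(s_info, s_abs, lam=0.2):
    """Weighted sum of normalized informativeness and abstractiveness."""
    return [(1 - lam) * i + lam * a
            for i, a in zip(minmax(s_info), minmax(s_abs))]

# Toy scores for n = 3 pseudo summaries.
s_info = [0.40, 0.35, 0.45]
s_abs = [0.10, 0.60, 0.20]
scores = calibration_scores(s_info, s_abs)
best = max(range(len(scores)), key=lambda i: scores[i])
print(best)  # 2: with lambda = 0.2, the most informative candidate wins
```

With a small λ, informativeness dominates the ranking, matching the λ = 0.2 setting used in the experiments.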
Accordingly, we now obtain a list of ranked pseudo summaries Ỹ′ = {Y′_1, Y′_2, ..., Y′_n}, sorted in ascending order of the calibration score. We use this updated list for calibrated knowledge distillation.
Firstly, the summary Y′_n is selected as the best summary among all pseudo summaries in Ỹ′ since it exhibits the highest calibration score s_calib. Hence, we employ Y′_n as the target summary for guiding the student model through sequence-level knowledge distillation. The student model learns from the teacher's knowledge by minimizing a modified NLL loss. In this case, the loss equation remains identical to Eq. (2), but the target is substituted with the rank-1 pseudo summary, denoted as ℓ_NLL(X, Y′_n). By incorporating the ROUGE score into the assessment process, we ensure that the selected summary has a high level of informativeness.
Secondly, motivated by work that leverages the order of candidate summaries (Zhang et al., 2022b; Liu et al., 2022), we encourage the student model to assign higher estimated probabilities to higher-ranked summaries. For a given pseudo summary Ỹ, the length-normalized estimated log-probability (Liu et al., 2022) by the student model is formulated as:

f(Ỹ) = (1 / |Ỹ|^α) Σ_{t=1}^{|Ỹ|} log p(ỹ_t | Ỹ_{<t}, X; Θ_s), (5)

where Ỹ = {ỹ_1, ỹ_2, ..., ỹ_|Ỹ|} and α is a length penalty hyperparameter similarly used for beam search. Then, our calibration loss is formulated by using the margin-based pairwise ranking loss (Hopkins and May, 2011) as:

ℓ_cal = Σ_i Σ_{j>i} max(0, f(Ỹ_i) − f(Ỹ_j) + m_ij), (6)

where m_ij = (j − i) · m represents the margin multiplied by the difference in rank between two pseudo summaries. Intuitively, this encourages the student model's log-probability f(Ỹ_j) to be greater than f(Ỹ_i) since s_calib(Ỹ_j) > s_calib(Ỹ_i), thereby generating summaries with high levels of informativeness and abstractiveness.
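The length-normalized score and the pairwise ranking loss described above can be sketched in pure Python; the function names, the margin value, and the toy log-probabilities are ours for illustration:

```python
def length_normalized_logprob(token_logprobs, alpha=1.0):
    """Sum of the student's token log-probs divided by |Y|^alpha."""
    return sum(token_logprobs) / (len(token_logprobs) ** alpha)

def calibration_loss(f_scores, margin=0.001):
    """Margin-based pairwise ranking loss. f_scores is ordered by ascending
    calibration score, so f_scores[j] should exceed f_scores[i] for j > i
    by at least the rank-difference margin (j - i) * margin."""
    loss = 0.0
    n = len(f_scores)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, f_scores[i] - f_scores[j] + (j - i) * margin)
    return loss

# Student scores that already follow the calibration ranking incur no loss...
print(calibration_loss([-2.0, -1.0, -0.5]))  # 0.0
# ...while misordered scores are penalized.
print(calibration_loss([-0.5, -1.0, -2.0]) > 0)  # True
```

The overall training objective then combines this term with the NLL loss on the best pseudo summary, with the NLL term weighted by η.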
As a result, the student model is trained by combining the two loss objectives for sequence-level knowledge distillation and model output calibration, respectively, as:

ℓ_final = η · ℓ_NLL(X, Y′_n) + ℓ_cal, (7)

where η is the weight for the NLL loss.
Experiment Setup

Datasets. We evaluate DisCal on three summarization datasets:

• The CNNDM dataset comprises online news articles sourced from the CNN and DailyMail websites, each accompanied by corresponding highlight summaries for reference.
• The XSUM dataset contains online articles from BBC News with single-sentence summaries, which are more abstractive than those in CNNDM (Dreyer et al., 2023).
• The SAMSum dataset contains messenger-like conversations with summaries written by linguists. Unlike CNNDM and XSUM, SAMSum involves dialogue data that includes more than two participants.

The detailed statistics of the datasets can be found in Table 1. It is important to note that each dataset exhibits varying levels of abstractiveness in its gold summaries. XSUM and SAMSum exhibit a very high level of abstractiveness compared to CNNDM, probably because their summaries are very short, mostly a single sentence.
Teacher and Student Models. Following the literature (Zhang et al., 2022a), we consider BART Large (Lewis et al., 2020), one of the widely used Seq2Seq Transformer architectures for abstractive summarization. The BART Large model is trained on the entire dataset with gold summaries as a teacher model. Then, we configure two student models with Transformer encoder layers identical to the teacher's, but differing in the number of decoder layers: BART 12-6 and BART 12-3, with six and three decoder layers, respectively. Following the SFT pipeline (Shleifer and Rush, 2020), the student models are initialized from the 12-encoder-layer/12-decoder-layer teacher. Both student models copy the full encoder from the teacher model, while the decoder of BART 12-6 is copied from decoder layers {0, 2, 4, 6, 8, 10} of the teacher, and the decoder of BART 12-3 from decoder layers {0, 5, 11}. This initialization is simple but effective since it eliminates the need for separately pre-training the two student models.
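The SFT-style layer copying amounts to selecting a subset of the teacher's decoder layers by index. A minimal sketch, with strings standing in for actual Transformer layer modules whose weight tensors would be copied in practice:

```python
def init_student_decoder(teacher_decoder_layers, keep_indices):
    """SFT-style initialization: the student decoder reuses a subset of the
    teacher's decoder layers (the encoder is copied in full, not shown)."""
    return [teacher_decoder_layers[i] for i in keep_indices]

teacher_decoder = [f"dec_layer_{i}" for i in range(12)]
bart_12_6 = init_student_decoder(teacher_decoder, (0, 2, 4, 6, 8, 10))
bart_12_3 = init_student_decoder(teacher_decoder, (0, 5, 11))
print(len(bart_12_6), len(bart_12_3))  # 6 3
```

Spreading the kept indices across the depth of the teacher (rather than taking the first few layers) preserves both early and late decoder representations.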
In Appendix D, we validate the generalizability of DisCal on a different state-of-the-art Seq2Seq model, DialogLED (Zhong et al., 2022).
Algorithms. We compare DisCal with three prior knowledge distillation approaches: shrink and then fine-tune (SFT) (Shleifer and Rush, 2020) and two sequence-level knowledge distillation methods, Seq-Distil (Kim and Rush, 2016) and PLATE (Zhang et al., 2022a). The SFT method trains the student model on pairs of documents and their corresponding gold summaries without using pseudo summaries. The other two methods rely only on the pseudo summary generated by the teacher model using beam search decoding; PLATE differs from Seq-Distil in scaling up the teacher's attention temperature during pseudo summary generation. We re-implement all compared methods and train them in the same environment using eight NVIDIA V100 GPUs and PyTorch 1.13.1 (Paszke et al., 2019).
Implementation Details. Similar to recent studies (Rohde et al., 2021; Zhang et al., 2022a), we train BART Large (the teacher model) using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 5e-5 and label smoothing of 0.1. The teacher model is trained for 20,000 steps on CNNDM and XSUM with a weight decay of 0.001 and a batch size of 64, and for 5,000 steps on SAMSum with a weight decay of 0.1 and a batch size of 16. We use the same training configuration for the two student models for Seq-Distil and PLATE.
As for our hyperparameters, we tune them on validation sets. The maximum value γ for re-scaling in Eq. (3), the balancing term λ for the calibration score in Eq. (4), and the weight for the NLL loss in Eq. (7) are respectively set at 2.0, 0.2, and 0.01 on CNNDM; 1.5, 0.2, and 1.0 on XSUM; and 1.5, 0.2, and 0.1 on SAMSum. The number of pseudo summaries n per document is set at 6. The detailed implementation, including hyperparameter settings for all methods, is provided in Appendix B.1.

Results on News Summarization
Tables 2 and 3 summarize the results obtained from the two news summarization datasets. The first block shows the performance of the teacher model (BART Large), while the second and third blocks include the results achieved by the two student models (BART 12-6 and BART 12-3) trained using four different knowledge distillation methods.
In general, DisCal exhibits the best performance in terms of informativeness and abstractiveness on both datasets. Particularly, DisCal shows significant performance improvements on the CNNDM dataset. This dataset, as indicated in Table 1, exhibits a low level of abstractiveness in its gold summaries, leaving ample room for improvement. The two student models with DisCal even surpass the performance of their teacher model by a large margin. On the other hand, the two existing sequence-level distillation methods, Seq-Distil and PLATE, sacrifice the level of abstractiveness compared to the teacher model and SFT. For the XSUM dataset, we observe a similar trend of the highest ROUGE alongside better novel n-gram scores than the two sequence-level distillation methods. The less significant improvement in abstractiveness stems from the short summaries of the XSUM dataset.

Table 4 shows the results on the dialogue SAMSum dataset. DisCal maintains its performance dominance compared to the other distillation approaches. Similar to the CNNDM dataset, BART 12-6 with DisCal exhibits ROUGE and novel n-gram scores higher than its teacher model.

Inference Latency
Table 5 summarizes the number of trainable parameters and the inference latency of the BART models we used. By reducing the number of decoder layers, the parameter size decreases from 406M to 255M. Notably, as the decoder is the most computationally intensive component during inference due to auto-regressive decoding, the student models demonstrate a significantly lower inference latency compared to the teacher model. Specifically, BART 12-6 and BART 12-3 achieve inference speeds 1.43 to 2.03 and 2.36 to 3.08 times faster than BART Large, respectively. Despite the faster inference speed, the student models enhanced with DisCal exhibit comparable or superior performance to BART Large in generating informative and highly abstractive summaries, as demonstrated by the results presented in Tables 2, 3, and 4.

Detailed Analysis on Main Component
We perform a detailed analysis of DisCal on CNNDM data using BART 12-3.

Loss Component Ablation Study
We perform an ablation study on DisCal by gradually adding each loss component on top of the SFT method. The results are shown in Table 6. Firstly, we use the NLL loss in Eq. (2) with λ = 0.0, where we consider only the informativeness score s_info for selecting the best summary without utilizing the calibration loss from Eq. (6). In this setting, although the ROUGE score exhibits a considerable improvement, the novel 5-gram score drops. Secondly, when increasing the λ value to 0.2, where the best pseudo summary is selected by considering both informativeness and abstractiveness, the ROUGE and novel 5-gram scores are both improved. Next, by utilizing both the NLL loss and the calibration loss, we observe further enhancements in the ROUGE and novel 5-gram scores due to their synergistic impact. However, using only the calibration loss does not work as there is no supervision from the NLL loss for summary generation.

Input (example document for Table 9): Parts of Miami-Dade County's skyline was hidden from view Monday as smoke from a growing 1,850-acre wildfire loomed over portions of the Florida county. What started as a nonthreatening and seemingly shrinking grass fire on Sunday, consuming fewer than 100 acres according to Miami-Dade Fire Rescue Battalion Chief Al Cruz, grew to be more than 10 times that within the next 24 hours. By Monday night, the fire had burned nearly 2,000 acres and was 50% contained, the fire department said. High temperatures and gusty winds helped the fire spread, State Forester Jim Karels said. Several fire units and a helicopter with the capacity to drop 400 gallons of water at a time were battling the blaze, Cruz said. "The Florida Forest Service and Miami-Dade Fire Rescue have worked around the clock to protect Southwest Miami-Dade County," Florida Agriculture Commissioner Adam H. Putnam said in a statement. Early Monday night, officials were considering road closures, and one school, Lincoln Marti, was evacuated as a precaution, according to the Fire Department.
The wildfire started in Miami-Dade County on Sunday. By Monday night, it had grown to nearly 2,000 acres. The fire was 50% contained, officials said.

Gold
The fire has burned nearly 2,000 acres and is 50% contained. High temperatures and gusty winds helped the fire spread, state forester says. One school is evacuated as a precaution. The fire is 50 percent contained. Florida agriculture commissioner: The fire is around the clock

The fire has burned nearly 2,000 acres and is 50% contained. High temperatures and gusty winds helped the fire spread, State Forester Jim Karels says. Several fire crews and a helicopter with the capacity to drop 400 gallons of water at a time are battling the blaze.

The fire has burned nearly 2,000 acres and is 50% contained, fire officials say. The fire started as a nonthreatening grass fire on Sunday. A school is evacuated as a precaution. Firefighters are battling the blaze in Miami-Dade County.

A grass fire in Miami-Dade County has burned nearly 2,000 acres and is 50% contained. High temperatures and gusty winds helped the fire spread, State Forester Jim Karels says. Several fire units and a helicopter are battling the blaze.

Balancing Term for Calibration Score
The hyperparameter λ in Eq. (4) balances the importance between informativeness and abstractiveness in evaluating pseudo summaries. The higher the λ value, the greater the importance of abstractiveness in the score. Table 7 demonstrates how the ROUGE and novel 5-gram scores are affected by adjusting the λ value. With increasing λ values, we observe a trade-off between the levels of informativeness and abstractiveness in the summaries.
Placing excessive weight on abstractiveness compromises the level of informativeness in the summary: the abstractiveness improves while the informativeness drops considerably.

Number of Pseudo Summaries
Intuitively, increasing the number n of pseudo summaries from the dynamic teacher model provides more performance gain with DisCal. Table 8 shows how increasing n affects the performance of DisCal. As the number increases, DisCal generates better summaries in terms of the ROUGE-1 and novel 5-gram scores, while the ROUGE-2 score begins to drop slightly when n is greater than 6. Therefore, in general, having more pseudo summaries helps generate highly abstractive summaries without sacrificing much informativeness.

The two sequence-level distillation approaches, Seq-Distil and PLATE, generate more informative summaries compared to SFT, as they exhibit a high informativeness score s_info. However, they sacrifice the abstractiveness score s_abs, which is lower than that of SFT. This indicates that they copy a large portion of their summaries from the input document (see the red fragments shown in Table 9). In contrast, DisCal is not only robust against copy bias but also achieves a very high informativeness score compared to other methods. We provide additional examples in Appendix C.

Self-Calibrated Distillation
Our method has potential for leveraging the enhanced student model as a self-teacher for subsequent training. Here, we set λ = 0.0 in this bootstrapping experiment since the student model has already been trained with DisCal (λ = 0.2). Table 10 presents the results before and after self-calibration using BART 12-3 trained with DisCal on CNNDM. While ROUGE scores improve, novel 5-gram scores decrease. Thus, a λ value greater than 0.0 is still necessary to maintain high abstractiveness.

Human-like Evaluation using GPT-4
We conduct human-like evaluation using G-EVAL (Liu et al., 2023). This is a novel LLM-based evaluation approach employing GPT-4, outperforming all prior automated methods and displaying a substantial Spearman correlation of 0.513 with human scores in summarization tasks. We use exactly the same prompt suggested by the authors, employing a scale of 1 (worst) to 5 (best) for consistency, coherence, and relevance, and 1 (worst) to 3 (best) for fluency. Table 11 shows the results of four distillation models using BART 12-6, including their teacher BART Large, on CNNDM. Our analysis yields two insights. Firstly, all distillation methods have only a slight impact on consistency, coherence, relevance, and fluency; up to a 0.18 difference compared to the teacher. This likely stems from the use of teacher-generated pseudo summaries, which effectively prevents performance divergence in student models. Secondly, DisCal enhances abstractiveness while maintaining high consistency. This is achieved through the integration of ROUGE (between pseudo and gold summaries) in summary selection and output calibration, ensuring that the student model retains crucial content from the gold summary during training.

Comparison with Paraphrasing
Paraphrasing is a simple method for enhancing the abstractiveness of provided summaries (Li et al., 2022; Zhou and Bhat, 2021). Hence, we evaluate one of the paraphrasing techniques, known as back-translation. We utilize Amazon Translate³, a fluent and accurate machine translation service, and explore a form of back-translation: English → German → English. Table 12 summarizes the impact of back-translation on CNNDM when it is applied to the model output generated by SFT, Seq-Distil, and PLATE using BART 12-6.
The results demonstrate that back-translation effectively enhances the abstractiveness of existing methods, yet it noticeably reduces informativeness (i.e., ROUGE) compared to not using it. In contrast, our approach, DisCal, strikes a more favorable balance between informativeness and abstractiveness through the proposed calibrated distillation, resulting in improvements in both aspects.

³ https://aws.amazon.com/translate/

Conclusion
We propose DisCal, an improved knowledge distillation method that leverages diverse candidate summaries generated by the teacher model. By evaluating and ranking pseudo summaries during distillation, DisCal chooses the best summary in terms of informativeness and abstractiveness, and further enhances model predictions through output calibration. Experiments on three summarization datasets demonstrate that DisCal produces summaries with a higher level of abstractiveness as well as informativeness.

Limitations
DisCal introduces additional training overhead. Generating pseudo summaries from the teacher model involves beam search decoding, which is computationally intensive compared to simple teacher forcing. However, this computational overhead in the training phase does not affect inference at test time, i.e., DisCal does not require any changes in inference.
Regarding the training overhead, some recent studies show that beam decoding can be expedited using techniques such as early exiting (Liu et al., 2021b; Schuster et al., 2022) and parallel decoding (Santilli et al., 2023). These studies show great potential for alleviating the burden associated with beam decoding during training.

Ethics Statement
This paper focuses on general abstractive summarization and knowledge distillation, introducing a novel calibrated distillation method that produces summaries with high levels of abstractiveness and informativeness. To evaluate our method, we use public benchmark datasets, i.e., CNNDM, XSUM, and SAMSum. Therefore, we do not anticipate any negative ethical or social impact.

B.1 Implementation Details

For PLATE, we test two different attention temperatures, namely 1.5 and 2.0, and choose the best setup for each dataset (found by using validation data). As recommended by the original paper (Zhang et al., 2022a), the best attention temperatures are 2.0 and 1.5 for CNNDM and XSUM data, respectively. In addition, PLATE shows the best performance when the temperature is 1.5 for SAMSum.
In the DisCal phase, we train a student model for 5,000 steps on CNNDM; 20,000 steps on XSUM; and 3,000 steps on SAMSum. The maximum value γ for re-scaling in Eq. (3), the balancing term λ for the calibration score in Eq. (4), and the weight for the NLL loss in Eq. (7) are set to 2.0, 0.2, and 0.01, respectively, on CNNDM; 1.5, 0.2, and 1.0 on XSUM; and 1.5, 0.2, and 0.1 on SAMSum. The number of pseudo summaries n per document is set to 6.

B.2 Evaluation Metric
We report two distinct metrics, ROUGE F1 (Lin, 2004) and novel n-gram scores (Liu and Lapata, 2019; Zhang et al., 2022a; Dreyer et al., 2023), which are used for assessing the level of informativeness and abstractiveness, respectively. Specifically, we use three ROUGE F1 scores in our experiments: (1) ROUGE-1 refers to the word overlap of unigrams between the gold and generated summary; (2) ROUGE-2 refers to the word overlap of bigrams between them; and (3) ROUGE-L refers to the longest common subsequence based statistics.
For abstractiveness, we compute the novel n-gram score, which refers to the percentage of novel n-grams in the summary that do not appear in the input document. We use three different n-gram sizes for evaluation: (1) novel 1-grams; (2) novel 3-grams; and (3) novel 5-grams.

C Additional Qualitative Analysis
w.o. Distil (ROUGE-1: 30.8 / Novel 5-Gram: 82.8): Chelsea are top the Premier League since opening day. Chelsea have led or shared the lead since opening round of fixtures. The Blues have been a key figure in keeping the side in consistent form. Jose Mourinho says his side will remain focused. Manchester City slip up at Crystal Palace last week to end their title hopes.

Seq-level Distil (ROUGE-1: 39.2 / Novel 5-Gram: 27.3): see Figure 1.

DisCal (ROUGE-1: 42.0 / Novel 5-Gram: 90.1): Chelsea won the Premier League since the opening day. The Blues have been seven points clear with eight matches remaining. Jose Mourinho has praised the consistency and confidence. Chelsea face QPR at Loftus Road. Mourinho says Manchester City lost 2-1 at Crystal Palace.

Table 1 :
Summary of datasets. Novel n-gram scores are computed on pairs of input documents and their gold summaries. A higher n-gram score indicates that the gold summaries in the test set are more abstractive.

Table 2 :
Comparison on CNNDM data for news summarization. We reproduced all the methods. The reproduced BART Large shows a better ROUGE-1 score than the original implementation performance of 44.16.

Table 3 :
Comparison on XSUM data for news summarization. We reproduced all the methods. The reproduced BART Large shows a better ROUGE-1 score than the original implementation performance of 45.14.

Table 4 :
Comparison on SAMSum data for dialogue summarization. We reproduced all the methods.

Table 5 :
Number of parameters and latency (milliseconds per document) on V100 GPU with batch size 1.

Table 6 :
Ablation study by adding each loss component.

Table 7 :
Varying the term λ, which balances between abstractiveness and informativeness according to Eq. (4).

Table 8 :
Varying the number of pseudo summaries.

Table 9 :
Example of summaries generated from four different methods, including the gold summary, on CNNDM using BART 12-3. Fragments that overlap the input document by five or more words are marked in red. The values in parentheses under the method name are the informativeness and abstractiveness scores (s_info / s_abs).
Table 9 presents an example of generated summaries on the test data from CNNDM.

Table 12 :
Impact of using back-translation to existing distillation methods using BART 12-6 on CNNDM.
Table 14 provides additional examples of generated summaries using different distillation approaches for abstractive summarization on the CNNDM dataset.