Detection and Mitigation of the Negative Impact of Dataset Extractivity on Abstractive Summarization

In text summarization, extractivity is defined as a measurement of the degree of overlap between a source document and its summary. Previous research has shown that the extractivity level of training data can influence both output extractivity and the amount of factual information (i.e., faithfulness) in outputs for abstractive summarization. However, it remains unclear if and how extractivity impacts the performance of abstractive models. In this work, we investigate the relationship between dataset extractivity and model performance by comparing the performance of trained models under different degrees of extractivity. We find that while low levels of extractivity can improve performance, as extractivity increases, performance is negatively impacted. Furthermore, through an analysis of the model’s copy continuity of content, we discover that higher extractivity leads to a greater tendency for the model to copy text continuously from the source document rather than identifying and summarizing important content that should be covered in the target summary. To address these issues, we propose a simple and effective method to design copy labels for fixing the model’s copying behaviors and train the model with a copy mechanism. The experimental results illustrate the effectiveness of our strategy in alleviating the negative impact on model performance resulting from high dataset extractivity, and that our method outperforms several competitive baselines.


Introduction
Text summarization is the task of reducing the content of an original text while preserving its salient content (Gupta and Gupta, 2019). A multitude of techniques to generate summaries exists, which can be categorized into extractive and abstractive approaches. With extractive techniques, a summary is generated by extracting salient sentences from the text (Kasture et al., 2014). With abstractive techniques, new sentences, commonly called paraphrased sentences, are generated so that the resulting summaries contain new words and expressions that do not occur in the original text (Widyassari et al., 2020).
In addition to the extractivity of summarization models, previous work has proposed the idea of extractivity for summarization datasets, specifically in terms of the summarization styles in datasets (Grusky et al., 2018). More specifically, extractivity is a metric meant to capture the overlap between a text and its summary, and quantifies the extent to which a summary is a derivative of a text.
Summarization datasets typically contain aligned pairs of texts and human-generated or human-verified summaries. As an intrinsic property of summarization datasets, extractivity has been shown to influence various characteristics of output summaries, such as output extractivity (See et al., 2017; Zhang et al., 2018a) and the amount of factual information (i.e., faithfulness) (Ladhak et al., 2022). Generally, datasets with high extractivity are treated as suboptimal for abstractive summarization (Bommasani and Cardie, 2020), but there has been a lack of systematic research investigating the relationship between the magnitude of extractivity in training datasets and the performance of abstractive summarization models.
To address the lack of understanding about the relationship between extractivity in summarization training data and the performance of abstractive summarization models, in this work, we first develop a framework for evaluating the change in model performance over baseline models (control) at three levels of extractivity. Specifically, we split the training data into different subsets based on their increasing extractivity and then train the control models on each resulting subset. By comparing the results, we find that as the extractivity in training data increases, model performance initially improves but then drops when extractivity is too high. This naturally raises the next question, which we also address in this paper: Why and how does extractivity cause model performance to change?
Based on our observations, we posit that high extractivity in training data can lead to an increased likelihood of overfitting an abstractive model, whereby the model continuously copies from source documents. To validate this hypothesis, we propose new metrics to measure how much content in output summaries is copied continuously from a source text, and how important this continuously copied content is by comparing it with target summaries. We apply these metrics to outputs from the control models defined above. The results support our hypothesis and provide a new direction for mitigating the negative effects of high extractivity by promoting the selection of truly important content that is covered in target summaries for copying.
Through our analysis, we have identified a primary harm associated with highly extractive training data, namely the propensity for models to excessively rely on continuous copying rather than focusing on the essential content that should be included in summaries. To alleviate this potential harm, we propose and test a simple and effective strategy based on a copy mechanism (See et al., 2017). Our solution first identifies the salient parts in a source document that the model should copy into the summary. We do this by solving an integer linear program (ILP). We then create ground-truth labels based on the optimal solution for the copy distribution, with the goal of guiding the model to focus on the content it should copy rather than blindly copying continuous text spans from the source document. Lastly, we incorporate this copy mechanism into BART and train it with an auxiliary loss based on the copy distribution and the constructed copy labels. The experimental results demonstrate that our method not only is effective in mitigating the negative effects of high extractivity, but can also improve model performance, as it outperformed several recent competitive baselines.
In general, we aim to explore and answer the following research questions in this work:
• RQ1 How does dataset extractivity impact the performance of abstractive summarization models? (Section 2)
• RQ2 How does dataset extractivity cause the performance of abstractive models to change? (Section 3)
• RQ3 For cases when dataset extractivity hurts model performance, how can we mitigate this negative impact? (Section 4)

Extractivity Analysis
In this section, we introduce our research design, following a similar process as Ladhak et al. (2022), to answer our first research question: How does the extractivity in summarization datasets influence the performance of abstractive models?

Datasets and Extractivity
We selected two popular summarization datasets, one from the domain of news (i.e., CNN/DM) and one from academic publishing (i.e., arXiv/PubMed). We opted for these two datasets as they are similar in size and dataset extractivity, and thus would warrant a fair comparison. CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) is a dataset of news articles from CNN and Daily Mail. The summaries are formed from highlighted bullet points. There are 311,971 document-summary pairs in the dataset. arXiv/PubMed (Cohan et al., 2018) contains 346,187 papers from arXiv and PubMed. This dataset consists of the processed full text of papers as input documents, and the abstracts are used as summaries of the papers.
We used the common measurements of extractivity detailed in Grusky et al. (2018), and, following Ladhak et al. (2022), chose extractive fragment coverage. Coverage calculates the percentage of words in a summary that are from the source document. The higher the coverage of the summary, the higher the overlap of the summary with the source article.
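To make the coverage measure concrete, the following is a minimal sketch of the greedy fragment matching described by Grusky et al. (2018), assuming whitespace-tokenized inputs. The function names `extractive_fragments` and `coverage` are illustrative and are not taken from the authors' code.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest shared token sequences (fragments).

    `article` and `summary` are lists of tokens. At each summary position
    we take the longest match found anywhere in the article.
    """
    fragments = []
    i = 0
    while i < len(summary):
        best = 0
        for j in range(len(article)):
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
            i += best   # skip past the matched fragment
        else:
            i += 1      # this summary token does not occur in the article
    return fragments


def coverage(article, summary):
    """Fraction of summary tokens that belong to some extractive fragment."""
    if not summary:
        return 0.0
    matched = sum(len(f) for f in extractive_fragments(article, summary))
    return matched / len(summary)


article = "the cat sat on the mat".split()
summary = "a cat sat down".split()
print(coverage(article, summary))
```

In this toy example only "cat sat" is shared with the article, so two of the four summary tokens are covered.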

Models and Performance Metrics
We focused on two widely-used baselines: BART (Lewis et al., 2020) and PEGASUS (Zhang et al., 2020).

Extractivity-Performance Trade-off
To examine the relationship between dataset extractivity and model performance, we first fine-tuned models on training data with different levels of extractivity. We then evaluated these models with the same test data to explore the extractivity-performance trade-off and further investigated how dataset extractivity influences model performance. Specifically, we split the training data into three equally-sized subsets by computing the coverage values for all data instances and sorting the training data based on these values. For each subset, we fine-tuned a separate model with the corresponding level of extractivity, resulting in three fine-tuned models. We call each of these trained models a triplet model. In addition, we fine-tuned a model on the full training data for comparison.
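The split described above can be sketched as follows. This is an illustrative reading, assuming one coverage score per training instance; the helper name `split_into_triplets` is hypothetical, and when the data size is not divisible by three the last subset absorbs the remainder.

```python
def split_into_triplets(examples, coverage_scores):
    """Sort examples by coverage and cut them into three subsets.

    Returns (T1, T2, T3) in order of increasing extractivity.
    """
    order = sorted(range(len(examples)), key=lambda i: coverage_scores[i])
    n = len(order)
    cut1, cut2 = n // 3, 2 * n // 3
    t1 = [examples[i] for i in order[:cut1]]
    t2 = [examples[i] for i in order[cut1:cut2]]
    t3 = [examples[i] for i in order[cut2:]]
    return t1, t2, t3


docs = ["a", "b", "c", "d", "e", "f"]
scores = [0.9, 0.1, 0.5, 0.3, 0.8, 0.6]
low, mid, high = split_into_triplets(docs, scores)
print(low, mid, high)
```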

Results and Analysis
The results are presented in Table 1. By comparing the outputs from triplet models, we learn that the extractivity of output summaries increases as the extractivity of training data increases. This aligns with previous empirical findings where limited abstractivity was shown in abstractive systems trained on highly extractive datasets (See et al., 2017; Zhang et al., 2018b). Overall, it is evident that dataset extractivity does indeed influence the extractivity of output summaries in trained abstractive models. It is therefore worthwhile to investigate the potential impact of dataset extractivity on model performance.
More importantly, we observed for both training datasets that, as the extractivity in training data increases, the performance metrics of BART and PEGASUS follow a similar pattern: model performance initially improves but subsequently drops when dataset extractivity becomes too high. By too high, we mean that there may exist a certain threshold for extractivity; once the extractivity surpasses this threshold, a decline in model performance may be observed. However, the magnitude of the performance gap between the metrics varies for different model-dataset pairs. This suggests that the impact of dataset extractivity on model performance depends on the degree of extractivity in training data. As different summarization datasets may have various degrees of extractivity (Bommasani and Cardie, 2020), a detailed analysis is required to understand how this impact may bring changes to model performance and to identify potential solutions to mitigate its negative effects. We provide this analysis next.

Continuity Analysis
Now that we have reached the conclusion that high extractivity can negatively influence abstractive model performance, our second question to answer is: How does dataset extractivity cause the performance of abstractive models to change?
After examining the model outputs, we formed a plausible hypothesis: extractivity could lead to an increased tendency or preference for an abstractive model to copy content continuously, potentially neglecting the inclusion of important content that should have been covered in a summary. To validate this hypothesis, it is necessary to develop new metrics to measure the continuity and salience of text spans that are continuously copied by the model. These metrics can be applied to the output summaries from all triplet models (as defined in Section 2) and the results can be compared to test the validity of our hypothesis.

Characterizing Continuity
We examined the summary continuity at the sentence level using two metrics which we designed to evaluate the proportion of the text in a summary that is continuously copied from the source document and the importance of these copied texts by comparing them with the ground-truth summary.
Given a source document $X = \{x_1, \dots, x_n\}$ to be summarized and an output summary $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_m\}$, both viewed as sequences of sentences, we define the set of continuous spans $C(X, \hat{Y})$ as the text spans in $\hat{Y}$ that are copied continuously from $X$. To obtain these continuous spans, we first identified whether a sentence in $\hat{Y}$ is copied from $X$ and then searched for any index subsets of copied sentences that are continuous. Specifically, for a given $\hat{y}_i \in \hat{Y}$ we calculated its similarity to each $x_j \in X$, and we marked the most similar $x_j$ corresponding to $\hat{y}_i$ as

$$x^*_i = \operatorname*{argmax}_{x_j \in X} \operatorname{sim}(\hat{y}_i, x_j).$$

Then we set a threshold $t$: if $\operatorname{sim}(\hat{y}_i, x^*_i) > t$, we treated $\hat{y}_i$ as having been copied from $x^*_i$ and recorded the index of $x^*_i$ in $X$ as $k_i$; otherwise, we regarded $\hat{y}_i$ as not having been copied from any sentence in $X$ and set $k_i$ to an arbitrary negative number, $-100$. After conducting this process for every $\hat{y}_i \in \hat{Y}$, we obtained a sequence of indices $K = \{k_1, \dots, k_m\}$. Lastly, we looped through the sequence $K$ and obtained all subsequences such that any non-negative element $k_i$ in a subsequence satisfies $k_i = k_{i+1}$ or $k_i = k_{i+1} - 1$. For example, $K = \{-100, 5, 5, 6, 1\}$ contains a continuous span $\{5, 5, 6\}$, which means that the 2nd to the 4th sentences in $\hat{Y}$ are copied continuously from the 5th and the 6th sentences in $X$. Note that we skipped subsequences whose first element is equal to the last element, so $K = \{-100, 5, 5, 5, 1\}$ does not contain any continuous span. This process is also shown in Algorithm 1. We computed two metrics using $C(X, \hat{Y})$: continuity and continuity salience.
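The span-detection procedure above can be sketched in a few lines. This is an illustrative reading, not the authors' Algorithm 1: the similarity function is passed in as a parameter, and the `token_overlap` helper (Jaccard overlap of tokens) is only a stand-in assumption. Spans are returned as 0-based, inclusive positions in the output summary.

```python
NOT_COPIED = -100  # marker for summary sentences not copied from the source


def token_overlap(a, b):
    """Stand-in similarity (Jaccard over lowercased tokens); an assumption."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def copied_indices(summary_sents, source_sents, sim, t):
    """Map each summary sentence to its most similar source index, or -100."""
    K = []
    for y in summary_sents:
        if not source_sents:
            K.append(NOT_COPIED)
            continue
        scores = [sim(y, x) for x in source_sents]
        best = max(range(len(scores)), key=scores.__getitem__)
        K.append(best if scores[best] > t else NOT_COPIED)
    return K


def continuous_spans(K):
    """Return (start, end) positions in K of continuously copied runs.

    A run grows while consecutive indices are equal or increase by one.
    Runs whose first and last source index coincide are skipped.
    """
    spans = []
    run = []  # positions in K that form the current run
    for i, k in enumerate(list(K) + [NOT_COPIED]):  # sentinel flushes last run
        if k >= 0 and run and k - K[run[-1]] in (0, 1):
            run.append(i)
            continue
        if run and K[run[0]] != K[run[-1]]:
            spans.append((run[0], run[-1]))
        run = [i] if k >= 0 else []
    return spans


print(continuous_spans([-100, 5, 5, 6, 1]))
print(continuous_spans([-100, 5, 5, 5, 1]))
```

The two calls reproduce the examples in the text: the first sequence contains one span covering the 2nd to 4th summary sentences, and the second contains none.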

Continuity
We define continuity as the percentage of continuous spans in the output summary, where the ratio is calculated based on sentence counts:

$$\mathrm{Continuity}(X, \hat{Y}) = \frac{\sum_{c \in C(X, \hat{Y})} |c|}{|\hat{Y}|}$$

where $|c|$ is the number of sentences in the continuous span $c$ and $|\hat{Y}|$ is the number of sentences in the output summary.

Continuity Salience
To quantify the salience of continuous spans, we propose continuity salience, which calculates the normalized ROUGE scores between the target summary and the text spans that a continuous span copies from $X$. This serves as an approximation of whether the copied text spans are important enough to be included in the summary. We denoted the target summary as $Y$ and used a mapping function $g$ to return the text spans in $X$ that a continuous span copies from. We used $\lVert \cdot \rVert$ to represent the word count of a sequence and AVG to represent the average function.
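As a small worked example of the continuity ratio, the sketch below assumes that spans are given as inclusive (start, end) sentence positions in the output summary and that continuity is the share of summary sentences lying inside such spans. Both the representation and the function name are assumptions made for illustration.

```python
def continuity(spans, num_summary_sentences):
    """Share of summary sentences that fall inside continuous spans."""
    if num_summary_sentences == 0:
        return 0.0
    copied = sum(end - start + 1 for start, end in spans)
    return copied / num_summary_sentences


# K = {-100, 5, 5, 6, 1}: one span covering summary sentences 2 to 4
# (positions 1..3 when counting from zero) out of 5 sentences.
print(continuity([(1, 3)], 5))
```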

Evaluating Continuity
We utilized the above-defined continuity and continuity salience to investigate the potential relationship between increasing dataset extractivity and the tendency of models to copy text more continuously, as well as whether the changes in the salience of continuous spans may contribute to the changes in model performance. The results are shown in Table 2. As one might reasonably expect, for both datasets and models, the continuity (Cont) increases as the training extractivity (TE) increases. This suggests that, with more overlap between source documents and their summaries, the model may learn to copy more directly from a source document and thereby fail to include the truly important content of the source document in the summary. This may explain why there are more continuous spans in the output summaries from the model trained on the dataset with higher extractivity, as evidenced by higher continuity scores. Hence, it can be inferred that when dataset extractivity increases, the model is more likely to copy, and once it starts to copy it tends to keep copying the following content.
However, in summarization done by humans, there is no inherent rule that important content appearing in a summary usually clusters together in a source document. As a result, a summary that contains too many continuous spans might inadvertently exclude important content that should have been included in the summary. This is supported by our experimental results, which show that as dataset extractivity increases, continuity salience first increases and then decreases. This also aligns with the change in model performance shown in Table 1. Connecting this to the observations in Section 2, we conclude that the more overlap there is between source documents and target summaries in training data, the more likely trained models are to copy content continuously, which results in low salience in continuous spans and hurts model performance when dataset extractivity is excessively high.

Mitigating the Negative Impact of High Extractivity on Model Performance
A key takeaway from previous sections in this paper is that high dataset extractivity can result in a tendency to excessively copy text from source documents while neglecting important content that should be included in a summary. To address this issue, a natural solution is to focus the model's attention on important content during training to encourage the model to cover the important content when copying during inference. To achieve this, we first identified a mapping between tokens in a source document X and extractive fragments in the target summary Y, where extractive fragments are the set of shared sequences of tokens in X and Y (Grusky et al., 2018). It should be noted that extractive fragments comprise all tokens the model can copy from the source document. Then, we implemented a copy mechanism (See et al., 2017) and transformed the above mapping into the label for the copy distribution so that the model can attend to important texts when copying text from source documents.

Identifying Important Content to Copy
The process is illustrated in Figure 1. Essentially, this process was formulated as a simple optimization problem, with the goal of using the least number of sentences in X to cover all extractive fragments. We focused on extractive fragments since they are the tokens that a model can copy. The decision to select the least number of sentences in X was made because we usually require a summary to be concise, i.e., it would be preferable if fewer sentences were needed to cover extractive fragments.
Figure 1: An example from the CNN/DM dataset that shows how to set up the optimization for identifying important content to copy. We highlight different fragments with different colors and use squares with the corresponding colors on the left of the matrix A to represent the bipartite mapping between a fragment and sentences in the source document. For example, the first row in A means the fragment Sunderland appears in the 1st and 4th sentences in the source document. After solving the optimization problem, the optimal solution $x^*$ indicates that to cover all fragments we only need to select the last three sentences.

We first converted the extractive fragments into a bipartite mapping between fragments in the target summary $Y$ and sentences in the source document $X$, and represented the mapping in a binary matrix $A$ with each row corresponding to a fragment and each column corresponding to a sentence in $X$. We set $A_{ij}$ to 1 if the $i$-th fragment appears in the $j$-th sentence in $X$; otherwise $A_{ij}$ is set to 0. A fragment can appear in multiple sentences. Then we defined a binary vector $x$ of size $n$, the number of sentences in $X$, where each element $x_i$ indicates whether the $i$-th sentence in $X$ should be selected. The optimization problem is formalized as the following integer linear program (ILP):

$$\min_{x} \; \mathbf{1}^{\top} x \quad \text{s.t.} \quad A x \geq \mathbf{1}, \quad \mathbf{0} \leq x \leq \mathbf{1},\; x \in \mathbb{Z}^{n} \qquad (1)$$

In this ILP, $\mathbf{1}$ denotes the all-one vector and $\mathbb{Z}$ represents the set of integers. Our objective function counts the number of selected sentences in the source document $X$. The first constraint guarantees that all fragments are covered and the second constraint ensures that each sentence is selected either once or never. We can prove that this optimization is always feasible. The proof is included in Appendix Section A.1.
Proposition 1 The optimization problem (1) is always feasible.
Figure 2: The same example from CNN/DM as used before, this time to show the gold copied token and silver copied tokens for the fragment Sunderland, as defined in Section 4.2. The highlighted Sunderland is its gold copied token and the tokens highlighted in silver are its silver copied tokens, including Sunderland.

By solving this ILP, the optimal solution $x^*$ indicates which sentences in X should be selected; these sentences can be further regarded as important content that a model should copy.
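The sentence-selection problem can be illustrated on a toy matrix. The sketch below does not call an ILP solver. It swaps in an exhaustive search over sentence subsets of increasing size, which returns an optimal cover but is only practical for very small inputs. The function name `select_sentences` and the example matrix are hypothetical.

```python
from itertools import combinations


def select_sentences(A):
    """Find a smallest set of sentences covering every fragment.

    A[i][j] == 1 iff fragment i appears in source sentence j.
    Returns a 0/1 selection vector, or None if some fragment
    appears in no sentence (an all-zero row).
    """
    n_frag = len(A)
    n_sent = len(A[0]) if A else 0
    for size in range(n_sent + 1):              # try smaller covers first
        for chosen in combinations(range(n_sent), size):
            covered = all(any(A[i][j] for j in chosen) for i in range(n_frag))
            if covered:
                return [1 if j in chosen else 0 for j in range(n_sent)]
    return None


# Three fragments, four source sentences (toy data).
A = [
    [1, 0, 0, 1],  # fragment 0 appears in sentences 0 and 3
    [0, 1, 0, 0],  # fragment 1 appears only in sentence 1
    [0, 0, 1, 1],  # fragment 2 appears in sentences 2 and 3
]
print(select_sentences(A))
```

For documents with many sentences, a dedicated integer programming solver would be the realistic choice, since the search above grows exponentially with the number of sentences.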

Creating Copy Labels
Our next goal is to transform the optimal solution to the ILP into labels to correct the model's copy behaviors. In this context, a source document $X$ is considered as a sequence of tokens $\{x^w_1, x^w_2, \dots, x^w_a\}$ and the target summary $Y$ is represented as a token sequence $\{y^w_1, y^w_2, \dots, y^w_b\}$. Our target copy labels are represented as a matrix $M \in \mathbb{R}^{b \times a}$ where $M_{ij}$ indicates the importance of the source token $x^w_j$ for the model to copy in order to generate the target token $y^w_i$. Given a target token $y^w_i$ that appears in the extractive fragment $f$, we first referred to the optimal solution $x^*$ and the bipartite mapping $A$ to find the selected source sentence $x^{sent}_f$ that covers the fragment $f$. We then defined two categories for source tokens $\{x^w_1, x^w_2, \dots, x^w_a\}$ corresponding to the target token $y^w_i$, as shown in Figure 2.

Gold Copied Tokens
We define gold copied tokens as the source tokens that appear in the selected sentence $x^{sent}_f$ and are also present in the extractive fragment $f$. These gold copied tokens are the expected source tokens from which the target token $y^w_i$ should be copied, based on the optimal solution $x^*$. We used the function GOLDCOPY($x^w_j$, $y^w_i$) to represent whether a source token $x^w_j$ is a gold copied token for the target token $y^w_i$: GOLDCOPY($x^w_j$, $y^w_i$) = 1 if it is, and 0 otherwise.

Silver Copied Tokens
We define silver copied tokens as the source tokens that appear in the selected sentence $x^{sent}_f$. Similarly, the function SILVERCOPY($x^w_j$, $y^w_i$) returns 1 if $x^w_j$ is a silver copied token for the target token $y^w_i$, and 0 otherwise.
Additionally, we initialized $M$ based on the ROUGE scores $R$ between the two sentences to which $x^w_j$ and $y^w_i$ belong, for all pairs of source and target tokens. That is, $R_{ij} = \mathrm{ROUGE}(x_m, y_n)$ where $x^w_j \in x_m$ and $y^w_i \in y_n$. Finally, we obtained the copy labels $M$ (Equation 2), where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters. Please note that for target tokens that do not appear in any fragments, no source tokens are their gold or silver copied tokens, as there is no way for the model to generate them via copying. As we intended to adopt $M$ as the ground-truth labels to guide the model to copy correct content, it is crucial that the model focuses the most on gold copied tokens when copying fragments, and then attends more to the silver copied tokens than to the rest of the tokens. Therefore, we set $\lambda_1 > \lambda_2 > \lambda_3$.
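The gold and silver indicators can be sketched for a single fragment as follows. This is an illustrative reading of the definitions above, assuming the source is given as a list of tokenized sentences and that gold tokens are the contiguous occurrences of the fragment inside the selected sentence. The name `copy_masks` is hypothetical, and the weighting of the indicators by the hyper-parameters is not reproduced here.

```python
def copy_masks(source_sents, selected_sent, fragment):
    """Return (gold, silver) 0/1 lists over the flattened source tokens.

    source_sents  : list of sentences, each a list of tokens
    selected_sent : index of the selected sentence covering the fragment
    fragment      : list of tokens forming the extractive fragment
    """
    gold, silver = [], []
    for s, sent in enumerate(source_sents):
        in_selected = (s == selected_sent)
        hits = set()
        if in_selected:
            n = len(fragment)
            # Assumption: gold tokens are contiguous fragment occurrences.
            for start in range(len(sent) - n + 1):
                if sent[start:start + n] == fragment:
                    hits.update(range(start, start + n))
        for pos in range(len(sent)):
            silver.append(1 if in_selected else 0)
            gold.append(1 if pos in hits else 0)
    return gold, silver


source = [["Sunderland", "won"], ["fans", "cheered", "Sunderland"]]
gold, silver = copy_masks(source, selected_sent=1, fragment=["Sunderland"])
print(gold)
print(silver)
```

Note that the occurrence of the fragment in the unselected first sentence is neither gold nor silver, while every token of the selected sentence is silver.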

Training with Copy Mechanism
It has been widely shown that a copy mechanism can be interpreted as modeling copy distributions. Numerous previous studies have also demonstrated the effectiveness of the copy mechanism in abstractive summarization (See et al., 2017; Xu et al., 2020; Li et al., 2021). Due to limited space, we refer readers to See et al. (2017) for more details.
Inspired by their work, we utilized a copy mechanism and took the encoder-decoder attention based on the last encoder and decoder hidden layers as the copy distribution:

$$\alpha_{ij} = \operatorname{softmax}_j\left(\frac{(W_d h^d_i)^{\top} (W_e h^e_j)}{\sqrt{d_k}}\right)$$

where $W_e$ and $W_d$ are weight matrices for encoder and decoder hidden states respectively, $h^e_j$ is the $j$-th hidden state of the encoder, $h^d_i$ is the $i$-th hidden state of the decoder, and $d_k$ is the dimension of hidden states. Note that for the multi-head attention, we calculated the copy distribution as the average over the heads. Finally, in order to guide the model to copy correctly, we adopted an auxiliary loss function that encourages consistency between the overall copy distribution $\alpha$ and the copy labels $M$ based on the Kullback-Leibler (KL) divergence (Equation 5), where $P(y^w_i)$ is the likelihood of the target token $y^w_i$, $\lVert Y \rVert$ is the length of the target token sequence, and $\lambda$ is a hyper-parameter.
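A minimal numerical sketch of the copy distribution and a KL-based consistency term is given below, in plain Python. It rests on several assumptions that the text does not fix: the hidden states are taken to be already projected by the encoder and decoder weight matrices, each row of the copy labels is normalized into a distribution, the divergence is taken from labels to attention, and the result is averaged over target tokens. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def copy_distribution(dec_states, enc_states, d_k):
    """Scaled dot-product attention of each decoder state over encoder states.

    Assumes the states are already projected; returns alpha[i][j].
    """
    scale = math.sqrt(d_k)
    return [
        softmax([sum(a * b for a, b in zip(d, e)) / scale for e in enc_states])
        for d in dec_states
    ]


def kl_copy_loss(M, alpha, eps=1e-12):
    """Average KL(normalized label row || attention row) over target tokens.

    The KL direction and the row normalization are assumptions.
    Rows of M that sum to zero carry no supervision and are skipped.
    """
    if not M:
        return 0.0
    total = 0.0
    for m_row, a_row in zip(M, alpha):
        z = sum(m_row)
        if z == 0:
            continue
        for m, a in zip(m_row, a_row):
            p = m / z
            if p > 0:
                total += p * math.log(p / (a + eps))
    return total / len(M)


alpha = copy_distribution([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], d_k=2)
labels = [[5.0, 1.0]]  # toy label row favoring the first source token
print(alpha, kl_copy_loss(labels, alpha))
```

In training, a term of this kind would be added to the usual generation loss with a weight such as the hyper-parameter mentioned above.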

Results and Analysis
We implemented our proposed strategy for mitigating the extractivity-performance trade-off in BART and refer to it as BART + Copy Labels. We evaluated our method on CNN/DM and arXiv/PubMed.

Results on CNN/DM
In addition to BART and PEGASUS, we compared our method to several competitive baselines that also use a copy mechanism:
• SAGCopy (Xu et al., 2020) fine-tunes MASS (Song et al., 2019) by incorporating the importance scores for source words into the copying mechanism.
• PALM (Bi et al., 2020) incorporates the copy mechanism into the pre-training model.
• CoCoNet (Li et al., 2021) enhances the copying mechanism by encouraging the model to copy the input word that is relevant to the previously copied one.
From Table 3 we can see that our method has superior performance compared to other baselines. We believe this is because our copy labels provide more effective supervision for the copy distribution. Additionally, we performed an extractivity analysis and the results are shown in Table 4. It can be observed that adding a copy mechanism with the proposed copy labels does not result in significant changes in the output extractivity, but lowers the continuity in output summaries. Meanwhile, since our copy labels encourage the model to focus on important content when copying from source documents, the continuity salience of our method is higher than that of other baselines, which shows the effectiveness of our method in mitigating the negative impact of high dataset extractivity. A case study is shown in Appendix Section A.3.

Results on arXiv/PubMed
We conducted similar evaluations on arXiv/PubMed, and the results are shown in Table 5 and Table 6. These results align with the findings from the CNN/DM evaluations, in that our copy labels can improve the model's performance, as demonstrated by the ROUGE scores. In addition, the extractivity analysis further supports the conclusion that our proposed copy labels can guide the model to copy important content, as reflected in the increase in continuity salience.

Extractivity in Summarization
Bommasani and Cardie (2020) studied the quality of summarization datasets and found that a high degree of extractivity is present in many datasets. As existing datasets display significant amounts of extractivity, it is necessary to investigate how a model trained on such data may be influenced by extractivity. Similar to our approach, Ladhak et al. (2022) examined the effect of extractivity on the faithfulness of models. They showed that an increase in extractivity improves the faithfulness of the model but also that a trade-off exists between abstractivity and faithfulness. Our work focuses on the relationship between dataset extractivity and model performance.
Zhang et al. (2018a) explored the abstractivity of summarization models and found that abstractive models exhibit near-extractive behaviors in practice. However, their analysis is constrained to the CNN/DM dataset, which we also used, and RNN-based models, whereas our approach extends to arXiv/PubMed and transformer-based pre-trained models. Kryściński et al. (2018) proposed methods to increase abstractivity in the output. Unlike their work, our work introduces and evaluates a method to mitigate the negative impacts of high dataset extractivity on model performance.

Conclusion
In this work, we have explored how the amount of extractivity in summarization training datasets influences the performance of abstractive models. By comparing model performance under different levels of dataset extractivity, we showed that low levels of extractivity can improve model performance, while the impact becomes negative when extractivity is high. Furthermore, with the analysis of the model's copy continuity of content, we found that high dataset extractivity encourages the model to copy text continuously, which can cause models to ignore important content. In order to mitigate these negative effects, we presented a novel, simple, and effective strategy that creates labels for fixing the model's copying behaviors. By training the model with a copy mechanism and our copy labels, our experimental results show that our method can effectively alleviate the harm resulting from dataset extractivity and outperforms several competitive baselines.

Limitations
Our work has the following limitations. First, our analysis of continuity and continuity salience only focused on the sentence level. This is limiting since an actual continuous span may cover only part of the tokens in an identified copied sentence. In addition, we only utilized string-based overlap, i.e., ROUGE, for salience estimation. This can be limiting since semantic salience may not be captured. Furthermore, even though our method can alleviate the negative impact of high dataset extractivity, it may not fully address this issue. In the future, we plan to extend our analysis to token-based identification of continuous spans and semantics-based measurement for more accurate continuity quantification.

Ethics Statement
The progress in deep neural network architectures and the availability of large pre-trained language models have led to significant advancements in single-document summarization. However, current state-of-the-art natural language processing (NLP) solutions still face challenges in consistently generating factual and faithful summaries without any instances of hallucination (Maynez et al., 2020). Therefore, it is imperative to acknowledge that our proposed solution, like previous approaches, is not yet suitable for deployment as it does not specifically address the issue of hallucination. To bridge this gap, future research efforts should prioritize the development of more effective evaluation measures and solutions for text summarization, aiming to ensure highly faithful summaries that accurately represent the source content and enhance the overall trustworthiness of summarization systems. Additionally, in the case of applying the proposed method to sensitive data domains such as medical patient records and legal documents, it becomes essential to incorporate privacy-preserving policies to safeguard the confidentiality of personal information (Da Silva et al., 2006). These measures are critical to instill confidence in the practical implementation of text summarization techniques.

A Appendices
A.1 Proof of Proposition 1
According to the definition of extractive fragments (Grusky et al., 2018), a fragment must appear in at least one of the sentences in a source document; otherwise there would be no overlap between the source document and the target summary for that fragment. Therefore, each row in the matrix A must contain at least one 1 for the row to correspond to a valid fragment.
Setting x to the all-one vector 1, the product A • 1 counts the number of 1s in each row of A. Since each row of A contains at least one 1, setting x to the all-one vector 1 satisfies all constraints, so 1 is always a feasible solution to our ILP. Therefore, the optimization problem (1) is always feasible.

A.2 Implementation Details
Models are implemented with the PyTorch framework (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2020). We initialized BART with "facebook/bart-large-cnn" for CNN/DM, and with "facebook/bart-large" for arXiv/PubMed. As for PEGASUS, we chose "google/pegasus-large" for both datasets. We trained the models using Adam optimization with a learning rate of 5e-5 and a weight decay of 0.01. The learning rate is updated using a polynomial decay schedule and the number of warm-up steps is set to 500. We trained models for 10 epochs and used early stopping with a patience of 5. All experiments were performed on an NVIDIA GeForce RTX 3090, and it took about 4 days for 1 epoch on both datasets. The hyper-parameters in Equation 2 are set as λ1 = 5, λ2 = 2, λ3 = 1. λ in Equation 5 is set to 0.4, selected from the range [0.3, 0.7] based on performance on validation sets.
During inference, we used beam search with a beam size of 4, a length penalty of 2, and a non-repeat n-gram size of 3. For evaluation, we used the ROUGE Python package.
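For readers who want to reproduce the decoding setup, the inference settings above can be written as keyword arguments. The key names below follow the arguments of the Hugging Face Transformers `generate` method. That the authors passed the settings in exactly this form is an assumption, and the dictionary name is illustrative.

```python
# Decoding settings from the text, keyed by Hugging Face `generate` arguments.
GENERATION_KWARGS = {
    "num_beams": 4,             # beam size
    "length_penalty": 2.0,      # length penalty
    "no_repeat_ngram_size": 3,  # non-repeat n-gram size
}

print(GENERATION_KWARGS)
```

With a loaded sequence-to-sequence model and tokenized inputs, these would typically be supplied as `model.generate(**inputs, **GENERATION_KWARGS)`.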

A.3 Case Study
We presented generated examples from our model and BART in Table 7. We highlighted the text that BART copied from the source document in yellow, and we bolded the text that our model copied. By comparing to the target summary, we can observe that BART copied longer continuous spans and that these continuous spans contain content that should not be covered in the target summary, such as "MacLaren was announced as director of the movie in November". By contrast, the output from our model does not cover the text that should not appear in the summary. However, both BART and our model still fail to generate the important content about "MacLaren left the project over "creative differences"". This indicates that there is still room for improvement in our model.

Table 2: Empirical results of continuity and continuity salience under different levels of extractivity. DE denotes dataset extractivity. T1, T2, and T3 represent the triplet subsets we split.

Table 3: Experimental results on CNN/DM. † indicates the results are from our implementation, and ‡ means the results are taken from the corresponding papers.

Table 5: Experimental results on arXiv/PubMed. † indicates the results are from our implementation.
Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP).
Jin-ge Yao, and Rui Yan. 2018b. On the abstractiveness of neural document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328-11339. PMLR.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.