Learn to Copy from the Copying History: Correlational Copy Network for Abstractive Summarization

The copying mechanism has had considerable success in abstractive summarization, enabling models to copy words directly from the input text into the output summary. Existing works mostly employ encoder-decoder attention, which applies copying at each time step independently of the former ones. However, this may sometimes lead to incomplete copying. In this paper, we propose a novel copying scheme named Correlational Copying Network (CoCoNet) that enhances the standard copying mechanism by keeping track of the copying history. It thereby takes advantage of prior copying distributions and, at each time step, explicitly encourages the model to copy the input word that is relevant to the previously copied one. In addition, we strengthen CoCoNet through pre-training with suitable corpora that simulate the copying behaviors. Experimental results show that CoCoNet can copy more accurately and achieves new state-of-the-art performances on summarization benchmarks, including CNN/DailyMail for news summarization and SAMSum for dialogue summarization. The code and checkpoints will be made publicly available.


Introduction
Text summarization techniques (Rush et al., 2015; Chopra et al., 2016; Zhou et al., 2017; Li et al., 2019; Yuan et al., 2020) aim to generate a condensed and cohesive version of the input text, enabling readers to grasp the main points without reading the full text. There are two types of summarizers: extractive and abstractive. Extractive methods produce a summary by taking important sentences from the original text and combining these extracts, while abstractive methods involve interpreting and paraphrasing the input when generating a summary. The latter is closer to how humans summarize a text, but it is far more challenging to achieve.

Dialogue
Ernest: hey Mike, did you park your car on our street?
Mike: no, took it into garage today
Ernest: ok good
Mike: why?
Ernest: someone just crashed into a red Honda looking just like yours
Mike: lol lucky me
Summary
Mike took his car into garage today. Ernest is relieved as someone had just crashed into a red Honda which looks like Mike's.
Table 1: An example from the dialogue summarization task. Highlighted words are copied consecutively from the input. Previously copied words (such as "just crashed") can guide the following copying operations (such as "into a red Honda").
Currently, the sequence-to-sequence (Seq2Seq) framework has become the mainstream approach to abstractive summarization. However, it struggles with out-of-vocabulary (OOV) words. As it has been observed that some words in the input text reappear in the summary, one way of coping with the OOV issue is to extract words from the input text and incorporate them into the abstractive summary. Following this strategy, existing works (Gu et al., 2016; See et al., 2017) propose the copying mechanism, which copies words from the input sequence to form part of the summary. These models generally regard the encoder-decoder attention as the copying distribution, which we call "attentional copying". They perform copying at each time step independently of the former ones, neglecting the guidance of the copying history. Our work demonstrates that the copying history can provide crucial clues about copying behavior at the following time steps and thereby encourage the summarizer to copy more accurately. For example, in Table 1, assuming the source words "a red" have been copied, the next copying operation for the following word "Honda" can be explicitly induced.
In this paper, we propose a novel copying architecture named Correlational Copying Network (CoCoNet) that can learn to copy from the copying history. We build CoCoNet on the Transformer-based Seq2Seq architecture (Vaswani et al., 2017), which has shown superiority in various text generation tasks, such as machine translation and text summarization. More specifically, CoCoNet copies from the input text at each time step by selecting what is relevant to the previously copied word. It keeps track of the prior copying distribution and explicitly models the correlation between different source words by integrating semantic and positional correlations. We obtain the semantic correlations from the encoder self-attention matrix, following Xu et al. (2020b). Inspired by Yang et al. (2018), we represent positional correlations as a Gaussian bias, which considers the relative distances between source words and the scope of the local context when copying. The framework of our model is shown in Figure 1.
Furthermore, we enhance CoCoNet through pre-training with a self-supervised objective of text span generation with copying on raw text corpora. Motivated by prior work showing that pre-training objectives resembling the downstream task lead to better and faster fine-tuning, we make sure our pre-training simulates the copying behaviors desired for the downstream summarization tasks. We divide each sequence in the corpora into two spans with some overlapping words, and the first span is used to generate the second in pre-training. We measure the overlap between the two spans with ROUGE scores (Lin, 2004) to ensure that there are enough words to be generated by copying.
Our main contributions are as follows: • We propose a Correlational Copying Network (CoCoNet) for abstractive summarization. It tracks the copying history and copies the next word from the input based on its relevance to the previously copied one.
• We further enhance CoCoNet's learning of copying through self-supervised pre-training on text span generation with copying.
• CoCoNet achieves new state-of-the-art performances on news summarization and dialogue summarization tasks, and experimental results show that CoCoNet can copy more accurately.
Related Work

Copying Mechanism
The copying mechanism is widely used in abstractive summarization. It allows models to directly copy words from the input to the output. Vinyals et al. (2015) present the pointer network, which uses the attention distribution to select tokens from the input sequence as the output. Luong et al. (2015) propose to copy source words to the target sentence via a fixed-size softmax layer over a relative copying range. Gulcehre et al. (2016) leverage the attention mechanism to predict the location of the word to copy and apply a copying gate to determine whether to copy or not. Gu et al. (2016) propose to predict output words by combining copying and generating modes through a shared softmax function. See et al. (2017) introduce a copying probability to incorporate the copying and generating distributions dynamically. Bi et al. (2020) adopt the copy mechanism in language model pre-training. Existing works do not attempt to calculate the copying distributions based on the copying history, which is our focus.

Temporal Attention Mechanism
Our proposed copying mechanism is partially inspired by the temporal attention mechanism (Sankaran et al., 2016), which keeps track of previous attention scores and adjusts the future attention distribution by normalizing with historical attention scores. This mechanism has been proven effective in text summarization. Similar ideas are adopted by the coverage mechanism for image captioning (Xu et al., 2015), machine translation (Tu et al., 2016), and text summarization (See et al., 2017), maintaining a coverage vector that records the attention history to compute future attention distributions. The temporal attention mechanism is designed to avoid repetitive or insufficient attention, while our work aims to learn a better copying mechanism from the copying history.

Overview
The input of the text summarization task is a longer text, x = (x_1, x_2, ..., x_S), of S tokens, and the output is a condensed summary, y = (y_1, y_2, ..., y_T), of T tokens. The hypothesis of our proposed CoCoNet is that the standard attentional copying mechanism can be enhanced by the copying history. For example, a source word that is relevant to the previously copied one is more likely to be copied at the current time step. We further pre-train CoCoNet with the objective of text span generation with copying, which aims to strengthen the learning of the copying mechanism.

Transformer-based Seq2Seq Model
We adopt the Transformer-based Seq2Seq architecture (Vaswani et al., 2017). The encoder of the Transformer is a stack of N identical blocks, each consisting of two sublayers: a self-attention layer and a feed-forward layer. The encoder reads and converts the input sequence into the encoder hidden states h^enc:

h^enc = Encoder(x)

The decoder has a similar structure, stacking M identical blocks consisting of a self-attention layer, an encoder-decoder attention layer, and a feed-forward layer. The decoder hidden state h^dec_t is generated given the encoder hidden states and the previously generated words, and the generation distribution is then computed from h^dec_t:

h^dec_t = Decoder(y_{<t}, h^enc)
P^gen_t(w) = softmax(W_o h^dec_t)

The maximum likelihood (ML) training objective minimizes the negative log-likelihood of the parameters:

L = − Σ_{t=1}^{T} log P(y_t | y_{<t}, x)

Attentional Copying Mechanism
The copying mechanism helps the model predict output words by integrating the copying and generating distributions:

P_t(w) = λ_t P^attCopy_t(w) + (1 − λ_t) P^gen_t(w)    (5)

where λ_t denotes the copying probability, and P^attCopy_t(w) denotes the existing copying distribution, generally represented by existing works as the encoder-decoder attention:

α_t = softmax(Q_t K^T / √d_k)
P^attCopy_t(w) = Σ_{i: x_i = w} α_{t,i}

where d_k denotes the number of columns of the query matrix Q_t. Note that for multi-head attention, we obtain the copying distribution by averaging over the heads.
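As a concrete illustration, the mixture above can be sketched in a few lines of NumPy. This is a minimal hypothetical example, with raw attention scores standing in for Q_t K^T / √d_k: summing the attention mass of all source positions that hold the same token turns the attention over positions into a copy distribution over the vocabulary, which is then interpolated with the generation distribution.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attentional_copy(scores, source_ids, vocab_size):
    """Turn encoder-decoder attention scores over source positions into a
    copy distribution over the vocabulary, summing the attention mass of
    every position that holds the same token id."""
    attn = softmax(scores)                 # alpha_{t,i} over source positions
    p_copy = np.zeros(vocab_size)
    for i, tok in enumerate(source_ids):
        p_copy[tok] += attn[i]
    return p_copy

def mix(p_copy, p_gen, lam):
    """Final distribution: lambda_t * copy + (1 - lambda_t) * generate."""
    return lam * p_copy + (1.0 - lam) * p_gen
```

Both inputs to `mix` are probability distributions, so the result is one as well, for any copying probability λ_t in [0, 1].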

Correlational Copying Mechanism
We propose a correlational copying mechanism that takes advantage of prior copying distributions and, at each time step, explicitly encourages the model to copy the input word that is relevant to the previously copied one. Our hypothesis comes from the observation that a cohesive summary typically exhibits reasonable language modeling over copied content, especially for important contents. For example, a source word that is relevant to the previously copied one is more likely to be copied at the current time step. As illustrated in Table 1, the previously copied words "just crashed" are indicative of the following copied words "into a red Honda". Therefore, we propose to explicitly learn this language modeling for copying. We maintain a correlational copying distribution transferred from the last copying distribution based on the correlation between source words:

P^coCopy_t(x_j) = Σ_{i≠j} rel_t(x_j, x_i) · P^finalCopy_{t−1}(x_i)    (11)

where P^coCopy_t(w) denotes the correlational copying distribution, and P^finalCopy_t is the final copying distribution used to predict output words, serving as P^attCopy_t in Equation 5. rel_t(x_j, x_i) denotes the correlation score between source words x_j and x_i, integrating the semantic correlation s_{j,i} and the positional correlation p_{j,i}, which we introduce later. This process can be regarded as one transition step in a Markov chain, where the correlation matrix is analogous to the transition matrix. Note that there is no self-transfer in the correlational copying distribution, so a word that already has a high copy score will not be copied repetitively.
Then, the correlational copying distribution is used to adjust the current copying distribution, which informs the model of the previously copied word when determining which word to copy next.
P^coCopy_t is initialized as a zero vector. At the next time step, the final copying distribution P^finalCopy_t serves as P^finalCopy_{t−1} in Equation 11. In this way, the copying history is maintained recurrently.

Semantic Correlation

Xu et al. (2020b) propose to obtain a centrality score for each source word based on the last encoder self-attention layer. Following this work, we represent the semantic correlation between source words by the encoder self-attention weight:

s_{j,i} = attn^enc_{j,i}

where attn^enc_{j,i} is the attention weight between source words x_j and x_i in the last encoder self-attention layer, averaged over heads.
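The transition in Equation 11 amounts to a matrix-vector product over source positions. The NumPy sketch below is illustrative with hypothetical names; the column normalization is one plausible way to keep the result a distribution and is our assumption, not necessarily the paper's exact formulation.

```python
import numpy as np

def correlational_copy(prev_final_copy, rel):
    """One Markov-chain-style transition: redistribute the previous final
    copy distribution over source positions through the correlation matrix.
    rel[j, i] is the correlation score rel_t(x_j, x_i); the diagonal is
    zeroed so that no position transfers copy mass to itself."""
    rel = rel.copy().astype(float)
    np.fill_diagonal(rel, 0.0)                    # no self-transfer
    rel = rel / rel.sum(axis=0, keepdims=True)    # each source word passes on all its mass
    return rel @ prev_final_copy                  # P^coCopy_t over source positions
```

Because each column of the normalized matrix sums to one, the total copy mass is preserved across the transition, matching the Markov-chain analogy in the text.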

Positional Correlation
Inspired by Yang et al. (2018), we represent the positional correlation as a Gaussian bias, which considers the relative distance between source words and the range of the local context suitable for copying:

p_{j,i} = exp(−(pst_j − pst_i)^2 / (2 δ_j^2))

where pst_j and pst_i denote the positions of source words x_j and x_i, respectively, and δ_j denotes the standard deviation, which is conditioned on the length of the source sequence, i.e., |x|. Different from Yang et al. (2018), we do not use a predicted central position, because we argue that relative-position information is strongly associated with word correlations. In addition, following Shaw et al. (2018), we apply relative distance clipping to improve the generalization of our model.
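A minimal sketch of this Gaussian bias follows; the choice δ = |x|/2 is an illustrative placeholder for however δ_j is actually conditioned on the sequence length, and the function name is hypothetical.

```python
import numpy as np

def positional_correlation(seq_len, clip=16, sigma=None):
    """Gaussian bias over (clipped) relative distances between source
    positions: p[j, i] = exp(-d(j, i)^2 / (2 * sigma^2))."""
    if sigma is None:
        sigma = seq_len / 2.0                  # hypothetical length-conditioned deviation
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :]).astype(float)
    dist = np.minimum(dist, clip)              # relative distance clipping (Shaw et al., 2018)
    return np.exp(-dist ** 2 / (2.0 * sigma ** 2))
```

The bias is maximal at zero distance, decays smoothly with distance, and flattens out beyond the clipping threshold, so very distant positions are treated uniformly.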

Correlational Copying Pre-training (CoCoPretrain)
Pre-training with self-supervised objectives on raw text corpora has demonstrated effectiveness across a broad range of text generation tasks (Song et al., 2019; Dong et al., 2019; Lewis et al., 2020). In this paper, we enhance CoCoNet through correlational copying pre-training (CoCoPretrain) on text span generation. The process of constructing pre-training data suitable for correlational copying is as follows, and an example is shown in Figure 2. We first divide each sequence in the raw corpora into two continuous spans, and the first, longer span is used to generate the second in pre-training. We select the input text span and the output span that follows it so as to maximize the overlap between input and output. In this way, our CoCoPretrain objective can also be called overlapped text span generation.

Figure 2: The process of constructing the pre-training data. Given a piece of text, we divide it into an input span and an output span, and we calculate their overlap score by Equation 22. The top-K scored span pairs are selected.
As the measure of overlap, we adopt the ROUGE F1 score (Lin, 2004) between the input and output text spans. When calculating the ROUGE score, we consider ROUGE-1, ROUGE-2, ROUGE-L, and combinations of them:

overlap = λ_1 · ROUGE-1 + λ_2 · ROUGE-2 + λ_3 · ROUGE-L    (22)

Specifically, for a fair comparison, we use the same pre-training data as BART (Lewis et al., 2020) as the source corpus for CoCoPretrain. We set the lengths of the input span and the output span to 128 and 32, respectively. After ranking with the ROUGE score, we select the top 20M samples as our final pre-training data.
We believe this data selection strategy for pre-training ensures that enough output words can be generated by copying from the input, which resembles the downstream task and helps the model learn our proposed correlational copying mechanism better.
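The span-pair scoring can be sketched as follows. This is an illustrative Python implementation using a simplified n-gram ROUGE F1 (not the official toolkit); ROUGE-L is omitted for brevity, and the default weights mirror the "ROUGE-1 + 2 × ROUGE-2" setting reported later.

```python
from collections import Counter

def rouge_n_f1(cand, ref, n):
    """Simplified ROUGE-n F1 between two token lists, based on clipped
    n-gram count overlap."""
    c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((c & r).values())       # clipped matching n-gram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def overlap_score(input_span, output_span, l1=1.0, l2=2.0):
    """Weighted ROUGE combination used to rank candidate span pairs."""
    return (l1 * rouge_n_f1(input_span, output_span, 1)
            + l2 * rouge_n_f1(input_span, output_span, 2))
```

Ranking all candidate span pairs by this score and keeping the top-K yields pre-training pairs with plenty of copyable words.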

Dataset
For downstream applications, we conduct experiments on the news summarization task with CNN/DailyMail dataset and on the dialogue summarization task with SAMSum dataset.
The CNN/DailyMail dataset (Hermann et al., 2015) contains 312K news articles paired with multi-sentence summaries. We use the non-anonymized version of See et al. (2017), which has 287,226 training samples, 13,368 validation samples, and 11,490 test samples.
The SAMSum dataset (Gliwa et al., 2019) contains 16K chat dialogues with manually annotated summaries, split into 14,732 training samples, 818 validation samples, and 819 test samples. We use the version of the dataset with an artificial separator (Gliwa et al., 2019), in which utterances are separated with "|".

Experimental Settings
For simplicity, we warm-start the model parameters with the publicly released pre-trained BART (large) model, which has 12 layers in both the encoder and the decoder and a hidden size of 1024. The learning rate is set to 3e-5, and learning rate decay is applied. We use the Adam optimizer with β_1 = 0.9, β_2 = 0.999, and ε = 10^−8. We use dropout with a probability of 0.1 and gradient clipping of 0.1. The remaining hyper-parameters are set to the values used in BART. We use a clipping distance of 16 when computing the positional correlation. Our experiments are conducted on 8 NVIDIA A100 GPUs. We continually pre-train our model with CoCoPretrain, which converges within 1M steps using a batch size of 8000. During decoding, we use beam search with a beam size of 4.

Experimental Results
We evaluate our model with the official ROUGE toolkit (Lin, 2004). We report the F1 score of ROUGE-1, ROUGE-2, and ROUGE-L. Table 2 and Table 3 show the results on CNN/DailyMail and SAMSum dataset, respectively.

Results on CNN/DailyMail
The first block in Table 2 displays the results of models without pre-training.
• Lead-3 is a baseline that simply selects the first three sentences of the input document.
• PGNet (See et al., 2017) is a hybrid pointergenerator model applying an attentional copy mechanism.
• DRM (Paulus et al., 2018) is a deep reinforced model with an intra-attention mechanism.
• Bottom-Up (Gehrmann et al., 2018) introduces a content selector that identifies which phrases in the document should be included in the summary. The copying is then constrained to the selected phrases.
• DCA (Celikyilmaz et al., 2018) is a reinforcement learning model with deep communicating agents, each of which encodes a subsection of the input text.
The second block shows the results of models with pre-training.
• MASS (Song et al., 2019) pre-trains the Seq2Seq language model (LM) to predict a span of masked tokens.
• BERTSUMEXTABS (Liu and Lapata, 2019) applies BERT in text summarization. It is a two-stage fine-tuned model that first finetunes the encoder on the extractive summarization task and then on the abstractive summarization task.
• SAGCopy (Xu et al., 2020b) fine-tunes MASS by incorporating the importance score for source words into the copying module.
• PEGASUS (Zhang et al., 2020) adopts gap-sentence generation as the pre-training objective.
• ProphetNet (Qi et al., 2020) proposes to simultaneously predict the future n-gram at each time step for pre-training.
• PALM (Bi et al., 2020) incorporates the copy mechanism into the pre-training model.
First, we find that the models with pre-training outperform most of the models without pre-training, which shows the effectiveness of pre-training. Second, fine-tuning the BART model with attentional copying (i.e., BART + AttnCopy) improves the results over the original BART model we implemented (+0.14%/0.10%/0.13% for ROUGE-1/ROUGE-2/ROUGE-L). To evaluate the self-attention guided copy model (SAGCopy) (Xu et al., 2020b), we apply the SAGCopy mechanism to the BART model, obtaining superior results over BART (+0.19%/0.14%/0.15% for ROUGE-1/ROUGE-2/ROUGE-L). By comparison, the improvement of our proposed CoCoNet model is larger (+0.27%/0.20%/0.20% for ROUGE-1/ROUGE-2/ROUGE-L), which proves the necessity of the copying mechanism and the superiority of correlational copying over attentional copying (paired t-test, p-value < 0.05). Third, continuing to pre-train the CoCoNet model (i.e., CoCoNet + CoCoPretrain) leads to the best performance (+0.38%/0.34%/0.39% for ROUGE-1/ROUGE-2/ROUGE-L over the BART model). When we continue pre-training BART on the same data but without the copying mechanism (i.e., BART + Cont. Pre-train), the result outperforms BART by only a small margin, indicating that general pre-training on the selected data alone is not effective and that correlational copying is essential for pre-training. Fourth, we study the effectiveness of the semantic and positional correlations between source words (SemCorrelation and PosCorrelation, respectively): both are useful, and removing the positional correlation decreases performance more.

Results on SAMSum
The results on the SAMSum dataset are shown in Table 3.
• Longest-3 takes the three longest utterances as the summary.
• Fast Abs RL (Chen and Bansal, 2018) is a hybrid extractive-abstractive model with the policy-based reinforcement learning.
• DynamicConv (Wu et al., 2018) is a dynamic convolution model based on lightweight convolutions.
• D-HGN (Feng et al., 2020) is a dialogue heterogeneous graph network modeling the utterance and commonsense knowledge.
• TGDGA is a topic-word guided dialogue summarization method based on the graph attention model.
First, we find that the models with pre-training outperform the models without pre-training by a significant margin, possibly due to the small size of the dataset. Second, similar to the results on the CNN/DailyMail dataset, CoCoNet performs better than attentional copying and self-attention guided copying. Third, continuing to pre-train the CoCoNet model (i.e., CoCoNet + CoCoPretrain) achieves the best performance (+1.15%/1.41%/1.45% for ROUGE-1/ROUGE-2/ROUGE-L over the BART model). The improvement is larger than that on the CNN/DailyMail dataset. Looking into the datasets, we observe that copying is more common in the SAMSum dataset, with 14.4% of the source words reappearing in the target summary, as opposed to 10.7% in the CNN/DailyMail dataset. Thus, our proposed CoCoNet works more effectively on the SAMSum dataset.

Human Evaluation
Since readability (how easy a summary is to understand) and informativeness (how much important information is captured) are difficult to measure automatically, three expert annotators conduct a manual evaluation. They rate the readability and informativeness of 100 instances sampled from the test set on a scale of 1 to 5 (with 5 being the best). Results in Table 4 show that CoCoNet outperforms the PGNet and BART models. For informativeness, CoCoNet receives results comparable to BART, but it shows a significant increase in readability compared to BART, suggesting that the correlational copying mechanism is crucial to reducing reading difficulty.

Effect of Pre-Training Data Selection
We compare various strategies for selecting pre-training data according to Equation 22 with different values of λ_1, λ_2, and λ_3. The results are shown in Figure 3. Note that the y-axes are normalized by the result of the strategy using only ROUGE-1. First, we find that ROUGE-based strategies are significantly better than Random. Second, among single ROUGE measures, ROUGE-1 and ROUGE-2 are slightly better than ROUGE-L. Third, combining ROUGE-1 and ROUGE-2 with λ_1 = 1 and λ_2 = 2 achieves the best performance. We conclude that a suitable data selection strategy for pre-training benefits downstream summarization tasks, and we adopt "ROUGE-1 + 2 × ROUGE-2" in our work.

Can Our Model Copy More Accurately?
We have demonstrated qualitatively and quantitatively that CoCoNet improves the summarization model. But has our model learned to copy more accurately (especially for consecutive copying)? Figure 4 shows that the summaries generated by our CoCoNet + CoCoPretrain model contain a higher rate of "correct" n-grams (i.e., those that appear in both the input text and the reference summary), indicating that learning to copy from the copying history is beneficial to consecutive copying. On the other hand, we investigate whether our model triggers the over-copying problem (where source words are unnecessarily copied). We find that the average numbers of over-copied words for BART and CoCoNet + CoCoPretrain are 35.29 and 33.19 on CNN/DailyMail, and 8.21 and 7.84 on SAMSum, showing that our model can alleviate over-copying. Table 5 illustrates an example from the SAMSum dataset. BART generates a summary that contradicts the dialogue, saying "Mike's car is crashed"; in fact, the crashed car merely looks like Mike's. By contrast, CoCoNet successfully captures the correlation between "crashed into" and "a red Honda looking like". As a result, CoCoNet copies the correct information (highlighted) from the source text through correlational copying and expresses exactly the same idea as the reference.
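The "correct n-gram" rate used in this analysis can be computed with a short script (hypothetical helper names; tokenization is assumed to have been done already):

```python
def ngram_set(tokens, n):
    """All distinct n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def correct_ngram_rate(summary, source, reference, n=2):
    """Fraction of the generated summary's n-grams that appear in BOTH the
    input text and the reference summary, i.e. n-grams that were plausibly
    copied correctly."""
    cand = [tuple(summary[i:i + n]) for i in range(len(summary) - n + 1)]
    if not cand:
        return 0.0
    good = ngram_set(source, n) & ngram_set(reference, n)
    return sum(g in good for g in cand) / len(cand)
```

Averaging this rate over a test set for several values of n gives curves like those in Figure 4.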

Conclusion
We propose CoCoNet that can take advantage of prior copying distributions and encourage the decoder to copy the source word that is relevant to the previously copied one. We further enhance the copying ability through pre-training with the objective of text span generation. Our model gains new state-of-the-art results on the news summarization and dialogue summarization tasks.