Highlight-Transformer: Leveraging Key Phrase Aware Attention to Improve Abstractive Multi-Document Summarization

Abstractive multi-document summarization aims to generate a comprehensive summary covering salient content from multiple input documents. Compared with previous RNN-based models, Transformer-based models employ the self-attention mechanism to capture the dependencies in input documents and can generate better summaries. However, existing works have not considered key phrases when determining the attention weights of self-attention. Consequently, some of the tokens within key phrases receive only small attention weights. This can prevent the key phrases that convey the salient ideas of the input documents from being completely encoded. In this paper, we introduce the Highlight-Transformer, a model with a highlighting mechanism in the encoder that assigns greater attention weights to the tokens within key phrases. We propose two structures of highlighting attention for each head, as well as the multi-head highlighting attention. Experimental results on the Multi-News dataset show that our proposed model significantly outperforms competitive baseline models.


Introduction
Abstractive Multi-Document Summarization (MDS) poses the challenge of generating a comprehensive summary of multiple related documents. It requires summarization models to capture the salient content from input documents. Compared with previous RNN-based models for abstractive MDS, Transformer-based models (Gehrmann et al., 2018; Liu and Lapata, 2019a; Li et al., 2020b) employ the self-attention mechanism to capture the dependencies in input documents, and they can generate better summaries.
Calculating attention weights is a crucial step in the self-attention mechanism. Input documents usually contain key phrases that convey their salient ideas. However, existing works have not considered key phrases in determining the attention weights of self-attention. Key phrases usually comprise multiple tokens, which should be highly related and serve as a complete grammatical unit in the input documents. When testing Transformer-based models, we observe that some of the tokens within key phrases receive only small attention weights, which can prevent key phrases, and the salient ideas they convey, from being completely encoded.

[Figure 1: The highlighting mechanism assigns greater attention weights to tokens within key phrases, as indicated by the highlighting matrix. Panels: Highlighting Matrix, Self-Attention Weight Matrix, Highlighting Attention Weight Matrix.]
In this paper, we propose the Highlight-Transformer, an abstractive summarization model with a highlighting mechanism in the encoder. As depicted in Figure 1, the highlighting mechanism assigns greater attention weights to tokens within key phrases. The highlighting mechanism comprises three parts: the highlighting matrix, the highlighting attention, and the multi-head highlighting attention.
Our work is inspired by previous studies in education and psychology indicating that key phrases are important for people to understand (Rello et al., 2014; Hargreaves and Crabb, 2016) and summarize (Benzer et al., 2016; Chou, 2012) the given documents. Highlighting key phrases can help people with dyslexia improve comprehension (Rello et al., 2014; Hargreaves and Crabb, 2016). These findings are instructive for improving the self-attention mechanism.
We build a highlighting matrix for each input token sequence to indicate the positions of key phrases in the attention weight matrix and the phrases' importance values. We propose two structures of highlighting attention for each head to adjust attention weights according to the phrase importance. After comparing the effects of adopting the highlighting attention in different numbers of heads and layers, we discover that adopting it in a subset of heads surpasses adopting it in all heads. Experimental results on the Multi-News dataset (Fabbri et al., 2019) show that our proposed model significantly improves the ROUGE scores (Lin, 2004) of generated summaries.
Our contribution is threefold:
• We present the highlighting mechanism that assigns greater attention weights to the tokens within key phrases.
• We propose the multi-head highlighting attention and two structures of highlighting attention for each head to combine attention weights with the phrase importance.
• Our proposed model significantly outperforms the competitive baseline models on the Multi-News dataset.

Related Work
Previous encoder-decoder models (Rush et al., 2015; Nallapati et al., 2016; Paulus et al., 2018; Chopra et al., 2016) equipped with the attention mechanism (Bahdanau et al., 2015) have achieved strong performance on abstractive summarization. However, they were found to miss some important content in input documents (Li et al., 2018; Xu et al., 2020). How to retain the key information of input documents in the generated summaries has received increasing attention in the past few years. Some previous works focus on improving the copy mechanism. Gehrmann et al. (2018) utilize attention masks to restrict copying phrases from the selected parts of an input document. Xu et al. (2020) explicitly guide the copy process with the centrality of each source word. Several papers also explore the potential of enhancing the encoder. Li et al. (2018, 2020a) extend pointer-generator-based models (See et al., 2017) with a separate LSTM-based encoder to obtain the keywords' representation and then combine it with the sentence representation. In this work, we explore the potential of leveraging phrase importance as guidance to adjust the attention weights in the multi-head self-attention of the Transformer encoder.

Model
In this section, we present the Highlight-Transformer, a model with the highlighting mechanism. We introduce its three main components: the highlighting matrix, the highlighting attention for each head, and the multi-head highlighting attention. We focus on the encoder part; our decoder follows the CopyTransformer model used in (Gehrmann et al., 2018; Fabbri et al., 2019). Each input example of our proposed model includes the source articles, the articles' key phrases, and the phrases' importance values. The automatic key phrase extraction method we use is introduced in the Data Preparation section.

Highlighting Matrix
The first step of the highlighting mechanism is to build a highlighting matrix for each input example. It indicates the positions of key phrases in the attention weight matrix and the phrases' importance values. The concatenated source articles can be represented as an input sequence $(t_1, \ldots, t_n)$ containing $n$ tokens. We use $(p_1, \ldots, p_k)$ and $(v_1, \ldots, v_k)$ to denote the key phrases and their importance values. For each input example, we build the highlighting matrix $H \in \mathbb{R}^{n \times n}$ with the same shape as the self-attention weight matrix. Assuming a phrase $p_r$ spans $b$ tokens $(t_a, \ldots, t_{a+b-1})$ of the input sequence, its importance value $v_r$ is added to the elements $H_{i,j}$ with $a \le i, j \le a+b-1$. Phrases can be overlapping or nested, so a token $t_i$ may be contained in $c$ phrases; in that case, the diagonal element $H_{i,i}$ accumulates the sum of those $c$ phrases' importance values.
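To make the construction concrete, the following is a minimal sketch of how such a highlighting matrix could be built for one example. The function and variable names are illustrative and not taken from the authors' implementation; phrase spans and importance values are assumed to come from the key-phrase extraction step described later.

```python
import numpy as np

def build_highlighting_matrix(n_tokens, phrase_spans, importance_values):
    """Sketch: phrase_spans holds (start, end) token indices (inclusive) of each
    key phrase; importance_values holds one score per phrase (e.g., its tf-idf)."""
    H = np.zeros((n_tokens, n_tokens), dtype=np.float32)
    for (start, end), v in zip(phrase_spans, importance_values):
        # Add the phrase's importance over the block covering its token span.
        # Overlapping or nested phrases accumulate, so a token shared by several
        # phrases ends up with the sum of their importance values on H[i, i].
        H[start:end + 1, start:end + 1] += v
    return H

# Example: a 6-token sequence with two key phrases, one nested inside the other.
H = build_highlighting_matrix(6, [(1, 2), (1, 3)], [0.7, 0.4])
```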

Highlighting Attention
The highlighting attention is the key component in our proposed model for adjusting attention weights according to the phrase importance. For head $m$, the Transformer model (Vaswani et al., 2017) adopts the scaled dot-product attention that operates on a query $Q$, a key $K$, and a value $V$:

$$\mathrm{Head}^m = W^m V, \qquad (1a)$$

$$W^m = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right), \qquad (1b)$$

where $W^m \in \mathbb{R}^{n \times n}$ is the attention weight matrix and $d_k$ is the dimensionality of the keys. In the encoder layers, queries, keys, and values come from the output of the previous layer.
The highlighting matrix H can be used to determine which elements in the attention weight matrix should be increased. We propose two structures of highlighting attention, namely the weighted highlighting attention and the additive highlighting attention, to adjust attention weights according to the phrase importance.
The weighted highlighting attention modifies Equation (1b) to calculate the attention weight matrix $W^m$ for head $m$. The highlighting matrix $H$ is multiplied by a scalar $\alpha$, named the brightness factor, and the product is added to the input of the softmax function:

$$W^m = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + \alpha H\right). \qquad (2)$$

Since the softmax function applies the exponential function to each input element and divides it by the sum of all these exponentials, the above additive operation can be seen as calculating a weighted average.
The additive highlighting attention is also designed to adjust the attention weight matrix $W^m$. The product of the highlighting matrix $H$ and the scalar $\alpha$ is normalized by the softmax function and added to the original attention weight matrix $W^m_a$ calculated by Equation (1b):

$$W^m_b = W^m_a + \mathrm{softmax}(\alpha H). \qquad (4a)$$

The elements of $W^m_b$ are then normalized so that the attention weights sum to one along the dimension over which the softmax is applied:

$$W^m_{i,j} = \frac{(W^m_b)_{i,j}}{\sum_{j'} (W^m_b)_{i,j'}}. \qquad (4b)$$
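The two variants can be summarized by the following sketch for a single head, written with PyTorch tensors. Only the way the scores are combined with the highlighting matrix H follows the equations above; tensor shapes, masking, and dropout are simplifications.

```python
import math
import torch
import torch.nn.functional as F

def weighted_highlighting_attention(Q, K, V, H, alpha):
    # Equation (2): alpha * H is added to the scaled dot-product scores before
    # the softmax, so positions inside key phrases receive larger weights.
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k) + alpha * H
    W = F.softmax(scores, dim=-1)
    return W @ V

def additive_highlighting_attention(Q, K, V, H, alpha):
    # Equation (4a): softmax-normalized alpha * H is added to the original
    # attention weights; Equation (4b) then re-normalizes each row to sum to one.
    d_k = K.size(-1)
    W_a = F.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    W_b = W_a + F.softmax(alpha * H, dim=-1)
    W = W_b / W_b.sum(dim=-1, keepdim=True)
    return W @ V
```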

Multi-Head Highlighting Attention
In our proposed model, the encoder with hidden size $d_{\text{model}}$ consists of $N$ layers and $h$ heads. Each encoder layer has two sub-layers: the multi-head highlighting attention layer and the position-wise fully connected feed-forward network. We propose the multi-head highlighting attention mechanism, which employs the highlighting attention on $p$ highlighted heads and the scaled dot-product attention on the remaining $(h - p)$ normal heads:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{Head}^1, \ldots, \mathrm{Head}^h)\, W^o,$$

where the projection is a parameter matrix $W^o \in \mathbb{R}^{h d_v \times d_{\text{model}}}$. Each matrix $\mathrm{Head}^i$ is calculated by Equation (1a). The attention weight matrix $W$ of the highlighted heads is calculated by Equation (2) or (4), and that of the normal heads by Equation (1b). The outputs of all heads are concatenated and then projected through a feed-forward layer.
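As an illustration, a possible implementation of the multi-head highlighting attention (with the weighted variant on the highlighted heads) is sketched below. The module and parameter names are our own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MultiHeadHighlightingAttention(nn.Module):
    def __init__(self, d_model=512, h=8, p=2, alpha=1.0):
        super().__init__()
        assert d_model % h == 0
        self.d_k, self.h, self.p, self.alpha = d_model // h, h, p, alpha
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)  # the output projection W^o

    def forward(self, x, H):
        # x: (batch, n, d_model); H: (batch, n, n) highlighting matrix.
        b, n, _ = x.size()
        split = lambda t: t.view(b, n, self.h, self.d_k).transpose(1, 2)
        Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5        # (b, h, n, n)
        # The first p heads are highlighted heads (weighted highlighting
        # attention); the remaining h - p heads use plain scaled dot-product.
        bias = torch.zeros_like(scores)
        bias[:, :self.p] = self.alpha * H.unsqueeze(1)
        W = torch.softmax(scores + bias, dim=-1)
        out = (W @ V).transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.o_proj(out)
```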

Data Preparation
We train and evaluate our model on an MDS dataset named Multi-News (Fabbri et al., 2019), in which each example includes multiple news articles about the same event and a human-written summary collected from the website newser.com. Following the setting in (Fabbri et al., 2019), we truncate each input article to 500/S tokens for an example with S news articles and concatenate the truncated articles into a single document. For each example, we first filter out stopwords and select candidate phrases from the truncated source articles. We then use the scikit-learn library to calculate the candidate phrases' tf-idf values (Salton and Buckley, 1988) as their importance values. The candidate phrases are sorted in descending order of their importance values. We only select the top-10 bigrams or trigrams as key phrases in each example, since we observe that longer phrases are sparse and more likely to be compressed in the summary. Each input example of our proposed model thus includes the source articles and the key phrases together with their L2-normalized tf-idf values.
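A rough sketch of this key-phrase selection step is shown below, using scikit-learn's TfidfVectorizer (which L2-normalizes tf-idf vectors by default). Tokenization, truncation, and stopword handling are simplified here, so this is an approximation of the procedure rather than the exact pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_key_phrases(truncated_examples, top_k=10):
    """truncated_examples: one concatenated, truncated source document per example."""
    vectorizer = TfidfVectorizer(ngram_range=(2, 3), stop_words="english", norm="l2")
    tfidf = vectorizer.fit_transform(truncated_examples)
    vocab = vectorizer.get_feature_names_out()
    key_phrases = []
    for row in tfidf:                      # one sparse row per example
        scores = row.toarray().ravel()
        top = scores.argsort()[::-1][:top_k]
        key_phrases.append([(vocab[i], float(scores[i])) for i in top if scores[i] > 0])
    return key_phrases
```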

Experimental Setting
We adopt a 4-layer encoder and a 4-layer decoder to build our proposed model, in which each layer has eight attention heads. Both the word embedding size and the hidden size are set to 512. The maximum size of the vocabulary is set to 50,000. We implement our model with the OpenNMT-py framework (Klein et al., 2017).
The optimizer is Adam (Kingma and Ba, 2015) with learning rate 2, β1 = 0.9, and β2 = 0.998. Learning rate warmup is adopted to linearly increase the learning rate over the first 8,000 steps and then decrease it following the setting in (Vaswani et al., 2017). In addition, the brightness factor α in the highlighting attention progressively decreases at the end of each epoch. Following the setting in (Fabbri et al., 2019), we also apply label smoothing (Szegedy et al., 2016) with smoothing factor 0.1 and dropout (Srivastava et al., 2014) with probability 0.2.
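For reference, the warmup schedule of Vaswani et al. (2017) with the factor of 2 used here, together with one possible per-epoch decay of the brightness factor, could look like the sketch below. The exact decay schedule of α is not specified above, so the linear decay is purely a hypothetical example.

```python
def noam_lr(step, d_model=512, warmup=8000, factor=2.0):
    # lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def decayed_alpha(epoch, alpha0=1.0, num_epochs=20):
    # Hypothetical linear decay of the brightness factor at the end of each epoch.
    return alpha0 * max(0.0, 1.0 - epoch / num_epochs)
```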
During testing, we use beam search with a beam size of 5. We also use trigram blocking to reduce repetitions. Our models are trained and evaluated on one NVIDIA Tesla P100 GPU.
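Trigram blocking during beam search can be sketched as the check below: a candidate token is pruned if appending it would repeat a trigram already present in the hypothesis. The decoding loop itself is omitted, and the function name is ours.

```python
def violates_trigram_blocking(hypothesis, candidate):
    """hypothesis: token ids generated so far; candidate: the next token id."""
    if len(hypothesis) < 2:
        return False
    new_trigram = tuple(hypothesis[-2:]) + (candidate,)
    seen = {tuple(hypothesis[i:i + 3]) for i in range(len(hypothesis) - 2)}
    return new_trigram in seen
```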

Baselines
We compare our proposed Highlight-Transformer model with the following extractive and abstractive summarization methods.
A tf-idf-based extractive summarization method is evaluated for comparison with introducing the tf-idf score into our abstractive method.
BertExt (Liu and Lapata, 2019b) stacks intersentence Transformer layers on top of the pretrained BERT (Devlin et al., 2019). We fine-tune this model on the Multi-News training set.
PG and PG-MMR are the pointer-generator (PG) network based summarization models reported by Lebanoff et al. (2018).
Hi-MAP (Fabbri et al., 2019) extends the PG network into a hierarchical network. The attention distribution of tokens is multiplied by the MMR score of the sentence to which they belong.

SAGCopy (Xu et al., 2020) adds the word centrality score to the linearly transformed hidden state when calculating the copy distribution.
BertAbs (Liu and Lapata, 2019b) adopts the pretrained BERT as the encoder. A decoder with six Transformer layers is initialized randomly. We fine-tune this model on the Multi-News training set.
CopyTransformer (Gehrmann et al., 2018; Fabbri et al., 2019) adds the copy mechanism (See et al., 2017) to a 4-layer Transformer model. The decoder of our model follows its architecture.
Results

As shown in Table 1, the Highlight-Transformer significantly outperforms these baseline models on all metrics, which demonstrates the effectiveness of the highlighting mechanism. Compared with the additive highlighting attention, the weighted highlighting attention is more favorable.

[Table 3: Human evaluation results. "Win" means the summary generated by our proposed model is better than that of CopyTransformer in the given aspect.]
We also compare the effects of adopting the weighted highlighting attention in different numbers of heads and layers in the encoder of our proposed model. The results on the test set of Multi-News are summarized in Table 2. It reveals that adopting it in a quarter of the heads and half of the layers achieves the best performance. We discover that adopting highlighting attention in a subset of heads surpasses adopting it in all heads. Besides, applying the multi-head highlighting attention on all layers of the encoder is also not optimal.
Multi-head attention in the Transformer model (Vaswani et al., 2017) is designed for jointly attending to information from different representation sub-spaces. Voita et al. (2019) find that the heads of a Transformer model trained on a neural machine translation dataset have specialized functions and focus on different types of information. Adopting the highlighting attention in all heads and layers therefore hinders the Transformer-based model from encoding other types of useful information and leads to performance degradation.
In addition to the automatic evaluation, we performed a human evaluation to compare the generated summaries in terms of informativeness (the coverage of information from input documents), fluency (content organization and grammatical correctness), and non-redundancy (less repetitive information). We randomly selected 50 samples from the test set of the Multi-News dataset. Four annotators were asked to compare the two models' generated summaries, which were presented anonymously. We also assess inter-annotator agreement with Fleiss' kappa (Fleiss, 1971). The human evaluation results in Table 3 show that the Highlight-Transformer significantly outperforms the CopyTransformer in terms of informativeness and is comparable in terms of fluency and non-redundancy.
The ablation study aims to validate the effectiveness of individual components in our proposed model. In Table 4, "w/o highlight attn" refers to the CopyTransformer model used in (Gehrmann et al., 2018; Fabbri et al., 2019). The results confirm that incorporating the highlighting attention is beneficial for multi-document summarization, and the decreasing brightness factor α also benefits our model's performance. Besides, we tried replacing the self-attention weight matrices in a quarter of the heads and half of the layers with the highlighting matrices. The resulting performance degradation reveals that it is important to combine the attention weights with the phrase importance rather than directly replacing the attention weights.

[Table 4: Ablation study on the Multi-News test set. "brightness" denotes the brightness factor α and "highlight attn" denotes the highlighting attention.]

Conclusion
In this paper, we introduce the Highlight-Transformer, a novel summarization model with the highlighting mechanism in the encoder. The highlighting mechanism assigns greater attention weights to the tokens within key phrases, and it comprises three main parts: the highlighting matrix, the highlighting attention, and the multi-head highlighting attention. Specifically, a block-diagonal highlighting matrix is built for each input token sequence to indicate the positions and importance values of key phrases. For each head, we propose and compare two structures of highlighting attention. Furthermore, we also compare the effects of adopting the weighted highlighting attention in different numbers of heads and layers in the encoder of our proposed model. The experimental results demonstrate the effectiveness of our proposed model. In future work, we intend to incorporate more phrase-level and sentence-level information into Transformer-based summarization models and evaluate them on different datasets.