Entity-Aware Abstractive Multi-Document Summarization

Introduction
Multi-document summarization aims to generate a short and informative summary from a set of topic-related documents. The task can be more challenging than single-document summarization due to the presence of diverse and potentially conflicting information (Ma et al., 2020).
While significant progress has been made in single-document summarization, the mainstream sequence-to-sequence models, which can perform well on single documents, often struggle to extract salient information and handle redundancy in the presence of multiple, long documents. Thus, directly applying models shown effective for single-document summarization to the multi-document setting may not lead to ideal results (Lebanoff et al., 2018; Zhang et al., 2018; Baumel et al., 2018).
Several previous research efforts have shown that modeling cross-document relations is essential in multi-document summarization (Liu and Lapata, 2019a; Li et al., 2020). Such relations were shown useful for identifying salient and redundant information in long documents, and can thus guide the summary generation process. However, while effective empirically, such approaches do not explicitly model the underlying semantic information across documents.
Entities and their mentions convey rich semantic information and can be significant in summarization, especially when a specific entity is the topic under discussion for a set of documents. As shown in Figure 1, entity mentions appear frequently in the input article and play unique roles that contribute to the coherence and conciseness of the text. We believe that entities can be regarded as indicators of saliency and can be used to reduce redundancy. This motivates us to propose an entity-aware abstractive multi-document summarization model that effectively encodes relations across documents with the help of entities, and explicitly addresses the issues of saliency and redundancy.
Inspired by Wang et al. (2020a), we build a heterogeneous graph that consists of nodes representing documents and entities. The entity nodes can serve as bridges that connect different documents: we can model the relations across documents through entity clusters. We apply the graph attention network (GAT) (Veličković et al., 2017) to enable information flow between nodes and iteratively update the node representations. In the decoding process, we design a novel two-level attention mechanism. The decoder first attends to the entities; the attention weights of the entities are then combined with the graph edge weights to guide the attention to the documents. Intuitively, the first stage identifies the salient content at each decoding step, while the second stage, by considering the global interactions between entities and documents in the graph, handles the redundancy issue. Experiments show that our model significantly improves the performance on several multi-document datasets. Further improvements can be made when our model is used together with pre-trained language models.
Our contributions are as follows:
• We construct a heterogeneous graph network for multi-document summarization. The graph consists of document-level and entity-level nodes. To the best of our knowledge, we are the first to model the relations between documents and entities in one heterogeneous graph. Experiments show that exploiting entity nodes as the intermediary between documents can be more effective than exploiting other semantic units (e.g., words).
• We propose a novel two-level attention mechanism for the decoding process, explicitly addressing the issues of saliency and redundancy. The mechanism also reduces the computational cost, making it easier to process long inputs.

Related Work

Abstractive Document Summarization
Abstractive summarization is often regarded as the ultimate goal of document summarization research. Extractive summarization methods produce summaries that are semantically similar to the original documents, and may thus achieve relatively high ROUGE scores (Lin, 2004). However, sentence-level extraction lacks flexibility and tends to produce redundant information. By contrast, the process of abstractive summarization is closer to how humans summarize and requires more sophisticated natural language understanding and generation techniques. Traditional approaches to abstractive summarization can be divided into sentence fusion-based (Barzilay and McKeown, 2005; Filippova and Strube, 2008; Banerjee et al., 2015), paraphrasing-based (Bing et al., 2015; Cohn and Lapata, 2009) and information extraction-based (Li, 2015; Wang and Cardie, 2013; Pighin et al., 2014) methods.
With the development of neural-based methods, abstractive methods have achieved promising results on single-document summarization (See et al., 2017; Paulus et al., 2018; Gehrmann et al., 2018; Li et al., 2018). More recently, owing to their excellent performance on various text generation tasks, transformer-based methods, together with pre-trained language models, have become the mainstream approach for abstractive multi-document summarization. Liu and Lapata (2019b) propose BERTSUM for both extractive and abstractive summarization. Zhang et al. (2019) build low-level and high-level BERT encoders for sentence and document understanding, respectively. Moreover, several general-purpose sequence-to-sequence pre-trained models have been proposed, such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2020), which are further fine-tuned for the summarization task. Zhang et al. (2020) propose PEGASUS, in which they design a pre-training objective tailored for abstractive text summarization. Zou et al. (2020) present three sequence-to-sequence pre-training objectives that reinstate source text for abstractive summarization.

Graph-based Document Summarization
Graph-based methods have long been utilized for extractive summarization. Text units on graphs are ranked, and the most salient ones are selected for inclusion in the summary. LexRank (Erkan and Radev, 2004) computes sentence salience based on the eigenvector centrality in the connectivity graph of inter-sentence cosine similarity. Wan (2008) further incorporates document-level information and the sentence-to-document relationship into the graph-based ranking process. Christensen et al. (2013) build multi-document graphs to approximate the discourse relations across sentences based on indicators including discourse cues, deverbal nouns, and co-reference.
Among recent methods based on graph neural networks, Tan et al. (2017) propose a graph-based attention mechanism to identify salient sentences. Yasunaga et al. (2017) construct an approximate discourse graph based on discourse markers and entity links, then apply graph convolutional networks over the relation graph. Fan et al. (2019) construct a local knowledge graph, which is then linearized into a structured input sequence so that it can be encoded within the sequence-to-sequence setting. Huang et al. (2020) further design a graph encoder, which improves upon graph attention networks, to maintain the global context and local entities complementing each other. Li et al. (2020) utilize homogeneous graphs to capture cross-document relations and guide the summary generation process. Wang et al. (2020a) are the first to introduce text nodes of different granularity levels to construct heterogeneous graphs for extractive summarization. Our work is partly similar to theirs, but we construct heterogeneous graphs composed of text unit nodes and entity nodes for abstractive multi-document summarization.

Summarization with Additional Features
In addition to the direct application of the general sequence-to-sequence framework, researchers have attempted to incorporate various features into summarization. Cao et al. (2018) extract actual fact descriptions from the source text and propose a dual-attention mechanism to condition the generation on both the source text and the extracted fact descriptions. Sharma et al. (2019) take a pipeline approach to single-document summarization, composed of an entity-aware content selection module and a summary generation module. By contrast, our EMSum model is an end-to-end method for multi-document summarization. Gunel et al. (2020) inject structural world knowledge from Wikidata into a transformer-based model, making the model more fact-aware. Zhu et al. (2020) extract factual relations from the source texts to build a local knowledge graph and integrate it into a transformer-based model.
Apart from entity or fact information, several works incorporate topic information into the summarization model. Narayan et al. (2018) propose an encoder that associates each word with a topic vector capturing whether it is representative of the document's content, and a decoder where each word prediction is conditioned on a document topic vector. Zheng et al. (2019) propose to mine cross-document subtopics; in their work, sentence salience is estimated hierarchically from subtopic salience and relative sentence salience. Perez-Beltrachini et al. (2019) explicitly model the topic structure of summaries and utilize it to guide a structured convolutional decoder. Wang et al. (2020b) rearrange and further explore the semantics of the topic model and develop a friendly topic assistant for transformer-based abstractive summarization models.

Model
Our model is illustrated in Figure 2 and follows the transformer-based encoder-decoder architecture (Vaswani et al., 2017). We modify the encoder with graph neural networks so that entity information and graph representations can be incorporated at the same time, and we design a novel two-level decoding process to deal explicitly with the problems of saliency and redundancy.

Entity Cluster Extraction
Wang et al. (2020a) use words as semantic units in addition to sentence nodes, acting as the intermediary that enriches the relationships between sentences. However, we argue that word-level semantic units are too fine-grained and incur huge computational costs. For multi-document summarization, models are usually required to process tens of documents, so the total number of words becomes vast, hindering graph construction and message passing. We therefore use entity clusters as higher-level semantic units. We utilize the co-reference resolution tool (Lee et al., 2017) from AllenNLP (Gardner et al., 2018) to extract entity clusters. Note that we perform the extraction globally: we concatenate all the documents into one long document. We denote the extracted entity clusters as $C = \{C_1, C_2, \ldots, C_m\}$, where $C_i = \{\mathrm{mention}_1, \mathrm{mention}_2, \ldots, \mathrm{mention}_l\}$ and $l$ is the number of entity mentions in cluster $C_i$.
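For illustration, this extraction step can be sketched with AllenNLP's coreference predictor as follows; the model archive URL and the post-processing shown here are assumptions for the sketch rather than our exact pipeline.

```python
from allennlp.predictors.predictor import Predictor

# A publicly released AllenNLP coreference model (illustrative path; our
# experiments use the model of Lee et al. (2017)).
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2020.02.27.tar.gz"
)

def extract_entity_clusters(documents):
    """Concatenate all documents and extract co-reference clusters globally."""
    result = predictor.predict(document=" ".join(documents))
    tokens = result["document"]    # token list of the concatenated input
    clusters = result["clusters"]  # each cluster is a list of [start, end] token spans
    return [[" ".join(tokens[s:e + 1]) for s, e in cluster] for cluster in clusters]
```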

Graph Construction
Given a source document cluster $D$, we first divide it into smaller semantic units $P = \{P_1, P_2, \ldots, P_n\}$, such as paragraphs or sentences, depending on the characteristics of the dataset. We then construct a heterogeneous graph $G = (V, E)$, where $V$ includes the paragraph nodes $V_p$ and the entity cluster nodes $V_c$, and $E$ represents undirected edges between nodes. There are no edges among paragraph nodes or among entity cluster nodes, only between the two types. An edge connecting $P_i$ and $C_j$ means that paragraph $P_i$ contains an entity mention in cluster $C_j$.
We would like to include more information in the graph. From the extraction we obtain an occurrence matrix $E \in \mathbb{R}^{m \times n}$, where $e_{ij} \neq 0$ indicates that paragraph $P_j$ contains mentions of cluster $C_i$ exactly $e_{ij}$ times. Based on $E$, we further calculate the TF-IDF value matrix $\tilde{E} \in \mathbb{R}^{m \times n}$ to model the importance of the relationships between entity clusters and paragraphs.
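As a concrete sketch, the TF-IDF edge weights can be computed from the occurrence matrix as follows; the exact TF-IDF variant is an assumption.

```python
import numpy as np

def tfidf_edge_weights(counts: np.ndarray) -> np.ndarray:
    """counts: (m, n) occurrence matrix, where counts[i, j] is the number of
    times cluster C_i is mentioned in paragraph P_j. Returns E-tilde."""
    m, n = counts.shape
    tf = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)  # frequency within each paragraph
    df = np.maximum((counts > 0).sum(axis=1, keepdims=True), 1)     # paragraphs mentioning each cluster
    idf = np.log(n / df)
    return tf * idf
```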

Document Encoder
Paragraph Encoder Several token-level transformer encoding layers are stacked to encode the contextual information within each paragraph. Each layer is identical to the vanilla transformer layer (Vaswani et al., 2017). Let $x^0_w$ be the input token vector. The $l$-th layer takes $x^{l-1}_w$ as input and computes the hidden state $h^l_w$:

$$\tilde{h}^l_w = \mathrm{LayerNorm}(x^{l-1}_w + \mathrm{MHAttn}(x^{l-1}_w))$$

$$h^l_w = \mathrm{LayerNorm}(\tilde{h}^l_w + \mathrm{FFN}(\tilde{h}^l_w))$$

where LayerNorm is the layer normalization operation (Ba et al., 2016), MHAttn is the multi-head attention of Vaswani et al. (2017), and FFN is a feed-forward network with ReLU activation. We take the output of the last layer as the token-level features and use $H_{pw} \in \mathbb{R}^{n_w \times d_w}$ to denote the token-level feature matrix, where $n_w$ is the total number of tokens in all paragraphs and $d_w$ is the dimension of the token embedding.
Multi-Head Pooling To obtain fixed-length paragraph representations, we follow Liu and Lapata (2019a) and apply a weighted-pooling operation. The multi-head pooling mechanism calculates weight distributions over tokens, allowing the model to flexibly encode paragraphs in different representational subspaces with different heads.
We use $H_p \in \mathbb{R}^{n \times d_h}$ to denote the paragraph-level feature matrix, where $n$ is the number of paragraphs and $d_h$ is the hidden size.
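A minimal PyTorch sketch of the multi-head pooling operation is given below; the exact projection layout follows our reading of Liu and Lapata (2019a) and is an assumption.

```python
import torch
import torch.nn as nn

class MultiHeadPooling(nn.Module):
    """Weighted pooling over tokens with one attention distribution per head."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.score = nn.Linear(d_model, n_heads)  # one scalar score per head per token
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, tokens, mask):
        # tokens: (batch, seq, d_model); mask: (batch, seq), True at padding positions
        b, s, _ = tokens.shape
        scores = self.score(tokens).masked_fill(mask.unsqueeze(-1), float("-inf"))
        attn = torch.softmax(scores, dim=1)                # (b, s, heads)
        values = self.value(tokens).view(b, s, self.n_heads, self.d_head)
        pooled = (attn.unsqueeze(-1) * values).sum(dim=1)  # (b, heads, d_head)
        return self.out(pooled.reshape(b, -1))             # (b, d_model)
```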
Entity Cluster Encoder We perform the same encoding process as the paragraph encoder to obtain the entity clusters' representations, without sharing parameters between the two encoders. We choose this method rather than a separate entity embedding scheme because we seek to model the relationship between paragraphs and entities in a unified semantic space. Note that we first remove pronouns and stopwords from the entity mention clusters; they are common in co-reference resolution results but provide little benefit for our semantic modeling. We use $H_{cw} \in \mathbb{R}^{m_w \times d_w}$ and $H_c \in \mathbb{R}^{m \times d_h}$ to denote the token-level and cluster-level feature matrices, respectively.

Graph Encoder
We use graph attention networks (GAT) (Veličković et al., 2017) to update the representations of the semantic nodes. We use $i, j \in \{1, \ldots, m+n\}$ to denote arbitrary nodes in the graph, $h_i, h_j \in \mathbb{R}^{d_h}$ to denote the node representations, and $\mathcal{N}_i$ to denote the set of neighboring nodes of node $i$. The GAT layer is designed as follows:

$$z_{ij} = \mathrm{LeakyReLU}\left(W_a [W_q h_i \,;\, W_k h_j]\right)$$

$$\alpha_{ij} = \frac{\sigma(\tilde{e}_{ij}) \exp(z_{ij})}{\sum_{l \in \mathcal{N}_i} \sigma(\tilde{e}_{il}) \exp(z_{il})} \quad (5)$$

$$u_i = \sum_{j \in \mathcal{N}_i} \alpha_{ij} W_v h_j$$

where $W_a$, $W_q$, $W_k$, $W_v$ are trainable weights, $\sigma$ is the sigmoid function, and $\tilde{e}_{ij}$ is the edge weight derived from the TF-IDF value matrix $\tilde{E}$. We basically follow Wang et al. (2020a) in iteratively updating the node representations. However, they infuse the scalar edge weight $\tilde{e}_{ij}$ by discretizing the real values into integers and learning an embedding for each integer, thereby mapping the weights into a multi-dimensional embedding space $e_{ij} \in \mathbb{R}^{d_e}$; the information contained in the values must then be learned through an additional embedding matrix. We argue that the TF-IDF values themselves already indicate the closeness between an entity cluster and a paragraph. Therefore, we directly incorporate the raw TF-IDF information into the GAT mechanism by modifying the attention weights, as in Equation 5.
We combine GAT with the multi-head operation, and add a residual connection to avoid vanishing gradients after several iterations:

$$h'_i = h_i + u_i$$

We use the above GAT layer and a position-wise feed-forward layer to iteratively update the node representations. Each iteration contains a paragraph-to-entity and an entity-to-paragraph updating process. After iterating $t$ times, we concatenate the updated paragraph representation $\tilde{H}_p$ to each corresponding input token vector, arriving at $\tilde{H}_{pw} \in \mathbb{R}^{n_w \times (d_w + d_h)}$.
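The sketch below shows a single-head version of this edge-weighted GAT layer; multi-head stacking and the feed-forward sublayer are omitted, and the way the sigmoid-gated TF-IDF weight enters the softmax follows our reconstruction in Equation 5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeWeightedGATLayer(nn.Module):
    """GAT layer whose attention is modulated by scalar TF-IDF edge weights."""
    def __init__(self, d: int):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.W_a = nn.Linear(2 * d, 1, bias=False)

    def forward(self, h_i, h_j, adj, edge_w):
        # h_i: (N, d) nodes being updated; h_j: (M, d) neighbor nodes
        # adj: (N, M) boolean adjacency; edge_w: (N, M) TF-IDF edge weights
        q, k, v = self.W_q(h_i), self.W_k(h_j), self.W_v(h_j)
        pairs = torch.cat([q.unsqueeze(1).expand(-1, k.size(0), -1),
                           k.unsqueeze(0).expand(q.size(0), -1, -1)], dim=-1)
        z = F.leaky_relu(self.W_a(pairs)).squeeze(-1)    # (N, M) attention logits
        z = z + torch.log(torch.sigmoid(edge_w) + 1e-9)  # inject raw TF-IDF (Eq. 5)
        z = z.masked_fill(~adj, float("-inf"))
        alpha = torch.softmax(z, dim=-1)
        return h_i + alpha @ v                           # residual connection
```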

Entity-Aware Decoder with Two-level Attention
Under the multi-document summarization setting, the input source documents may involve an extremely large number of tokens. If the decoder computed attention weights over all tokens, the cost would be very high and the attention could be dispersed. Our two-level decoding process first focuses on a few central entity cluster nodes, which can be regarded as indicators of saliency. This restricts the token-level attention to a subset of the paragraphs, which further reduces redundancy compared with naively attending to all tokens. Different from Section 3.4, here we use $i$ and $j$ to denote entity nodes and paragraph nodes, respectively.
Attending the Entity Cluster Nodes At each decoding step $t$, given the decoder state $s_t$, we compute attention scores over the entity cluster nodes $c_i$:

$$z_i = \frac{\exp(s_t^\top W_z c_i)}{\sum_{i'} \exp(s_t^\top W_z c_{i'})}$$

The entity nodes act as the intermediary between paragraph nodes. We incorporate $z_i$ with the edge weights $\tilde{e}_{ij}$ to enable information flow from entity nodes to paragraph nodes:

$$\beta_j = \sum_{i} z_i \, \tilde{e}_{ij}$$

Attending the Paragraph Tokens We select the top-$k$ paragraph nodes with the highest attention scores $\beta_j$, and apply the attention mechanism over the $T_w$ tokens in the selected paragraphs, obtaining token weights $\gamma_{w_i}$. For token $w_i$ in paragraph $P_j$, we further modify $\gamma_{w_i}$ by

$$\tilde{\gamma}_{w_i} = \frac{\gamma_{w_i} \beta_j}{\sum_{w_{i'}} \gamma_{w_{i'}} \beta_{j'}}$$

where $j'$ indexes the paragraph containing token $w_{i'}$. The context vector $v_t$ can then be computed by

$$v_t = \sum_{i=1}^{T_w} \tilde{\gamma}_{w_i} h_{w_i}$$

Token Prediction The context vector, treated as the salient content summarized from the sources, is concatenated with the decoder hidden state $s_t$ to produce the vocabulary distribution:

$$P_{\mathrm{vocab}} = \mathrm{softmax}(W_o [s_t ; v_t])$$

We use the weight-sharing strategy between the input embedding matrix and the matrix $W_o$ to reuse linguistic knowledge (Paulus et al., 2018). We further add a copy mechanism as proposed by See et al. (2017).
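A simplified sketch of one decoding step of the two-level attention is shown below; projection matrices, the multi-head structure and the copy mechanism are omitted, and the renormalization mirrors the equations above.

```python
import torch

def two_level_attention(s_t, ents, para_tokens, e_tilde, k=10):
    """s_t: (d,) decoder state; ents: (m, d) entity cluster nodes;
    para_tokens: list of n tensors of shape (T_j, d); e_tilde: (m, n) edge weights."""
    z = torch.softmax(ents @ s_t, dim=0)  # level 1: attention over entity clusters
    beta = z @ e_tilde                    # propagate to paragraphs via graph edges
    top = torch.topk(beta, k).indices     # keep only the k most salient paragraphs
    toks = torch.cat([para_tokens[j] for j in top], dim=0)
    scale = torch.cat([beta[j].repeat(para_tokens[j].size(0)) for j in top])
    # level 2: token attention, rescaled by the paragraph weights and renormalized
    gamma = torch.softmax(toks @ s_t + torch.log(scale + 1e-9), dim=0)
    return gamma @ toks                   # context vector v_t
```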

Training
Our training process follows that of traditional sequence-to-sequence modeling, using maximum likelihood estimation to minimize

$$\mathcal{L}(\theta) = -\sum_{(x, y) \in D} \log p(y \mid x; \theta)$$

where $x$ and $y$ are document-summary pairs from the training set $D$, and $\theta$ denotes the parameters to be learned.
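In implementation terms, this objective is the standard token-level cross-entropy over the reference summary; a minimal sketch, with the label smoothing of 0.1 from our setup folded in:

```python
import torch.nn.functional as F

def mle_loss(logits, target, pad_id):
    """logits: (batch, seq, vocab); target: (batch, seq) gold summary token ids."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1),
                           ignore_index=pad_id, label_smoothing=0.1)
```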

Pre-trained LMs as Document Encoder
Our document encoder, illustrated in Section 3.3, can be replaced by a pre-trained language model such as BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019). On short inputs, pre-trained language models can be more effective than stacked transformer layers trained from scratch. We feed the input tokens to a pre-trained language model and take the last-layer output as token embeddings. A single-layer bidirectional LSTM is then applied over the token embeddings to produce token features. Finally, we perform the same multi-head pooling strategy to obtain paragraph representations.
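A sketch of this replacement encoder with the Hugging Face transformers library is shown below; MultiHeadPooling refers to the pooling sketch in Section 3.3, and the hidden sizes follow our setup.

```python
import torch.nn as nn
from transformers import AutoModel

class PretrainedParagraphEncoder(nn.Module):
    """RoBERTa token embeddings -> BiLSTM token features -> multi-head pooling."""
    def __init__(self, d_hidden: int = 256, n_heads: int = 8):
        super().__init__()
        self.lm = AutoModel.from_pretrained("roberta-base")
        self.lstm = nn.LSTM(self.lm.config.hidden_size, d_hidden // 2,
                            batch_first=True, bidirectional=True)
        self.pool = MultiHeadPooling(d_hidden, n_heads)  # sketched in Section 3.3

    def forward(self, input_ids, attention_mask):
        emb = self.lm(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        feats, _ = self.lstm(emb)                        # (batch, seq, d_hidden)
        return self.pool(feats, mask=~attention_mask.bool())
```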

Experimental Setup
We conduct experiments on two major datasets used in the literature of multi-document summarization, namely WikiSum (Liu et al., 2018) and MultiNews (Fabbri et al., 2019).

Hyperparameters We set the number of vanilla Transformer encoding layers to 6, the hidden size to 256, and the number of heads to 8, with a feed-forward hidden size of 1,024. We truncate input paragraphs and entity clusters to 100 and 50 tokens, respectively. The multi-head pooling layer uses 8 heads. In the graph encoding process, each layer has 8 heads and the hidden size is 256. We select the number of iterations t = 2 based on performance. We use dropout with probability 0.1 before all linear layers and label smoothing (Szegedy et al., 2016) with smoothing factor 0.1. We train our model for 200,000 steps with gradient accumulation every four steps. During decoding we apply beam search with beam size 5 and a length penalty (Wu et al., 2016) with factor 0.4.
For models with pre-trained LMs, we choose the base version of RoBERTa. Following Liu and Lapata (2019b), we employ two Adam optimizers (Kingma and Ba, 2015), one for the pre-trained part and one for the other parts, with β1 = 0.9 and β2 = 0.998. For the pre-trained part, the learning rate and warmup steps are set to 0.002 and 20,000, while for the other parts they are 0.2 and 8,000, respectively.
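The two-optimizer setup can be sketched as follows; the parameter-name prefix used to split the two groups is a hypothetical naming convention, and the schedule is the warmup-then-inverse-square-root rule of Liu and Lapata (2019b).

```python
import torch

def noam_lr(step: int, peak_lr: float, warmup: int) -> float:
    # peak_lr * min(step^-0.5, step * warmup^-1.5): linear warmup, then inverse-sqrt decay
    step = max(step, 1)
    return peak_lr * min(step ** -0.5, step * warmup ** -1.5)

def build_optimizers(model: torch.nn.Module):
    # Split parameters into the pre-trained encoder and everything else
    # ("encoder.lm" is a hypothetical prefix for the RoBERTa submodule).
    lm = [p for n, p in model.named_parameters() if n.startswith("encoder.lm")]
    rest = [p for n, p in model.named_parameters() if not n.startswith("encoder.lm")]
    opt_lm = torch.optim.Adam(lm, lr=0.0, betas=(0.9, 0.998))
    opt_rest = torch.optim.Adam(rest, lr=0.0, betas=(0.9, 0.998))
    return opt_lm, opt_rest

# At each training step, set the learning rates before stepping both optimizers:
#   opt_lm.param_groups[0]["lr"] = noam_lr(step, 0.002, 20000)
#   opt_rest.param_groups[0]["lr"] = noam_lr(step, 0.2, 8000)
```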

Baseline Models
We choose a series of Transformer-based models for comparison due to their excellent performance. Flat Transformer (FT) is a 6-layer encoder-decoder model, in which the title and the ranked paragraphs are concatenated and truncated to 800 tokens. The Transformer Decoder with Memory Compressed Attention (T-DMCA) is proposed by Liu et al. (2018) together with the WikiSum dataset; it uses a Transformer decoder with a convolutional layer that compresses the keys and values in self-attention. Moreover, we choose the Hierarchical Transformer (HT) proposed by Liu and Lapata (2019a), GraphSum proposed by Li et al. (2020), and HeterSumGraph proposed by Wang et al. (2020a) for comparison; these models are introduced in Section 2.

Results on WikiSum
Table 1 summarizes the evaluation results on the WikiSum dataset; '*' indicates results obtained by running the released code, a model name with suffix '+R' means RoBERTa is used, and 'Ext' and 'Abs' mark extractive and abstractive methods, respectively. The first block shows the extractive baselines Lead and LexRank (Erkan and Radev, 2004). The second block shows the results of the abstractive models introduced in Section 4.2, for which we report previously published results. The last block shows the results of several abstractive models, including ours, when fed with the 20 top-ranked paragraphs as input.
The results show that limiting the number of input paragraphs to 20 lowers the ROUGE scores of all models by about 2 points. We believe this is because even the lower-ranked paragraphs can still provide useful information.
Our model EMSum performs the best under the top-20 setting. Compared to the reported results of GraphSum (which uses the top-40 documents), EMSum achieves improvements on ROUGE-2 and ROUGE-L, even though it takes shorter source input. The gap between EMSum and GraphSum on ROUGE-1 is 0.23 (42.40 vs. 42.63). Considering all three metrics together, the results show the effectiveness of our model.
For models combined with pre-trained LMs, the results show that EMSum+RoBERTa further improves the summarization performance over EMSum on all metrics. The improvements over GraphSum+RoBERTa are 0.28 on ROUGE-2 and 0.83 on ROUGE-L, showing the effectiveness of our model even in the presence of pre-trained LMs.

Results on MultiNews
Table 2 summarizes the evaluation results on the MultiNews dataset. Similarly, the first block shows two extractive baselines, LexRank and HeterSumGraph. The second block shows the abstractive methods; we report previously published results for FT, HT and GraphSum. The last block shows the results of our models. We can see that EMSum outperforms GraphSum and EMSum+RoBERTa outperforms GraphSum+RoBERTa. HeterSumGraph is an extractive method, so it achieves a better ROUGE-L score; however, our model still achieves higher ROUGE-1 and ROUGE-2 scores than HeterSumGraph. Overall, the results demonstrate the effectiveness of our model on different types of corpora.

Analysis
We further conduct experiments to analyze the effects of the number of iterations and the number of paragraphs selected for attention. We also conduct ablation studies to validate the effectiveness of different components of our model.

The Number of Iterations
We investigate how the number of iterations $t$ influences the performance of our model. To this end, we conduct experiments on the WikiSum dataset with t = 1, 2, 3, 4. The first block in Table 3 shows the results. Intuitively, the more iterations the graph is updated for, the more information flows across the nodes. However, the results show that t = 3, 4 outperform t = 2 only on ROUGE-L, and the overall performance $\tilde{R}$ fluctuates very little. We conjecture that the performance is limited by the number of additionally introduced parameters. We therefore choose t = 2.
The Number of Paragraphs Selected for Attention At each decoding step, our two-level attention mechanism first computes weights over entity nodes to identify the most salient parts of the source documents. The attention weights over the entire long token sequence may be sparse, so we need to determine how much salient information is enough for our model, namely the proper value of k. We conduct experiments on the WikiSum dataset with k = 5, 10, 15, 20. As the results in the second block of Table 3 show, when k = 5, the number of attended paragraphs is relatively small, which degrades the performance heavily. When k = 20, no cut-off is performed and we only modify the paragraph attention weights with the entity attention weights, so the performance is also reduced. When k = 10, 15, the cut-off strategy works and boosts the performance. Finally, we choose k = 10 because it performs the best.
Ablation Study To validate the effectiveness of individual components, such as the graph encoder module and the two-level attention module, we conduct ablation studies. For the experiment without the graph encoder module, we simply fix the entity cluster and paragraph representations after the multi-head pooling layer. For the experiment without two-level attention, we apply token-level attention directly but additionally attend to the entity cluster representations, which is a naive way to incorporate entity information. Table 4 shows the results, which confirm the effectiveness of the newly introduced modules: incorporating entity information to construct a heterogeneous graph network enables better information flow between text nodes, and the novel two-level attention mechanism indeed plays an important role in the overall effectiveness of our approach.

Human Evaluation
We further employ human evaluation to assess model performance. We randomly sampled 20 document-summary pairs from the WikiSum test set and 20 from the MultiNews test set, and invited 3 participants to assess the outputs of different models independently. Following the criteria used in previous work (Liu and Lapata, 2019a), the evaluation takes three aspects into account: (1) Informativeness: does the summary include salient information from the input documents? (2) Fluency: is the summary fluent and grammatical? (3) Succinctness: does the summary avoid redundant content? We adopted Best-Worst Scaling (Louviere et al., 2015) because it has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017). Annotators are presented with the gold summary and the summaries generated by 3 out of 4 systems, and decide which summary is the best and which is the worst based on the criteria above. The rating of each system is computed as the percentage of times it was chosen as best minus the percentage of times it was selected as worst; ratings range from -1 (worst) to 1 (best). On the WikiSum dataset, we compare FT, T-DMCA, HT and EMSum. On the MultiNews dataset, we compare FT, T-DMCA, HeterSumGraph and EMSum. The results, shown in Table 5, indicate that EMSum generates summaries of higher quality than the other models, further confirming the effectiveness of our proposed approach.

Conclusion
In this paper, we propose an entity-aware multi-document summarization model. We introduce entity nodes in addition to text unit nodes to construct a heterogeneous graph, helping our model capture complicated relations between text units. We also introduce a decoder with a two-level attention mechanism, which first attends to the entity nodes and then utilizes the resulting attention weights to guide the attention to the text units. With this design, our model deals with the problems of saliency and redundancy explicitly. Moreover, like other Transformer-based models, our model can be easily integrated with pre-trained language models for improved results. Experiments on standard datasets show the effectiveness of our model.
In the future, we would like to explore other approaches such as reinforcement learning based methods (Sharma et al., 2019) to further improve summary quality in the context of multi-document summarization. We would also like to apply our method to other tasks such as multi-document question answering (Joshi et al., 2017).