SgSum: Transforming Multi-document Summarization into Sub-graph Selection

Most existing extractive multi-document summarization (MDS) methods score each sentence individually and extract salient sentences one by one to compose a summary, which has two main drawbacks: (1) neglecting both the intra- and cross-document relations between sentences; (2) neglecting the coherence and conciseness of the whole summary. In this paper, we propose a novel MDS framework (SgSum) that formulates the MDS task as a sub-graph selection problem, in which the source documents are regarded as a relation graph of sentences (e.g., a similarity graph or discourse graph) and the candidate summaries are its sub-graphs. Instead of selecting salient sentences, SgSum selects a salient sub-graph from the relation graph as the summary. Compared with traditional methods, our method has two main advantages: (1) the relations between sentences are captured by modeling both the graph structure of the whole document set and the candidate sub-graphs; (2) it directly outputs an integrated summary in the form of a sub-graph, which is more informative and coherent. Extensive experiments on the MultiNews and DUC datasets show that our proposed method brings substantial improvements over several strong baselines. Human evaluation results also demonstrate that our model produces significantly more coherent and informative summaries than traditional MDS methods. Moreover, the proposed architecture has strong transfer ability from single- to multi-document input, which can reduce the resource bottleneck in MDS tasks.


Introduction
Currently, most extractive models treat summarization as a sequence labeling task. They score and select sentences one by one (Zhong et al., 2020). These models (called sentence-level extractors) do not consider the summary as a whole but as a combination of independent sentences. This may cause incoherence and redundancy, and result in a poor summary even if the summary consists of high-scoring sentences. Some works (Wan et al., 2015; Zhong et al., 2020) treat the summary as a whole unit and try to overcome the weakness of sentence-level extractors by using a summary-level extractor. However, these models neglect the intra- and cross-document relations between sentences, which are also beneficial for extracting salient sentences, detecting redundancy and generating coherent summaries. Such relations become even more necessary when the input source documents are longer and more complex, as with multi-document input.

* Equal contribution.
1 Our code and results are available at: https://github.com/PaddlePaddle/Research/tree/master/NLP/EMNLP2021-SgSum
In this paper, we propose a novel MDS framework called SgSum, which formulates the MDS task as a sub-graph selection problem. In our framework, the source documents are regarded as a relation graph of sentences (e.g., a similarity graph or discourse graph) and the candidate summaries are its sub-graphs. In this view, the problem of generating a good summary becomes that of selecting a proper sub-graph. The whole graph structure is modeled to help extract salient information from the source documents, and the sub-graph structures are modeled to reflect the quality of candidate summaries. Moreover, the summary is considered as a whole unit, so SgSum directly outputs the final summary in the form of a sub-graph. By capturing relations between sentences and evaluating the summary as a sub-graph, our framework can generate more informative and coherent summaries than traditional extractive MDS methods.
We evaluate SgSum on two MDS datasets with several types of graphs, all of which significantly improve MDS performance. In addition, the human evaluation results demonstrate that SgSum produces more coherent and informative summaries than traditional MDS methods. Moreover, the experimental results indicate that SgSum transfers well when trained only on single-document data: it performs much better than several strong MDS baselines, including supervised and unsupervised models.

Figure 1: Overview of our sub-graph selection framework (panel labels: Origin Text, Graph Representation, Whole Graph, Sub-graphs; Steps I to III, including "Generate sub-graphs" and "Select Sub-graph"). First, well-established graph construction methods are used to transform the input documents into a graph in which sentences are nodes and semantic links between sentences are edges. Then its sub-graphs are treated as candidate summaries. Finally, we select the best sub-graph as the final summary.
The contributions of our work are as follows:
• We propose a novel framework called SgSum, which transforms the MDS task into a sub-graph selection problem. The framework leverages graphs to capture relations between sentences, and generates more informative and coherent summaries by modeling sub-graph structures.
• Owing to the graph-based multi-document encoder, our framework unifies single-document summarization (SDS) and MDS, and has strong transfer ability from SDS to MDS without any parallel MDS training data. Thus, it can reduce the resource bottleneck in MDS tasks.
• Our model is general to several well-known graph representations. We experiment with similarity, topic and discourse graphs on two benchmark MDS datasets. Results show that SgSum achieves superior performance compared with strong baselines.

Summarization as Sub-graph Selection
Graph structures are effective for modeling relations between sentences, which is essential for selecting interrelated, summary-worthy sentences in extractive summarization. Erkan and Radev (2004) utilize a similarity graph to construct an unsupervised summarization method called LexRank. G-Flow (Christensen et al., 2013) and DISCOBERT (Xu et al., 2020) both use discourse graphs to generate concise and informative summaries. Li et al. (2016) and Li and Zhuge (2019) propose to utilize event relation graphs to represent documents for MDS. However, most existing graph-based summarization methods only consider the graph structure of the source documents. They neglect that a summary is also a graph, whose structure can reflect the quality of the summary. For example, in a similarity graph, if the selected sentences are lexically similar, the summary is probably redundant. And in a discourse graph, if the selected sentences have strong discourse connections, the summary tends to be coherent.
We argue that the graph structure of the summary is as important as that of the source documents. The document graph helps to extract salient sentences, while the summary graph helps to evaluate the quality of the summary. Based on this idea, we propose a novel MDS framework, SgSum, which transforms summarization into a sub-graph selection problem. SgSum captures relations between sentences both in the whole graph structure (source documents) and in the sub-graph structures (candidate summaries). Moreover, in our framework, the summary is viewed as a whole unit in the form of a sub-graph. Thus, SgSum can generate more coherent and informative results than traditional sentence-level extractors.

Figure 1 shows the overview of our framework. First, source documents are transformed into a relation graph by well-known graph construction methods such as similarity graphs and discourse graphs. Sentences are the basic information units and are represented as nodes in the graph, and relations between sentences are represented as edges. For example, a similarity graph can be built based on cosine similarities between tf-idf representations of sentences. Let $G$ denote a graph representation matrix of the input documents, where $G[i][j]$ indicates the tf-idf similarity weight between sentences $S_i$ and $S_j$.
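As an illustration of the similarity graph mentioned above, the following is a minimal sketch in pure Python. The whitespace tokenizer, the smoothed idf formula and the function names (`tfidf_vectors`, `similarity_graph`) are assumptions made for this example and are not taken from the paper or its released code.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one sparse tf-idf vector (dict: term -> weight) per sentence."""
    tokenized = [s.lower().split() for s in sentences]  # assumed tokenizer
    n = len(tokenized)
    # Sentence-level document frequency of each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption): terms found in every sentence keep weight 1.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_graph(sentences):
    """Build the matrix G, where G[i][j] is the tf-idf cosine similarity."""
    vecs = tfidf_vectors(sentences)
    return [[cosine(u, v) for v in vecs] for u in vecs]


sentences = [
    "the council approved the new budget",
    "the new budget was approved on monday",
    "heavy rain flooded the coastal road",
]
G = similarity_graph(sentences)
```

In this toy input the first two sentences share most of their terms, so `G[0][1]` is larger than `G[0][2]`; a topic or discourse graph would fill the same matrix with different edge weights.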

Figure 2: Model architecture of SgSum (components shown include one Transformer per document, Doc1 … DocN, and the Graph Encoder). The graph-based multi-document encoder takes tokenized documents as input and outputs sentence representations after the graph encoding layers. Candidate summaries are modeled by their sub-graph structures in the sub-graph encoder and then scored in a ranking layer.
Formally, the task is to generate the summary $S$ of the document collection given $L$ input sentences $S_1, \ldots, S_L$ and their graph representation $G$.
As Figure 1 shows, if we represent the source documents as a graph, it can be easily observed that sentences will form plenty of different subgraphs. By further modelling the sub-graph structures, we can distinguish the quality of different candidate summaries and finally select the best one. Compared with the whole document graph view, sub-graph view is more appropriate to generate a coherent and concise summary. This is also the key point of our framework. Additionally, important sentences usually build up crucial sub-graphs. So it is a simple but efficient way to generate candidate sub-graphs based on those salient sentences.

Graph-based Multi-document Encoder
In this section, we introduce our graph-based multi-document encoder. It takes a multi-document set as input and represents all sentences using the graph structure. It has three main components: (1) a Hierarchical Transformer, which processes each document independently and outputs the sentence representations; (2) a graph encoding layer, which updates sentence representations by modeling the graph structure of the documents; (3) a graph pooling layer, which generates an overall representation of the source documents. Figure 2 illustrates the overall architecture of SgSum.
Hierarchical Transformer. Most previous works (Cao et al., 2017; Jin et al., 2020; Wang et al., 2017) did not consider the multi-document structure. They simply concatenate all documents together and treat MDS as a special case of SDS with longer input. Others preprocess the multi-document input by truncating lead sentences evenly from each document and then concatenating them together as the MDS input. These preprocessing methods are simple ways to help the model encode multi-document inputs, but they do not make full use of the source document structures. Lead sentences extracted from each document might be similar to each other, resulting in redundancy and incoherence. In this paper, we encode source documents with a Hierarchical Transformer, which consists of several shared-weight single Transformers (Vaswani et al., 2017) that process each document independently. Each Transformer takes a tokenized document as input and outputs its sentence representations. This architecture enables our model to process much longer input.

Graph Encoding. To effectively capture the relations between sentences in source documents, we incorporate explicit graph representations of documents into the neural encoding process via a graph-informed attention mechanism similar to Li et al. (2020). Each sentence can collect information from other related sentences to capture global information from the whole input. The graph-informed attention mechanism extends the vanilla self-attention mechanism to consider the pairwise relations in explicit graph representations as:

$$\alpha_{ij} = \mathrm{softmax}(e_{ij} + R_{ij})$$

where $e_{ij}$ denotes the original self-attention weights between sentences $S_i$ and $S_j$, and $\alpha_{ij}$ denotes the weights adjusted by the graph structure. The key point of the graph-based self-attention is the additional pairwise relation bias $R_{ij}$, which is computed as a Gaussian bias of the weights of the graph representation matrix $G$:

$$R_{ij} = -\frac{(1 - G[i][j])^2}{2\sigma^2}$$

where $\sigma$ denotes the standard deviation that represents the influence intensity of the graph structure.
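The graph-informed attention described above can be sketched as follows. This is a minimal pure-Python illustration that assumes the Gaussian bias takes the form R_ij = -(1 - G[i][j])^2 / (2σ^2), as in the graph-informed attention of Li et al. (2020), and that edge weights lie in [0, 1]; the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def graph_informed_attention(e, G, sigma):
    """Adjust raw attention scores e (L x L) with a Gaussian graph bias.

    G holds edge weights assumed to lie in [0, 1]; sigma controls how strongly
    the graph structure influences the attention distribution.
    """
    n = len(e)
    alpha = []
    for i in range(n):
        # Assumed bias form: R_ij = -(1 - G[i][j])^2 / (2 * sigma^2).
        biased = [
            e[i][j] - (1.0 - G[i][j]) ** 2 / (2.0 * sigma ** 2) for j in range(n)
        ]
        alpha.append(softmax(biased))
    return alpha


# Uniform raw scores, so any difference in alpha comes from the graph bias.
e = [[0.0, 0.0, 0.0] for _ in range(3)]
G = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
alpha = graph_informed_attention(e, G, sigma=0.5)
```

With uniform raw scores, sentence 0 attends more to the strongly connected sentence 1 than to the weakly connected sentence 2. A larger `sigma` flattens the bias and moves the result back toward vanilla self-attention.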
After the graph-informed attention mechanism, a two-layer feed-forward network with ReLU activation and a high-way layer normalization are applied. These three components form the graph encoding layers.

Graph Pooling. In the MDS task, the amount of information is larger and the relations between sentences are much more complex, so it is necessary to have an overview of the central meaning of the multi-document input. Zhong et al. (2020) generate a document representation with Siamese-BERT to guide the training and inference process. In this paper, based on the graph representation of documents, we apply a multi-head weighted-pooling operation to capture the global semantic information of the source documents. It takes the sentence representations in the source graph as input and outputs an overall representation of them (denoted as $D$), which provides global information about the documents for both the sentence and summary selection processes. Let $x_i$ denote the graph representation of sentence $S_i$. For each head $z \in \{1, \ldots, n_{head}\}$, we first transform $x_i$ into attention scores $a_i^z$ and value vectors $b_i^z$, then we calculate an attention distribution $\hat{a}_i^z$ over all sentences in the source graph based on the attention scores:

$$a_i^z = W_a^z x_i, \qquad b_i^z = W_b^z x_i, \qquad \hat{a}_i^z = \frac{\exp(a_i^z)}{\sum_{j=1}^{L} \exp(a_j^z)}$$

We next apply a weighted summation with another linear transformation and layer normalization to obtain the vector $\mathrm{head}_z$ for the source graph. Finally, we concatenate all heads and apply a linear transformation to obtain the global representation $D$:

$$\mathrm{head}_z = \mathrm{LayerNorm}\Big(W_c^z \sum_{i=1}^{L} \hat{a}_i^z b_i^z\Big), \qquad D = W_d \big[\mathrm{head}_1 \,\|\, \cdots \,\|\, \mathrm{head}_{n_{head}}\big]$$

where $W_a^z$, $W_b^z$, $W_c^z$ and $W_d$ are weight matrices, and $\|$ denotes the concatenation operator.
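The multi-head weighted pooling can be sketched as below. This is an illustrative pure-Python version under several assumptions: each head uses a scalar attention score per sentence, layer normalization has no learned gain or bias, and the weights are random placeholders rather than trained parameters. All names are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def layer_norm(v, eps=1e-6):
    """Layer normalization without learned gain and bias (a simplification)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def multi_head_pooling(X, heads, W_d):
    """Pool sentence vectors X into one global vector D.

    heads is a list of (w_a, W_b, W_c) tuples, one per head: w_a maps a
    sentence vector to a scalar score, W_b to a value vector, and W_c
    transforms the pooled value vector.
    """
    head_vectors = []
    for w_a, W_b, W_c in heads:
        scores = [sum(a * xi for a, xi in zip(w_a, x)) for x in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        attn = [v / total for v in exps]  # attention distribution over sentences
        values = [matvec(W_b, x) for x in X]
        pooled = [
            sum(a * v[k] for a, v in zip(attn, values)) for k in range(len(values[0]))
        ]
        head_vectors.append(layer_norm(matvec(W_c, pooled)))
    concat = [c for h in head_vectors for c in h]  # concatenate all heads
    return matvec(W_d, concat)


rng = random.Random(0)
d, d_head, n_head = 4, 2, 2


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


heads = [
    (rand_matrix(1, d)[0], rand_matrix(d_head, d), rand_matrix(d_head, d_head))
    for _ in range(n_head)
]
W_d = rand_matrix(d, n_head * d_head)
X = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
D = multi_head_pooling(X, heads, W_d)
```

A quick sanity check of the design: pooling several copies of the same sentence vector gives the same `D` as pooling that vector once, because the attention weights sum to one.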
Based on the graph-based multi-document encoder, our model can process much longer input than traditional summarization models. Furthermore, our model can treat SDS and MDS as similar tasks in the unified sub-graph selection framework.

Select from Graph
Sub-graph Encoder. As mentioned in Section 2, the sub-graph structure can reflect the quality of candidate summaries. A sub-graph with similar nodes indicates a redundant summary, and a sub-graph with unconnected nodes indicates an incoherent summary. We therefore apply a sub-graph encoder, which has the same architecture as the graph encoder, to model each sub-graph. We then score each sub-graph in a sub-graph ranking layer and select the best sub-graph as the final summary.

Sub-graph Ranking Layer. During training, we first calculate the ROUGE score of each sentence against the gold summary. We then select the top-K scoring sentences and form candidate summaries from combinations of them. The sentences in each candidate summary form a sub-graph of the source document graph.
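The candidate generation step can be sketched as follows: keep the top-K sentences by score, enumerate all combinations of a fixed size, and cut the matching sub-matrix out of the document graph. The function name, the fixed sub-graph size and the toy scores are assumptions for illustration; in training the scores would be sentence-level ROUGE values.

```python
from itertools import combinations


def candidate_subgraphs(scores, G, num_candidates, subgraph_size):
    """Enumerate candidate sub-graphs built from the top-scoring sentences.

    Returns a list of (nodes, sub_matrix) pairs, where nodes is a tuple of
    sentence indices in document order and sub_matrix holds the edge weights
    among exactly those sentences.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = sorted(top[:num_candidates])  # restore document order
    candidates = []
    for nodes in combinations(top, subgraph_size):
        sub = [[G[i][j] for j in nodes] for i in nodes]
        candidates.append((nodes, sub))
    return candidates


# Toy example: 5 sentences, keep the 4 best, form sub-graphs of 3 nodes.
scores = [0.9, 0.1, 0.8, 0.7, 0.6]
G = [[1.0 if i == j else 0.1 * (i + j) for j in range(5)] for i in range(5)]
candidates = candidate_subgraphs(scores, G, num_candidates=4, subgraph_size=3)
```

The number of candidates is the binomial coefficient C(K, n), so the enumeration stays cheap as long as the sub-graph size is close to K; here C(4, 3) gives four candidates.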
Our framework is optimized according to two principles. First, a good summary should represent the central meaning of the source documents, which means that a good sub-graph should also represent the whole graph. Specifically, the global document representation $D$, which reflects the overall meaning of the source documents, should be semantically similar to the gold summary. We use a greedy method (Nallapati et al., 2017) to extract an oracle summary (composed of source sentences) with the largest ROUGE score with respect to the abstractive reference summary. The sentences in the oracle summary are then considered gold summary sentences, which also form a sub-graph. Let $C^*$ denote the gold summary. The similarity between $C^*$ and $D$ is measured by $f(D, C^*) = \mathrm{cosine}(D, C^*)$, which forms the following summary-level loss:

$$L_{sum1} = 1 - f(D, C^*)$$

Furthermore, we design a pairwise margin loss over all candidate summaries, similar to Zhong et al. (2020). We sort all candidate summaries in descending order of their ROUGE scores with the gold summary. All candidate summaries are also represented as sub-graphs by the sub-graph encoder. Naturally, a candidate pair with a larger ranking gap should have a larger margin, which is the second principle behind our loss function:

$$L_{sum2} = \max\big(0,\; f(D, C_j) - f(D, C_i) + (j - i)\,\gamma\big), \quad i < j$$

where $C_i$ represents the candidate summary ranked $i$ and $\gamma$ is a hyperparameter used to distinguish between good and bad candidate summaries. $L_{sum1}$ and $L_{sum2}$ together compose the summary-level loss function. Additionally, we adopt a traditional binary cross-entropy loss between candidate sentences and oracles to learn more accurate sentence and summary representations.
In this sentence-level loss, the label $y_i \in \{0, 1\}$ indicates whether sentence $S_i$ should be a summary sentence. Finally, the overall training loss combines the summary-level loss with the sentence-level cross-entropy loss.

During inference, a multi-document set contains hundreds of sentences, so thousands of sub-graphs would need to be considered. To overcome this difficulty, we adopt a greedy strategy: we first select several salient sentences as candidate nodes and then form candidate sub-graphs from combinations of them. As important sentences usually build up crucial sub-graphs, this is a simple way to generate candidate sub-graphs. We then calculate the cosine similarity between each sub-graph and the global document representation $D$ in the sub-graph ranking layer, and select the sub-graph with the highest score as the final summary. Thus, our model can be viewed as a sub-graph selection framework, which selects a proper sub-graph from the whole graph.
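The training objectives and the inference-time ranking described in this section can be sketched together, since both rely on the same cosine score f(D, C). This is a hedged illustration: the 1 - cosine form of the summary-level loss and the rank-scaled margin follow the MatchSum-style formulation of Zhong et al. (2020), summing over all pairs is an assumption, and all function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity f(D, C) between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def summary_loss(D, gold):
    """Assumed form of the first summary-level term: 1 - f(D, C*)."""
    return 1.0 - cosine(D, gold)


def pairwise_margin_loss(D, ranked_candidates, gamma):
    """Margin loss over candidates sorted by descending ROUGE.

    A pair (i, j) with i < j is penalized when the lower-ranked candidate j
    is not at least (j - i) * gamma less similar to D than candidate i.
    """
    sims = [cosine(D, c) for c in ranked_candidates]
    loss = 0.0
    for i in range(len(sims)):
        for j in range(i + 1, len(sims)):
            loss += max(0.0, sims[j] - sims[i] + (j - i) * gamma)
    return loss


def sentence_bce(probs, labels, eps=1e-12):
    """Binary cross-entropy between predicted probabilities and oracle labels."""
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    )


def select_best(D, candidates):
    """Inference: index of the candidate representation most similar to D."""
    return max(range(len(candidates)), key=lambda k: cosine(D, candidates[k]))


D = [1.0, 0.0]
ranked = [[1.0, 0.1], [1.0, 1.0], [0.0, 1.0]]  # toy vectors, best candidate first
loss_good_order = pairwise_margin_loss(D, ranked, gamma=0.01)
loss_bad_order = pairwise_margin_loss(D, ranked[::-1], gamma=0.01)
best = select_best(D, ranked)
```

When the similarity scores already agree with the ROUGE ranking by the required margins, the pairwise term is zero; reversing the order makes it positive, which is the signal that pushes better candidates closer to `D` during training.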
Furthermore, the graph structure can help to reorder the sentences in the summary to obtain a more coherent summary (Christensen et al., 2013). We order the summary by placing sentences with discourse relations next to each other.

Experimental Setup
Graph types. We experiment with three well-established graph representations: similarity graph, topic graph and discourse graph. (1) The similarity graph is built from tf-idf cosine similarities between sentences to capture lexical relations. (2) The topic graph is built with the LDA topic model (Blei et al., 2003) to capture topic relations. The edge weights are cosine similarities between the topic distributions of sentences. (3) The discourse graph is built to capture discourse relations based on discourse markers (e.g., however, moreover), co-reference and entity links, as in Christensen et al. (2013). Other types of graphs can also be used in our model. In our experiments, unless explicitly stated, we use the similarity graph by default, as it is the most widely used in previous work.

We use a pre-trained model to initialize our models in all experiments. The optimizer is Adam (Kingma and Ba, 2014) with β1=0.9 and β2=0.999, and the learning rate is 0.03 for MultiNews and 0.015 for DUC. We apply learning rate warmup over the first 10,000 steps and decay as in (Kingma and Ba, 2014). Gradient clipping with a maximum gradient norm of 2.0 is also used during training. All models are trained on 4 GPUs (Tesla V100) for about 10 epochs. We apply dropout with probability 0.1 before all linear layers. The number of hidden units in our models is set to 256, the feed-forward hidden size is 1,024, and the number of heads is 8. The numbers of Transformer encoding layers and graph encoding layers are set to 6 and 2, respectively. As mentioned in Section 3.2, during inference we select several salient candidate nodes to build sub-graphs, and the number of nodes in a sub-graph is determined by the average number of sentences in the gold summary. For MultiNews, we use 10 candidate nodes and 9 sub-graph nodes; for DUC, we use 7 candidate nodes and 5 sub-graph nodes.

Evaluation Results
We evaluate our models on both the MultiNews and DUC datasets to validate their effectiveness on different types of corpora. Summarization quality is evaluated using ROUGE F1 (Lin, 2004). We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) between system summaries and gold references as a means of assessing informativeness, and the longest common subsequence (ROUGE-L) as a means of assessing fluency (ROUGE options: -n 2 -m -w 1.2 -c 95 -r 1000 -l 250).

Table 1 summarizes the evaluation results on MultiNews. Several strong extractive and abstractive baselines are evaluated and compared with our models. The first block in the table shows the results of extractive methods: LexRank (Erkan and Radev, 2004), MMR (Carbonell and Goldstein, 1998), HeterGraph and MatchSum (Zhong et al., 2020), the previous extractive SOTA model on the MultiNews dataset. The second block shows the results of abstractive methods: PG (Lebanoff et al., 2018), Hi-MAP (Fabbri et al., 2019), FT (Flat Transformer) and GraphSum (Li et al., 2020), the previous abstractive SOTA model. We report their results following Zhong et al. (2020) and Li et al. (2020). The last block shows the results of SgSum. Compared with both the previous extractive and abstractive SOTA models, SgSum achieves improvements of more than 1.1/1.2/0.9 on R-1, R-2 and R-L, which demonstrates the effectiveness of our sub-graph selection framework. Furthermore, owing to our graph representation and graph-based multi-document encoder, our model can unify the single- and multi-document summarization tasks. In our framework, a single document can be treated in the same way as multi-document input, so our model can be enhanced by feeding it extra single-document training data. In the last block, "extra" means that we leverage CNN/DM data as supplementary training data.

For the DUC dataset, the compared baselines include LexRank (Erkan and Radev, 2004), DPP (Kulesza and Taskar, 2011) and Sim-DPP (Cho et al., 2019), with results reported following Cho et al. (2019). We also report the results of SubModular (Lin and Bilmes, 2010), StructSVM (Sipos et al., 2012) and PG (See et al., 2017) as strong baselines. The last block shows the results of our models. The results indicate that SgSum consistently outperforms most baselines, which further demonstrates the effectiveness of our model on different types of corpora. Additionally, we test the performance of SgSum-extra, which adds CNN/DM data as a supplement. It is comparable to the Sim-DPP baseline, which also uses extra CNN/DM data to train a similarity model. The results again show that single-document data greatly improves the performance of our model.

Transfer Performances
It is well known that deep neural networks have recently achieved great improvements on the SDS task (Liu and Lapata, 2019b; Zhong et al., 2020; Li et al., 2018a,b). However, such supervised models do not work well on the MDS task because parallel data for multi-document summarization are scarce and costly to obtain. For example, the DUC dataset contains only tens of parallel MDS examples. There is a pressing need for an end-to-end model that is trained on single-document data but works well with multi-document input. In this section, we conduct further experiments to verify the transfer ability of our model from the single- to the multi-document task. We follow the experimental setup of Lebanoff et al. (2018), and compare with several strong baseline models: (1) BERTSUMEXT (Liu and Lapata, 2019b), an extractive method with a pre-trained language model; (2) PG-MMR (Lebanoff et al., 2018), an encoder-decoder model which exploits the maximal marginal relevance method to select representative sentences; (3) Extract+Rewrite, a recent approach that scores sentences using LexRank and generates a title-like summary for each sentence using an encoder-decoder model. We report baseline results following Lebanoff et al. (2018).

Table 3 and Table 4 show the results on MultiNews and DUC2004, respectively. In Tables 3 and 4, the second blocks are transfer models, which are trained only on SDS data and tested directly on MDS data. BERTSUMEXT, PG-MMR and SgSum are trained on CNN/DM, while Extract+Rewrite is trained on Gigaword. The results show that our model achieves better performance than several strong unsupervised models. Furthermore, when trained only on SDS data, SgSum transfers much better than the three baselines in the second block of Table 4. The above evaluation results on the MultiNews and DUC datasets both validate the effectiveness of our model. The sub-graph selection framework greatly improves MDS performance and shows a powerful transfer ability, which can reduce the resource bottleneck in MDS.

Analysis
We further analyze the effects of graph types on our model and validate the effectiveness of different components of our model by ablation studies.

Effects of Graph Types
We compare the results of the similarity graph, topic graph and discourse graph on the MultiNews test set. The comparison in Figure 3 shows that the discourse graph achieves the best performance on all metrics, which demonstrates that graphs with richer relations are more helpful for MDS.

Table 5 summarizes the results of the ablation studies, which aim to validate the effectiveness of each individual component of our model. "w/o graph enc" denotes removing the graph-based multi-document encoder and encoding the source input by concatenating all documents into a single sequence. "w/o subgraph enc" and "w/o subgraph rank" denote removing the sub-graph encoder and the sub-graph ranking layer, respectively. "w/o all" denotes removing all graph components, which reduces the model to the BERTSUMEXT baseline. The experimental results confirm that our framework, which transforms the MDS task into sub-graph selection, is effective (see "w/o subgraph enc" and "w/o subgraph rank"). In addition, incorporating an explicit graph structure (see "w/o graph enc") also helps to process long input and results in better MDS performance.

Human Evaluation
In addition to the automatic evaluation, we also assess system performance by human evaluation. We use DUC2004 as the human evaluation set, and invite 2 annotators to assess the outputs of different models independently. We use Cohen's kappa (Cohen, 1960) to calculate the inter-annotator agreement. Annotators assess the overall quality of summaries by ranking them according to the following criteria: (1) Informativeness: is the main meaning expressed in the source documents preserved in the summary? (2) Coherence: is the summary coherent between sentences and well-formed? Annotators were asked to rank all systems from 1 (best) to 4 (worst). Systems receive scores of 2, 1, -1 and -2 for ranks 1, 2, 3 and 4, respectively. The rating of each system is computed by averaging its scores over all test instances. Results for the four systems are presented in Table 6. They demonstrate that SgSum is rated the best on both informativeness and coherence. Regarding the overall ratings, the summaries generated by SgSum are frequently ranked the best, significantly outperforming the other models. The human evaluation results further validate the effectiveness of our proposed sub-graph selection framework.
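The rating scheme above can be made concrete with a short sketch that maps ranks to scores, averages them per system, and computes Cohen's kappa between two annotators. The system names and rankings are made-up placeholders, not results from the paper, and treating each assigned rank as a categorical label for kappa is an assumption about how agreement was measured.

```python
RANK_TO_SCORE = {1: 2, 2: 1, 3: -1, 4: -2}


def system_ratings(rankings):
    """Average score per system over all test instances.

    rankings is a list of dicts, one per instance, mapping a system name to
    its rank (1 = best, 4 = worst).
    """
    collected = {}
    for instance in rankings:
        for system, rank in instance.items():
            collected.setdefault(system, []).append(RANK_TO_SCORE[rank])
    return {s: sum(v) / len(v) for s, v in collected.items()}


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical rankings of four placeholder systems on two test instances.
rankings = [
    {"sys_a": 1, "sys_b": 2, "sys_c": 3, "sys_d": 4},
    {"sys_a": 2, "sys_b": 1, "sys_c": 3, "sys_d": 4},
]
ratings = system_ratings(rankings)

# Ranks assigned by two hypothetical annotators to the same eight summaries.
annotator_1 = [1, 2, 3, 4, 1, 2, 3, 4]
annotator_2 = [1, 2, 3, 4, 2, 1, 3, 4]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Because the score mapping is symmetric around zero, a system that is ranked first as often as last averages out to 0, and a positive rating means it is ranked in the top half more often than in the bottom half.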

Graph-based Summarization
Most previous graph-based extractive MDS approaches aim to extract salient textual units from documents based on graph representations of sentences. Erkan and Radev (2004) introduce LexRank to compute sentence importance based on eigenvector centrality in the connectivity graph of inter-sentence cosine similarity. Christensen et al. (2013) build multi-document graphs to identify pairwise ordering constraints over the sentences by accounting for discourse relationships between sentences. More recently, Yasunaga et al. (2017) build on the approximate discourse graph model and account for macro-level features in sentences to improve sentence salience prediction. Yin et al. (2019) also propose a graph-based neural sentence ordering model, which utilizes an entity linking graph to capture the global dependencies between sentences. Li et al. (2020) incorporate explicit graph representations into the neural architecture through a novel graph-informed self-attention mechanism. It is the first work to effectively combine graph structures with an abstractive MDS model. Wu et al. (2021) present BASS, a novel framework for Boosting Abstractive Summarization based on a unified Semantic graph, which aggregates co-referent phrases distributed across a long range of context and conveys rich relations between phrases. However, these works only consider the graph structure of the source documents and neglect the graph structures of summaries, which are also important for generating coherent and informative summaries.

Sentence or Summary-level Extraction
Extractive summarization methods usually produce a summary by selecting some original sentences from the document set with a sentence-level extractor. Early models employ rule-based methods to score and select sentences (Lin and Hovy, 2002; Lin and Bilmes, 2011; Takamura and Okumura, 2009; Schilder and Kondadadi, 2008). More recently, SUMMARUNNER (Nallapati et al., 2017) adopts an encoder based on recurrent neural networks and is the earliest neural summarization model. SUMO capitalizes on the notion of structured attention to induce a multi-root dependency tree representation of the document. However, all these models are sentence-level extractors, which select high-scoring sentences individually and may introduce redundancy (Narayan et al., 2018).
Different from the above studies, some works focus on summary-level selection. Wan et al. (2015) optimize summarization performance directly based on the characteristics of summaries and rank summaries directly during inference. Bae et al. (2019), Paulus et al. (2017) and Celikyilmaz et al. (2018) use reinforcement learning to globally optimize summary-level performance. Recent studies (Alyguliyev, 2009; Galanis and Androutsopoulos, 2010; Zhang et al., 2019) have attempted to build two-stage document summarization systems. The first stage usually extracts some fragments of the original text, and the second stage selects or modifies content on the basis of these fragments. Mendes et al. (2019) follow the extract-then-compress paradigm to train an extractor for content selection. Zhong et al. (2020) propose a novel extract-then-match framework, which employs a sentence extractor to prune unnecessary information and then outputs a summary using matching models. These methods consider the summary as a whole rather than as individual sentences. However, they neglect the relations between sentences during both scoring and selection.

From Single to Multi-document
Recent neural network summarization models focus on SDS due to the large parallel datasets automatically harvested from online news websites including Gigaword (Rush et al., 2017), CNN/DM (Hermann et al., 2015), NYT (Sandhaus, 2018) and Newsroom (Grusky et al., 2018). However, MDS has not yet fully benefited from the development of neural network models, because parallel data for MDS are scarce and costly to obtain.
A promising route to generating a summary from multi-document input is to apply a model trained for SDS to a "mega-document" (Lebanoff et al., 2018) created by concatenating all documents together. Nonetheless, such a model may not be well suited to the task, for two reasons. First, identifying important text pieces in a mega-document can be challenging for a model trained on single-document data, where the summary-worthy content is often contained in the first few sentences. This is not the case for a mega-document. Second, redundant text pieces in a mega-document can be repeatedly used for summary generation under the current framework. Lebanoff et al. (2018) present a novel adaptation model, named PG-MMR, to generate summaries from multi-document inputs. However, it still treats MDS data as a mega-document. In contrast, our model unifies SDS and MDS through graph representations, and achieves strong performance when transferring from SDS to MDS.

Conclusion
We propose a novel framework, SgSum, which transforms the MDS task into a sub-graph selection problem. SgSum captures the relations between sentences by modeling both the graph structure of the whole document set and the candidate sub-graphs, and then directly outputs an integrated summary in the form of a sub-graph, which is more informative and coherent. Experimental results on two MDS datasets show that SgSum brings substantial improvements over several strong baselines. Moreover, the proposed architecture has strong transfer ability from single- to multi-document input, which can reduce the resource bottleneck in MDS tasks.