Multiplex Graph Neural Network for Extractive Text Summarization

Extractive text summarization aims at extracting the most representative sentences from a given document as its summary. To extract a good summary from a long text document, sentence embedding plays an important role. Recent studies have leveraged graph neural networks to capture the inter-sentential relationship (e.g., the discourse graph) within the documents to learn contextual sentence embeddings. However, those approaches neither consider multiple types of inter-sentential relationships (e.g., semantic similarity and natural connection relationships), nor model intra-sentential relationships (e.g., semantic similarity and syntactic relationships among words). To address these problems, we propose a novel Multiplex Graph Convolutional Network (Multi-GCN) to jointly model different types of relationships among sentences and words. Based on Multi-GCN, we propose a Multiplex Graph Summarization (Multi-GraS) model for extractive text summarization. Finally, we evaluate the proposed models on the CNN/DailyMail benchmark dataset to demonstrate the effectiveness of our method.


Introduction
Numerous documents from a variety of sources are uploaded to the Internet or to databases every day, such as news articles (Hermann et al., 2015), scientific papers (Qazvinian and Radev, 2008) and electronic health records (Jing et al., 2019). How to effectively digest this overwhelming amount of information has always been a fundamental question in natural language processing (Nenkova and McKeown, 2011). This question has sparked research interest in the task of extractive text summarization, which aims to generate a short summary of a document by extracting the most representative sentences from it.
Despite the effectiveness of the existing methods, two problems remain under-explored. Firstly, the graphs constructed in existing studies involve only one type of edge, whereas sentences are often associated with each other via multiple types of relationships (referred to as the multiplex graph in the literature (De Domenico et al., 2013; Jing et al., 2021a)). Two sentences with some common keywords are considered to be naturally connected (we refer to this type of graph as the natural connection graph). For example, in Figure 1, the first and the last sentences exhibit a natural connection (green) via the shared keyword "City". Although the two sentences are far away from each other, they can be jointly considered as part of the summary since the entire document is about the keyword "City". However, such a relation can barely be captured by traditional encoders such as RNNs and CNNs. Two sentences sharing similar meanings are also considered to be connected (we refer to this type of graph as the semantic graph). In Figure 1, the second and the third sentences are semantically similar since they express a similar meaning (yellow). The semantic similarity graph maps semantically similar sentences into the same cluster and thus helps the model select sentences from different clusters, which could improve the coverage of the summary. Different relationships provide relational information from different aspects, and jointly modeling different types of edges will improve the model's performance (Park et al., 2020; Jing et al., 2021b; Yan et al., 2021; Jing et al., 2021c). Secondly, the aforementioned methods fall short in taking advantage of the valuable relational information among words.
Both the syntactic relationship (Tai et al., 2015; He et al., 2017) and the semantic relationship among words (Kenter and De Rijke, 2015; Wang et al., 2020b; Varelas et al., 2005; Wang et al., 2021) have been proven useful for downstream tasks such as text classification (Kenter and De Rijke, 2015; Jing et al., 2018), information retrieval (Varelas et al., 2005) and text summarization.
We summarize our contributions as follows:
• To exploit multiple types of relationships among sentences and words, we propose a novel Multiplex Graph Convolutional Network (Multi-GCN).
• Based on Multi-GCN, we propose a Multiplex Graph based Summarization (Multi-GraS) framework for extractive text summarization.
• We evaluate our approach and competing methods on the CNN/DailyMail benchmark dataset, and the results demonstrate our models' effectiveness and superiority.

Methodology
We first present Multi-GCN to jointly model different relations, and then present the Multi-GraS approach for extractive text summarization.

Multiplex Graph Convolutional Network
Figure 1: An example document. There are two different relationships among sentences: the semantic similarity (yellow) and the natural connection (green). Sentences 2, 3, 21 are the oracle sentences.
Figure 2c illustrates Multi-GCN over a multiplex graph with initial node embedding $X$ and a set of relations $R$. Firstly, Multi-GCN learns node embeddings $H_r$ of different relations $r \in R$ separately, and then combines them to produce the final embedding $H$. Secondly, Multi-GCN employs two types of skip connections, the inner and the outer skip connections, to mitigate the over-smoothing (Li et al., 2018) and the vanishing gradient problems of the original GCN (Kipf and Welling, 2016). More specifically, we propose a Skip-GCN with an inner skip connection to extract the embeddings $H_r$ for each relation. In the updating functions for the $l$-th layer of the Skip-GCN, $A_r$ is the adjacency matrix for the relation $r$, and $W_r^{(l)}$ and $b_r^{(l)}$ denote the weight and bias. Note that $H_r^{(0)} = X$ is the initial embedding, and $H_r$ is the output after all Skip-GCN layers.
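As a sketch only, one form of the Skip-GCN update that is consistent with the definitions above is given below. The ReLU nonlinearity, the intermediate symbol $\hat{H}_r^{(l+1)}$, and the additive form of the inner skip connection are assumptions rather than details stated in the text.

```latex
\begin{align}
\hat{H}_r^{(l+1)} &= \mathrm{ReLU}\left(A_r H_r^{(l)} W_r^{(l)} + b_r^{(l)}\right), \\
H_r^{(l+1)} &= \hat{H}_r^{(l+1)} + H_r^{(l)}.
\end{align}
```

Under this form, each layer adds its own input back to its output, so the signal from $H_r^{(0)} = X$ is carried through every layer.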
Next, we combine the embeddings of different relations $\{H_r\}_{r \in R}$ through the project block in Figure 2c: the embeddings are concatenated (cat denotes the concatenation operation) and projected with the weight $W$ and bias $b$ of the project block. Finally, we use an outer skip connection to directly connect $X$ with $H$.
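The flow described in this subsection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ReLU activation, the additive form of both skip connections, the linear project block, and all function and variable names are hypothetical choices made for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def skip_gcn(A, X, weights, biases):
    """Stack of GCN layers, each with an additive inner skip connection (assumed form)."""
    H = X
    for W, b in zip(weights, biases):
        # Propagate over the graph, transform, then add the layer input back.
        H = relu(A @ H @ W + b) + H
    return H

def multi_gcn(adjs, X, params, W_proj, b_proj):
    """adjs: relation -> (n, n) adjacency; params: relation -> (weights, biases)."""
    per_relation = [skip_gcn(adjs[r], X, *params[r]) for r in sorted(adjs)]
    # Concatenate the per-relation embeddings and apply the project block.
    H = np.concatenate(per_relation, axis=1) @ W_proj + b_proj
    # Outer skip connection (assumed additive) from the initial embedding X.
    return H + X

rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.normal(size=(n, d))
adjs = {"semantic": np.abs(X @ X.T), "syntactic": np.eye(n)}
params = {
    r: ([0.1 * rng.normal(size=(d, d)) for _ in range(2)], [np.zeros(d) for _ in range(2)])
    for r in adjs
}
W_proj = 0.1 * rng.normal(size=(len(adjs) * d, d))
H = multi_gcn(adjs, X, params, W_proj, np.zeros(d))
print(H.shape)  # (4, 3)
```

With additive skip connections, all hidden dimensions must match the input dimension, which agrees with fixing a single hidden size throughout the model.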

The Multi-GraS model
An overview of the proposed Multi-GraS is illustrated in Figure 2a. Multi-GraS comprises three major components: the word block, the sentence block, and the sentence selector. The word block and the sentence block share a similar "Initialization - Multi-GCN - Readout" structure to extract the sentence and document embeddings. The sentence selector picks the most representative sentences as the summary based on the extracted embeddings.

The Word Block
The architecture of a word block is illustrated in Figure 2b. Given a sentence $s_m$ with $N$ words $\{w_n\}_{n=1}^{N}$, the word block takes the pre-trained word embeddings $\{e_{w_n}\}_{n=1}^{N}$ as inputs and produces the sentence embedding $e_{s_m}$. Specifically, the Initialization module produces contextualized word embeddings $\{x_{w_n}\}_{n=1}^{N}$ via a Bi-LSTM. The Multi-GCN module jointly captures multiple relations for $\{x_{w_n}\}_{n=1}^{N}$ and produces $\{h_{w_n}\}_{n=1}^{N}$. The Readout module produces the sentence embedding $e_{s_m}$ by max pooling over $\{h_{w_n}\}_{n=1}^{N}$. In this paper, we jointly consider the syntactic and semantic relations among words. For the syntactic relation, we use a dependency parser to construct a syntactic graph $A_{syn}$: if two words $w_n$ and $w_{n'}$ have a dependency link between them, then $A_{syn}[n, n'] = 1$; otherwise $A_{syn}[n, n'] = 0$. For the semantic relation, we use the absolute value of the dot product between the embeddings of words to construct the graph: $A_{sem_w}[n, n'] = |x_{w_n}^{T} \cdot x_{w_{n'}}|$. Note that we use the absolute value since GCN (Kipf and Welling, 2016) requires the values in the adjacency matrix to be non-negative.
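The two word-level graphs can be built as in the following sketch. The dependency edges here are hand-written stand-ins for parser output, the function names are hypothetical, and treating a dependency link as undirected is an assumption about how "a link between them" is encoded.

```python
import numpy as np

def syntactic_graph(num_words, dependency_edges):
    """Binary adjacency built from (head, dependent) word-index pairs."""
    A = np.zeros((num_words, num_words))
    for head, dep in dependency_edges:
        A[head, dep] = 1.0
        A[dep, head] = 1.0  # assumption: the link is treated as undirected
    return A

def semantic_graph(X):
    """A[n, n'] = |x_n^T x_n'| for row-wise word embeddings X."""
    return np.abs(X @ X.T)

# Toy contextualized embeddings for three words (illustrative values only).
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0]])
A_syn = syntactic_graph(3, [(1, 0), (1, 2)])
A_sem = semantic_graph(X)
print(A_sem)
```

In the toy example, the first two embeddings point in opposite directions, so their dot product is negative; taking the absolute value keeps that entry of the adjacency matrix non-negative.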

The Sentence Block
Given a document with $M$ sentences $\{s_m\}_{m=1}^{M}$, the sentence block takes the sentence embeddings $\{e_{s_m}\}_{m=1}^{M}$ as inputs and generates a document embedding $d$ through a Bi-LSTM, a Multi-GCN and a pooling module. Essentially, the architecture of the sentence block resembles that of the word block, so we only elaborate on the construction of graphs for sentences.
In this paper, we consider the natural connection and the semantic relation between sentences. The semantic similarity between $s_m$ and $s_{m'}$ is the absolute value of the dot product between $x_{s_m}$ and $x_{s_{m'}}$, and thus the semantic similarity graph $A_{sem_s}$ can be constructed by $A_{sem_s}[m, m'] = |x_{s_m}^{T} \cdot x_{s_{m'}}|$. For the natural connection, if two sentences share a common keyword, then we consider them to be naturally connected. Such a relation helps to cover more sections of a document by connecting faraway sentences (not necessarily semantically similar) via their shared keywords, as shown in Figure 1.
In the definition of the natural connection graph, $\mathrm{tfidf}_{(s_m, w)}$ is the tfidf score of the keyword $w$ within $s_m$, and $W$ is the set of keywords.
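A natural connection graph driven by shared keywords might be sketched as follows. The weighting rule used here (summing both sentences' tfidf scores over their shared keywords) is an assumption chosen for illustration, and the function name and input layout are hypothetical; the text only establishes that the graph depends on per-sentence tfidf scores of keywords.

```python
def natural_connection_graph(tfidf_per_sentence):
    """tfidf_per_sentence: list of dicts mapping keyword -> tfidf score in that sentence."""
    M = len(tfidf_per_sentence)
    A = [[0.0] * M for _ in range(M)]
    for m in range(M):
        for k in range(M):
            shared = tfidf_per_sentence[m].keys() & tfidf_per_sentence[k].keys()
            # Assumed weighting: add both sentences' scores for every shared keyword.
            A[m][k] = sum(
                tfidf_per_sentence[m][w] + tfidf_per_sentence[k][w] for w in shared
            )
    return A

# Toy scores: sentences 0 and 2 share the keyword "city"; sentence 1 shares nothing.
scores = [{"city": 0.5, "goal": 0.25}, {"coach": 0.5}, {"city": 0.25}]
A_nat = natural_connection_graph(scores)
print(A_nat[0][2], A_nat[0][1])  # 0.75 0.0
```

Sentences without a shared keyword get a zero entry, so the graph links distant sentences only when a keyword ties them together.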

Sentence Selector
The sentence selector first scores the sentences $\{s_m\}_{m=1}^{M}$ and then selects the top-K sentences as the summary. The model design for scoring the sentences follows the human reading comprehension strategy (Pressley and Afflerbach, 1995; Luo et al., 2019), which contains a reading process and a post-reading process. The reading process extracts the rough meaning of $s_m$. The post-reading process further captures the auxiliary contextual information, namely the document embedding $e_d$ and the initial sentence embedding $e_{s_m}$. The final score for $s_m$ is computed with the sigmoid activation $\sigma(\cdot)$. When ranking the sentences $\{s_m\}_{m=1}^{M}$, we follow Paulus et al. (2018), Liu and Lapata (2019b) and Wang et al. (2020a) and use the tri-gram blocking technique to reduce redundancy.
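The top-K selection with tri-gram blocking can be sketched as below. The scores are placeholders for the selector's outputs, the function names are hypothetical, and returning the chosen sentences in document order is an assumption made for the example.

```python
def trigrams(tokens):
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_top_k(sentences, scores, k):
    """sentences: list of token lists; returns selected indices in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, seen = [], set()
    for i in ranked:
        if len(chosen) >= k:
            break
        grams = trigrams(sentences[i])
        if grams & seen:
            continue  # skip a sentence that repeats a tri-gram already in the summary
        chosen.append(i)
        seen |= grams
    return sorted(chosen)

sentences = [
    "the city won the match".split(),
    "the city won the title".split(),
    "fans celebrated all night".split(),
]
print(select_top_k(sentences, [0.9, 0.8, 0.5], k=2))  # [0, 2]
```

The second sentence has the second-highest score but shares the tri-gram "the city won" with the first, so it is skipped in favour of the third.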

Datasets
We evaluate our proposed model on the benchmark CNN/DailyMail (Hermann et al., 2015) dataset. This dataset is a combination of the CNN and DailyMail datasets, and contains 287,227, 13,368 and 11,490 articles for training, validation and testing, respectively.
For the DailyMail dataset (Hermann et al., 2015), the news articles were collected from the DailyMail website. Each article contains a story and highlights, which are treated as the document and the summary, respectively. The dataset contains 219,506 articles, which are split into 196,961/12,148/10,397 for training, validation and testing.
For the CNN dataset (Hermann et al., 2015), the news articles were collected from the CNN website. Each article comprises a story and highlights, where the story is treated as the document and the highlights are treated as the summary. The CNN dataset contains 92,579 articles in total, of which 90,266 are used for training, 1,220 for validation and 1,093 for testing.
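As a quick consistency check, the per-dataset splits reported above add up to the combined CNN/DailyMail figures. The dictionary layout below is only an illustrative way to hold the numbers.

```python
splits = {
    "CNN": {"train": 90_266, "valid": 1_220, "test": 1_093},
    "DailyMail": {"train": 196_961, "valid": 12_148, "test": 10_397},
}

# Sum each partition across the two sources, and each source across its partitions.
combined = {part: sum(s[part] for s in splits.values()) for part in ("train", "valid", "test")}
totals = {name: sum(s.values()) for name, s in splits.items()}

print(combined)  # {'train': 287227, 'valid': 13368, 'test': 11490}
print(totals)    # {'CNN': 92579, 'DailyMail': 219506}
```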

Implementation Details
The vocabulary size is fixed at 50,000, and the pre-trained GloVe embeddings (Pennington et al., 2014) are used for the input word embeddings. For both the word block and the sentence block, the Initialization modules employ two-layer Bi-LSTMs. The Multi-GCN modules use two-layer Skip-GCNs. We fix all the hidden dimensions at 300. We use Stanford CoreNLP to extract syntactic graphs. For the natural connection graphs, we filter out stop words, punctuation, and words whose document frequency is less than 100. During training, we use the Adam optimizer (Kingma and Ba, 2014), and the learning rates for the CNN, DailyMail, and CNN/DailyMail datasets are set to 0.0001, 0.0005, and 0.0005, respectively. When generating summaries, we select the top-2 and top-3 sentences for the CNN and DailyMail datasets, respectively.
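The keyword filtering used for the natural connection graphs can be sketched as follows. The function name, the input layout (one flat token list per document) and the toy stop-word list are hypothetical; only the three filtering criteria and the threshold of 100 come from the text.

```python
import string
from collections import Counter

def keyword_vocabulary(tokenized_docs, stop_words, min_doc_freq=100):
    """Keep words that are not stop words or punctuation and occur in enough documents."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each word once per document
    return {
        w
        for w, df in doc_freq.items()
        if df >= min_doc_freq
        and w not in stop_words
        and not all(ch in string.punctuation for ch in w)
    }

# Toy corpus with a lowered threshold so the effect is visible.
docs = [["the", "city", ","], ["the", "city", ",", "won"], ["a", "goal"]]
print(keyword_vocabulary(docs, stop_words={"the", "a"}, min_doc_freq=2))  # {'city'}
```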

Oracle Label Extraction
The summaries of the documents are the highlights of the news written by human experts; hence, sentence-level labels are not provided. Given a document and its summary, we follow Wang et al. (2020a), Liu and Lapata (2019b), Mendes et al. (2019) and Narayan et al. (2018) to identify the set of sentences (or oracles) of the document that has the highest ROUGE scores with respect to its summary.
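One common way to obtain such oracle labels is a greedy search, sketched below. Both the greedy strategy and the unigram-overlap F1 used as a stand-in for ROUGE are assumptions made for illustration; the text does not specify the search procedure, and the function names and the `max_sentences` parameter are hypothetical.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Simplified stand-in for a ROUGE score: F1 over clipped unigram matches."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(candidate), overlap / len(reference)
    return 2 * p * r / (p + r)

def greedy_oracle(sentences, summary, max_sentences=3):
    """Add the sentence that most improves the score until no sentence helps."""
    selected, best = [], 0.0
    while len(selected) < max_sentences:
        candidates = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            tokens = [t for j in sorted(selected + [i]) for t in sentences[j]]
            candidates.append((unigram_f1(tokens, summary), i))
        if not candidates:
            break
        score, idx = max(candidates)
        if score <= best:
            break  # stop when no remaining sentence improves the score
        selected.append(idx)
        best = score
    return sorted(selected)

sentences = [
    "the city won".split(),
    "fans went home".split(),
    "the coach praised the team".split(),
]
summary = "city won coach praised team".split()
print(greedy_oracle(sentences, summary))  # [0, 2]
```

The selected indices then serve as positive labels for training the sentence selector, with all other sentences labelled negative.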

Evaluation Metrics
We evaluate the quality of the summaries using the ROUGE scores (Lin, 2004), including R-1, R-2 and R-L, which measure the unigram, bigram and longest common subsequence overlap between the generated summary and the ground-truth summary. In addition to automatic evaluation via ROUGE, we follow Luo et al. (2019) and Wu and Hu (2018) and conduct a human evaluation to score the quality of the generated summaries.
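The three overlap measures can be illustrated with the simplified sketch below. It computes F1 over clipped n-gram matches and over the longest common subsequence of two token lists; it omits the stemming, tokenization and sentence-level handling of the official ROUGE toolkit, so its values are not directly comparable to reported scores. All function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(match, cand_total, ref_total):
    if match == 0:
        return 0.0
    p, r = match / cand_total, match / ref_total
    return 2 * p * r / (p + r)

def rouge_n(candidate, reference, n):
    c, r = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection clips each n-gram count to the smaller of the two sides.
    return f1(sum((c & r).values()), sum(c.values()), sum(r.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))

cand = "the city won the match".split()
ref = "the city won".split()
print(rouge_n(cand, ref, 1), rouge_n(cand, ref, 2), rouge_l(cand, ref))
```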

Overall Performance
The ROUGE (Lin, 2004) scores of all comparison methods are presented in Table 1. Among the baseline methods, HSG achieves the highest performance, which indicates that considering graph structures could improve performance. We also observe that Multi-GraS outperforms all of the comparison methods, achieving increases of 0.21/0.38/0.26 on the R-1/R-2/R-L scores.

Ablation Study
Firstly, as shown in Table 2, tri-gram blocking and contextual information within the sentence selector help improve the model's performance. Then we study the influence of the Multi-GCN within the sentence block and the word block separately. To do so, we remove the Multi-GCN within the sentence block (Multi-GraS$_{\text{word}}$) and further remove the Multi-GCN within the word block (LSTM). By comparing LSTM, Multi-GraS$_{\text{word}}$ and Multi-GraS, it can be observed that Multi-GCN in both the sentence and word blocks significantly improves the performance. Next, we study the influence of the components within Multi-GCN. Table 2 indicates that the inner and outer skip connections play an important role in Multi-GCN. Besides, jointly considering different relations is always better than considering one relation alone.
Finally, for the Initialization module in the word and sentence blocks, LSTM performs better than Transformer (Vaswani et al., 2017).

Human Evaluation
We randomly select 50 documents along with the summaries obtained by HSG, Multi-GraS, Multi-GraS$_{\text{word}}$ and LSTM, as well as the oracle summaries. Three volunteers (proficient in English) rank the summaries from 1 to 5 in terms of overall quality, coverage and non-redundancy. The human evaluation results are presented in Table 3: the oracle ranks the highest, and Multi-GraS ranks higher than HSG.

Sensitivity Experiments
To examine how performance varies with the number of selected sentences, we conduct a sensitivity experiment on both the CNN and DailyMail datasets. The results in Figure 3 show that Multi-GraS performs best when the number of selected sentences is 2 for the CNN dataset and 3 for the DailyMail dataset.

Conclusion
In this paper, we propose a novel Multi-GCN to jointly model multiple relationships among words and sentences. Based on Multi-GCN, we propose a novel Multi-GraS model for extractive text summarization. Experimental results on the benchmark CNN/DailyMail dataset demonstrate the effectiveness of the proposed methods.