Extractive Summarization Considering Discourse and Coreference Relations based on Heterogeneous Graph

Modeling the relations between text spans in a document is a crucial yet challenging problem for extractive summarization. Various kinds of relations exist among text spans of different granularity, such as discourse relations between elementary discourse units and coreference relations between phrase mentions. In this paper, we propose a heterogeneous graph based model for extractive summarization that incorporates both discourse and coreference relations. The heterogeneous graph contains three types of nodes, each corresponding to text spans of a different granularity. Experimental results on a benchmark summarization dataset verify the effectiveness of our proposed method.


Introduction
Automatic summarization aims to condense the information of the input document into a shorter summary. The task has two main paradigms: extractive summarization and abstractive summarization. Generating summary sentences from scratch, abstractive summarizers can generate concise and flexible summaries. However, they also suffer from the problem of not being able to reproduce factual details correctly (See et al., 2017). On the other hand, extractive summarization aims to select salient text spans (mostly sentences) from the input document. Compared to abstractive summarizers, extractive summarizers have the advantage of being efficient and factually reliable. In this paper, we will focus on extractive summarization.
For extractive summarization, it is crucial to model the relations between text spans throughout the document. Many different kinds of relations exist between text spans of different granularity (Figure 1). For example, coreference relations exist between mention phrases of the same entity, and discourse relations exist between Elementary Discourse Units (EDUs) within a document. To capture inter-sentential relations, some recent works utilize recurrent neural networks (RNNs) or Transformer (Vaswani et al., 2017) based encoders on top of the acquired sentence representations (Cheng and Lapata, 2016; Nallapati et al., 2016; Liu and Lapata, 2019). However, empirical observations show that these sentence-level encoders do not bring much performance gain (Liu and Lapata, 2019). Graph structure is an intuitive way to model long-range dependencies among text spans throughout a document. Early works build connectivity graphs based on content similarity between sentences (Erkan and Radev, 2004; Mihalcea and Tarau, 2004). Some recent works incorporate discourse or coreference relations into the graph structure and utilize graph neural networks (GNNs) to obtain high-level representations of text spans (Yasunaga et al., 2017; Xu and Durrett, 2019; Xu et al., 2020). Most of these works operate on homogeneous graphs with only one type of node, such as the Approximate Discourse Graph (ADG) (Christensen et al., 2013) or the Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) dependency graph. As illustrated in Figure 1, however, various types of relations exist between text spans of different granularity. Thus, homogeneous graphs may not be an ideal way to encode them.
In this paper, we propose a novel heterogeneous graph based model for extractive summarization. Heterogeneous graphs contain multiple node types and/or multiple edge types, in contrast to homogeneous graphs, which have only one node type and one edge type. Heterogeneous graphs have been widely studied and applied to model data structures such as citation networks (Yu et al., 2012) and recommendation systems (Dong et al., 2012). In this work, we use a heterogeneous graph to model the document structure with three types of nodes of different granularity: sentence nodes, EDU nodes, and entity nodes.
We encode both discourse and coreference relations into the graph structure: discourse relations are encoded with the edges between EDU nodes, and coreference relations with edges between EDU nodes and entity nodes. Instead of extracting salient sentences like most existing extractive summarizers, our model extracts salient EDUs. To identify the salient EDUs within a sentence, we add edges between sentence nodes and their constituent EDU nodes. To the best of our knowledge, we are the first to utilize a heterogeneous graph to incorporate multiple types of relations simultaneously for extractive summarization.

Proposed Method
Given an input document D with n EDUs {d_1, d_2, ..., d_n}, we formulate extractive summarization as a sequence labeling problem. The model predicts a sequence of binary labels Y = {y_1, y_2, ..., y_n}, where y_i = 1 indicates that the i-th EDU should be included in the summary. Figure 2 provides an overview of our proposed model. First, a BERT encoder is used to embed the input document D. With the EDU and entity encoders, we acquire the initial node representations of the heterogeneous graph. We then apply a heterogeneous graph encoder to obtain high-level node representations. Finally, we make predictions based on the EDU node representations.

Heterogeneous Graph Construction
We represent each input document D with a heterogeneous graph G = {V, E}, where V and E are the set of nodes and edges, respectively.
Given document D with m sentences {s_1, ..., s_m}, we first segment the sentences into n sub-sentential EDUs {d_1, ..., d_n} and perform RST discourse parsing to identify the relations between the EDUs. In addition, we perform coreference resolution to identify the mentions and the coreference relations between them. The mentions in D are then clustered into k entities {e_1, ..., e_k}, with each entity e_i representing a cluster of mentions among which coreference relations hold.
The set of nodes V contains m sentence nodes V_s, n EDU nodes V_d, and k entity nodes V_e. There are three types of edges in E. First, we use edges between EDU nodes to represent the discourse structure of the document. Similar to Xu et al. (2020), we derive the discourse dependency links between EDU nodes from the RST tree of the document. The discourse dependency links are directional, capturing the dependency relations going from satellite to nucleus EDUs. Second, we use edges between EDU nodes and entity nodes to embed the coreference relations. If EDU d_i contains a mention of entity e_j, we add an undirected edge (d_i, e_j) to E. In this way, each entity indirectly connects all EDUs containing mentions of it. Third, we link each sentence node to its constituent EDU nodes with undirected edges. The proposed heterogeneous document graph thus enables us to efficiently model the various relations between text spans of different sizes: sentences, EDUs, and entity phrases.
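The three edge types above can be sketched as follows. This is an illustrative construction only: the edge-list format, node indexing, and input structures (`sent_of_edu`, `rst_deps`, `entity_mentions`) are assumptions for exposition, not the authors' implementation.

```python
# Sketch of the heterogeneous graph construction described above.
# All data structures here are hypothetical simplifications.

def build_hetero_graph(sent_of_edu, rst_deps, entity_mentions):
    """sent_of_edu[i]  -> sentence index of EDU i
       rst_deps        -> (satellite_edu, nucleus_edu) dependency links
       entity_mentions -> {entity_id: set of EDU indices mentioning it}"""
    edges = []
    # 1) directed discourse edges between EDU nodes (satellite -> nucleus)
    for sat, nuc in rst_deps:
        edges.append((("edu", sat), ("edu", nuc), "discourse"))
    # 2) undirected coreference edges between EDU and entity nodes
    #    (stored as one edge per direction)
    for ent, edus in entity_mentions.items():
        for d in edus:
            edges.append((("edu", d), ("entity", ent), "coref"))
            edges.append((("entity", ent), ("edu", d), "coref"))
    # 3) undirected sentence-EDU membership edges
    for d, s in enumerate(sent_of_edu):
        edges.append((("sent", s), ("edu", d), "member"))
        edges.append((("edu", d), ("sent", s), "member"))
    return edges

# Toy document: 2 sentences, 3 EDUs, 1 entity mentioned in EDUs 0 and 2.
g = build_hetero_graph(
    sent_of_edu=[0, 0, 1],
    rst_deps=[(1, 0), (2, 0)],
    entity_mentions={0: {0, 2}},
)
```

Note how the entity node indirectly connects EDUs 0 and 2 even though no discourse edge links them directly, which is exactly the long-range connectivity the coreference edges are meant to provide.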

Graph Node Initialization
Following the settings in Liu and Lapata (2019), we utilize pretrained BERT (Devlin et al., 2019) to encode the input document D. We insert the 〈CLS〉 and 〈SEP〉 special tokens at the beginning and the end of each sentence s_i, respectively. With the BERT output vectors, we acquire the initial representation of each node in V as follows:

Sentence Representations
For each sentence node s_i, we take the BERT output vector of the 〈CLS〉 token preceding s_i as the sentence node representation sent_i.

EDU Representations
We use a self-attention based EDU encoder to encode each EDU node d_i. Given an EDU d_i with tokens {w_i^j}, we obtain its node representation EDU_i by taking self-attention over the BERT output vectors {h_i^j} of the tokens:
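The pooling equation itself appears to have been lost in extraction. A standard additive self-attention pooling consistent with the description above would be the following, where the weights W_a and w_a are illustrative names, not taken from the paper:

```latex
a_i^j = \frac{\exp\left(w_a^\top \tanh\left(W_a h_i^j\right)\right)}
             {\sum_{j'} \exp\left(w_a^\top \tanh\left(W_a h_i^{j'}\right)\right)},
\qquad
\mathrm{EDU}_i = \sum_j a_i^j \, h_i^j
```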

Entity Representations
The structure of the entity encoder is identical to the EDU encoder. For each entity e i , we consider all mentions of it. By taking self-attention among the BERT output vectors which correspond to tokens of these mentions, we can acquire the entity representation entity i .

Heterogeneous Graph Encoder
We initialize the representation of each node in G with the sentence representations (sent_i), EDU representations (EDU_i), and entity representations (entity_i) acquired in Section 2.2. We apply graph attention networks (GAT) (Veličković et al., 2018) to update the node representations in G. In each iteration, we update the representation h_i of node i with the representations of its neighbors {h_j}, weighted by the attention weights α_ij. An example of the graph attention mechanism is illustrated in Figure 3, where the subgraph around node EDU_1 is highlighted. EDU_1 has five neighbors: a sentence node (sent_1), two EDU nodes (EDU_2, EDU_3), and two entity nodes (entity_1, entity_2). We first calculate the attention weights α across the five neighbors of EDU_1 using Equation 4, then update the node representation of EDU_1 accordingly.
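The GAT update equations (including the Equation 4 the text refers to) are missing from this copy. The standard single-head GAT formulation of Veličković et al. (2018), which the description matches, is:

```latex
e_{ij} = \mathrm{LeakyReLU}\left(a^\top \left[W h_i \,\|\, W h_j\right]\right),
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})},
\qquad
h_i' = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j\Big)
```

Here N(i) denotes the neighbors of node i, W and a are learned parameters, and ‖ is vector concatenation; whether the paper adds modifications to this formulation cannot be recovered from the extracted text.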
Although a single GAT network only considers the first-degree neighbors, we can obtain a higher-level representation for each node in G by updating the node representations for several iterations.

Prediction Layer
We feed the final representations of the EDU nodes (EDU_i) to a prediction layer with sigmoid activation to predict the binary labels. The training loss of the model is the binary cross-entropy loss against the oracle extraction labels.
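The prediction and loss equations are missing here. A standard formulation consistent with the description, with illustrative output parameters w_o and b_o, would be:

```latex
\hat{y}_i = \sigma\left(w_o^\top \mathrm{EDU}_i + b_o\right),
\qquad
\mathcal{L} = -\sum_{i=1}^{n} \Big[\, y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \Big]
```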

Dataset
We evaluated our proposed model on the benchmark CNN/DailyMail dataset (non-anonymized version) (Hermann et al., 2015). We used the standard dataset split, which contains 287,227 / 13,368 / 11,490 documents for the training, validation, and test splits, respectively. We used Stanford CoreNLP (Manning et al., 2014) to split sentences. Further, we used the RST discourse parser proposed by Ji and Eisenstein (2014) for both discourse segmentation and discourse parsing. For coreference resolution, we used the spanBERT-based (Joshi et al., 2020) version of the end-to-end coreference resolver proposed by Lee et al. (2017).
Since the CNN/DailyMail dataset only contains abstractive gold summaries, we have to construct oracle labels heuristically. We obtained the oracle labels on EDU-level with the heuristic algorithm based on ROUGE (Lin, 2004), similar to the one in Liu and Lapata (2019). For each document, we selected up to 5 EDUs.
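Greedy ROUGE-based oracle construction of the kind cited above can be sketched as follows. This is a simplified stand-in, not the authors' code: real implementations score with ROUGE-1/ROUGE-2 against the reference summary, while this sketch uses a plain unigram-recall proxy to stay self-contained.

```python
# Sketch of greedy EDU-level oracle label construction in the spirit of
# Liu and Lapata (2019): repeatedly add the EDU that most improves the
# score against the abstractive reference, stopping at max_edus or when
# no EDU yields further improvement. The scoring function is a unigram
# recall proxy, not real ROUGE.

def greedy_oracle(edus, reference, max_edus=5):
    ref = set(reference.lower().split())

    def recall(selected):
        covered = set()
        for i in selected:
            covered |= set(edus[i].lower().split()) & ref
        return len(covered) / max(len(ref), 1)

    selected, best = [], 0.0
    for _ in range(max_edus):
        gains = [(recall(selected + [i]), i)
                 for i in range(len(edus)) if i not in selected]
        score, idx = max(gains)
        if score <= best:  # no EDU improves the score -> stop early
            break
        selected.append(idx)
        best = score
    return sorted(selected)
```

The greedy stopping rule matters: without it, the oracle would pad the label set with uninformative EDUs up to the cap of 5.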

Experimental Settings
We used the base model of Longformer (Beltagy et al., 2020) to encode the input document. Each document is truncated to 1024 BPE tokens. The hidden size of the EDU encoder and the entity encoder is 128. Based on the evaluation losses on the validation set, we set the number of iterations of the GAT layer to 3. The number of attention heads is set to 8, with each head having a hidden size of 64.

[Table 1 fragment (R-1 / R-2 / R-L) displaced into the text: (Dong et al., 2018) 41.50 / 18.70 / 37.60; NEUSUM (Zhou et al., 2018) 41.59 / 19.01 / 37.98; HIBERT (Zhang et al., 2019) 42.37 / 19.95 / 38.83; HSG (Wang et al., 2020) 42 ...]

During training, we used a batch size of 32. We used the Adam optimizer with β_1 = 0.9 and β_2 = 0.999 and followed the learning rate schedule of Vaswani et al. (2017) with a warm-up of 4000 steps. All models were trained for 50,000 steps. We selected the top-3 checkpoints based on the evaluation losses on the validation set and report their average scores on the test set. Table 1 shows the results on the CNN/DailyMail dataset. The first part contains the LEAD-3 baseline and the Oracle upper bound. The second part of the table includes other sentence based extractive models, and the third part includes other EDU based extractive models. The last row presents the evaluation scores of our proposed model.
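The warm-up schedule of Vaswani et al. (2017) followed in the training setup above can be sketched as below; the model dimension of 768 is an assumption (the base-model hidden size), and the paper may also apply an additional constant scaling factor.

```python
# "Noam" learning-rate schedule from Vaswani et al. (2017):
# linear warm-up for `warmup` steps, then inverse-square-root decay.
# lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)

def noam_lr(step, d_model=768, warmup=4000):
    step = max(step, 1)  # avoid 0^-0.5 at the first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The schedule peaks exactly at the warm-up boundary (step 4000 here) and decays monotonically afterwards, which is why the warm-up length interacts with the effective peak learning rate.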

Results and Analysis
As Table 1 shows, our proposed model outperforms the BertSum(EDU) baseline by a significant margin (0.88/0.78/0.96 on F_1 of R-1/R-2/R-L). Our proposed model also outperforms the BertSum(sent) model and the other sentence based extractive summarization baselines. The proposed model is comparable to the state-of-the-art EDU-extraction model DiscoBERT on the R-1 and R-2 metrics, and outperforms it on the R-L metric.
DiscoBERT incorporates a strict RST-based rule during both the oracle label construction and post-processing stages to ensure discourse consistency. Since the purpose of this paper is to propose a heterogeneous graph based method for modeling text span relations, we leave the question of discourse consistency to future work.

Ablation Study
We conduct an ablation study on the components of our proposed model (Table 3). First, we remove the RST dependency edges between EDU nodes (-discourse). Next, we remove the coreferential edges between EDU nodes and entity nodes (-coref). The results show that discourse information plays an important role in our proposed model, while adding coreference information also yields a performance gain. We also try removing the edges between sentence nodes and their constituent EDU nodes (-sent); however, linking the sentence and EDU nodes does not appear to have a significant impact on model performance.

Qualitative Analysis
We also conduct a qualitative analysis of the proposed model. Since the effectiveness of discourse relations is more straightforward and has been widely studied in previous research, we focus on the role of coreference information in our proposed summarization model. In the heterogeneous document graph, EDUs containing mentions of the same entity are indirectly connected through that entity's node. By analyzing the output of the full proposed model and the model without coreference information (-coref), we found that the two models rank the importance of coreferent EDUs differently. Table 2 illustrates a common pattern among the cases improved by incorporating coreference information: it shows examples of coreferent EDUs and the ranking of their likelihood scores for inclusion in the summary. Comparing the EDU rankings of the full model (Rank_coref) and the model without coreference information (Rank_w/o coref), we argue that the model with coreference information is better at discriminating the important EDUs among all EDUs sharing the same entity.

Related Work Graph based Summarization
Graph based summarization models have been broadly explored. Early works build connectivity document graphs based on inter-sentential similarity (Erkan and Radev, 2004; Mihalcea and Tarau, 2004). With the promising results of graph neural networks (GNNs) (Kipf and Welling, 2017; Veličković et al., 2018), some recent works utilize GNNs to incorporate external knowledge into the model. For instance, Yasunaga et al. (2017) utilize a sentence-level ADG to model discourse and coreference relations. Other works convert the RST tree of the input document into dependency form at either the sentence or the EDU level (Xu and Durrett, 2019; Xu et al., 2020). Most of these models operate on homogeneous graphs with only one type of node. A major disadvantage of homogeneous graphs is that they can embed only one relation type in a single graph, since there is only one node type and one edge type.
Fewer summarization models operate on heterogeneous graphs with different types of nodes. Wei (2012) introduces a heterogeneous graph of sentence, word, and topic nodes, and Wang et al. (2020) also utilizes a heterogeneous graph of sentence and word nodes. However, neither of the above works incorporates external knowledge into the graph.

EDU based Extractive Summarization
Li et al. (2016) illustrate the potential of using EDUs as the extraction unit for summarization. Xu et al. (2020) introduce an end-to-end EDU based extractive summarization model. Using a heuristic based on the RST dependency structure, they enhance the grammaticality and discourse consistency of the extracted summary.

Conclusion
In this paper, we proposed a novel heterogeneous graph based model for extractive summarization. By introducing nodes of different granularity, the heterogeneous graph has the capacity to embed various types of relations between text spans. Experiments on the CNN/DailyMail benchmark dataset demonstrated the effectiveness of our proposed method.