Contrastive Hierarchical Discourse Graph for Scientific Document Summarization

The extended structural context of scientific papers makes their summarization a challenging task. This paper proposes CHANGES, a contrastive hierarchical graph neural network for extractive scientific paper summarization. CHANGES represents a scientific paper as a hierarchical discourse graph and learns effective sentence representations with dedicated hierarchical graph information aggregation. We also propose a graph contrastive learning module to learn global theme-aware sentence representations. Extensive experiments on the PubMed and arXiv benchmark datasets demonstrate the effectiveness of CHANGES and the importance of capturing hierarchical structure information in modeling scientific papers.


Introduction
Extractive document summarization aims to extract the most salient sentences from the original document and form the summary as an aggregate of these sentences. Compared to abstractive summarization approaches, which suffer from hallucination problems (Kryściński et al., 2019; Zhang et al., 2022b), summaries generated in an extractive manner are more fluent, faithful, and grammatically accurate, but may lack coherence across sentences. Recent advances in deep neural networks and pre-trained language models (Devlin et al., 2018; Lewis et al., 2019) have led to significant progress in single document summarization (Nallapati et al., 2016a; Narayan et al., 2018; Liu and Lapata, 2019; Zhong et al., 2020). However, these methods mainly focus on short documents such as news articles in CNN/DailyMail (Hermann et al., 2015) and New York Times (Sandhaus, 2008), and struggle with relatively long documents such as scientific papers.
The challenges of lengthy scientific paper summarization lie in several aspects. First, the extended input context hinders cross-sentence relation modeling, the critical step of extractive summarization (Wang et al., 2020). Sequential models like RNNs are incapable of capturing long-distance dependencies between sentences and struggle to differentiate salient sentences from others. Furthermore, scientific papers tend to cover diverse topics and contain rich hierarchical discourse structure information. They generally follow a standard discourse structure of problem definition, methodology, experiments and analysis, and conclusions (Xiao and Carenini, 2019), and their internal hierarchy of sections, paragraphs, sentences, and words is too complex for sequential models to capture. Moreover, the lengthy input context also makes the widely adopted self-attention Transformer-based models (Vaswani et al., 2017) inapplicable. The input length of a scientific paper can range from 2,000 to 7,000 words, which exceeds the input limit of the Transformer due to the quadratic computational complexity of self-attention. Thus, sparse Transformer models like BigBird (Zaheer et al., 2020) and Longformer (Beltagy et al., 2020) have been proposed.
Recently, researchers have also turned to graph neural networks (GNNs) as an alternative approach. Graph neural networks have been demonstrated to be effective at tasks with rich relational structure and can preserve global structure information (Yao et al., 2019; Xu et al., 2019; Zhang and Zhang, 2020). By representing a document as a graph, GNNs update and learn sentence representations through message passing and turn extractive summarization into a node classification problem. Among all attempts, one popular approach is to construct cross-sentence similarity graphs (Erkan and Radev, 2004; Zheng and Lapata, 2019), which use sentence representation cosine similarity as edge weights to model cross-sentence semantic relations. Xu et al. (2019) proposed using Rhetorical Structure Theory (RST) trees and coreference mentions to capture cross-sentence discourse relations. Wang et al. (2020) proposed constructing a word-document heterogeneous graph by using words as the intermediary between sentences. Despite their success, how to construct an effective graph to capture the hierarchical structure of academic papers remains an open question.
To address the above challenges, we propose CHANGES (Contrastive HierArchical Graph neural network for Extractive Summarization), a hierarchical graph neural network model that fully exploits the section structure of scientific papers. CHANGES first constructs a sentence-section hierarchical graph for a scientific paper, and then learns hierarchical sentence representations through dedicated information aggregation with iterative intra-section and inter-section message passing. Inspired by recent advances in contrastive learning (Liu and Liu, 2021; Chen et al., 2020), we also propose a graph contrastive learning module to learn global theme-aware sentence representations and provide fine-grained discriminative information. The local sentence and global section representations are then fused for salient sentence prediction. We validate CHANGES with extensive experiments and analyses on two scientific paper summarization datasets. Experimental results demonstrate the effectiveness of our proposed method. Our main contributions are as follows:
• We propose a hierarchical graph-based model for long scientific paper extractive summarization. Our method utilizes the hierarchical discourse structure of scientific documents and learns effective sentence representations with iterative intra-section and inter-section sentence message passing.
• We propose a plug-and-play graph contrastive module to provide fine-grained discriminative information.The graph contrastive module learns global theme-aware sentence representations by pulling semantically salient neighbors together and pushing apart unimportant sentences.Note that the module could be added to any extractive summarization system.
• We validate our proposed model on two benchmark datasets (arXiv and PubMed), and the experimental results demonstrate its effectiveness over strong baselines.
Related Work

Extractive Summarization on Scientific Papers
Despite the strong performance of recent neural network models (Zhou et al., 2018; Zhang et al., 2023a,b; Fonseca et al., 2022) and pre-trained language models (Liu and Lapata, 2019; Lewis et al., 2019) on news summarization, progress on long documents such as scientific papers is still limited.
Traditional approaches to summarizing scientific articles rely on supervised machine learning algorithms such as LSTMs (Collins et al., 2017) with surface features such as sentence position and section category. Recently, Xiao and Carenini (2019) proposed a neural method that incorporates both the global context of the whole document and the local context within the current topic with an encoder-decoder model. Ju et al. (2021) designed an unsupervised extractive approach to summarize long scientific documents based on the Information Bottleneck principle. Dong et al. (2020) proposed an unsupervised ranking model that incorporates a two-level hierarchical graph representation and asymmetrical positional cues to determine sentence importance. Recent works also apply pre-trained sparse language models such as Longformer to model long documents (Beltagy et al., 2020; Ruan et al., 2022; Cho et al., 2022).

Graph-based Summarization
Graph models have been widely applied to extractive summarization due to their capability of modeling cross-sentence relations within a document. The sparse nature of graph structure also brings scalability and flexibility, making it a good fit for long documents: graph neural networks' memory costs are generally linear in the input size, compared to the quadratic self-attention mechanism.
Researchers have explored supervised graph neural network methods for summarization (Cui and Hu, 2021; Jia et al., 2020; Huang and Kurohashi, 2021; Xie et al., 2022; Phan et al., 2022). Yasunaga et al. (2017) first proposed to use a Graph Convolutional Network (GCN) on the approximate discourse graph. Xu et al. (2019) then applied GCNs to structural discourse graphs based on RST trees and coreference mentions. Recently, Wang et al. (2020) proposed constructing a word-document heterogeneous graph by using words as the intermediary between sentences. Zhang et al. (2022a) proposed to use a hypergraph to capture high-order sentence relations within the document. Our paper follows this line of work but incorporates hierarchical graphs for scientific paper discourse structure modeling and graph contrastive learning for theme-aware sentence representation learning.

Figure 1: The overall model architecture of CHANGES. We first construct a hierarchical graph for an input document, and then learn representations with a graph contrastive module and hierarchical graph layers. The concatenated representations of a sentence node and its section node are fused for summary sentence selection.

Method
Given a document D = {s_1, s_2, ..., s_n} with n sentences and m sections, we first represent it as a hierarchical graph and formulate extractive summarization as a node labeling task. The objective is to predict a label y_i ∈ {0, 1} for each sentence, where y_i = 1 indicates that the i-th sentence should be included in the summary and y_i = 0 that it should not.
The overall model architecture of CHANGES is shown in Figure 1. CHANGES consists of two modules: a graph contrastive learning module that learns global theme-aware sentence representations, and a hierarchical graph layer module that learns hierarchical graph node representations with iterative message passing. The learned sentence node and section node representations are then used as indicators for salient sentence selection.

Graph Construction
Given an academic paper D, we first construct a hierarchical graph G = (V, E), where V stands for the node set and E represents the edges between nodes. In order to utilize the sentence-section hierarchical structure of academic papers, the undirected hierarchical graph G contains both sentence nodes and section nodes, defined by V = V^sen ∪ V^sec, where each sentence node v^sen_i ∈ V^sen represents a sentence s_i in the document D and each section node v^sec_j ∈ V^sec represents one section of the document. The edge set of G is defined as E = E^sen ∪ E^sec ∪ E^cross, where E^sen denotes connections between sentence nodes within the same section, E^sec denotes connections between section nodes, and E^cross denotes cross-connections between a sentence node and its corresponding section node. Note that we also add a special section supernode v_D that represents the whole document D. An illustration of the hierarchical graph is shown in Figure 2.
Edge Connection Unlike prior work (Zheng and Lapata, 2019; Dong et al., 2020) that uses the cosine similarity of sentence semantic representations as edge weights, we construct unweighted hierarchical graphs to disentangle structural information (the adjacency matrix A) from semantic information (the node representations H). In other words, connected nodes have weight 1 and disconnected nodes have weight 0 in the adjacency matrix A.
Formally, a sentence-level edge e^sen_{i,j} connects sentence nodes v^sen_i and v^sen_j if they are within the same section, aiming to aggregate local intra-section information. All section nodes are fully connected by section-level edges e^sec_{p,q}, aiming to aggregate global inter-section information. A cross-level edge e^cross_{i,p} connects the sentence node v^sen_i to its corresponding section node v^sec_p, which allows message passing between sentence-level and section-level nodes.
In a hierarchical graph, a sentence node can only directly interact with local neighbor nodes within the same section, and interacts with sentence nodes of other sections indirectly via section-level node connections.

Node Representation We adopt BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) as the sentence encoder to embed the semantic meanings of sentences {s_1, s_2, ..., s_n} as initial node representations X = {x_1, x_2, ..., x_n}. Note that BERT here is only used for initial sentence embedding and is not updated during training, to reduce computing cost and increase efficiency.
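As a concrete sketch of this construction, the unweighted adjacency matrix can be assembled from the per-section sentence counts alone; the function name and node ordering below are our own illustration, not from the paper:

```python
import numpy as np

def build_hierarchical_graph(section_sizes):
    """Build the 0/1 adjacency matrix of the sentence-section hierarchical
    graph. Nodes are ordered: [sentences..., sections..., document node]."""
    n = sum(section_sizes)             # number of sentence nodes
    m = len(section_sizes)             # number of section nodes
    N = n + m + 1                      # +1 for the document supernode v_D
    A = np.zeros((N, N), dtype=int)

    start = 0
    for j, size in enumerate(section_sizes):
        sec = n + j                    # index of section node j
        for i in range(start, start + size):
            for k in range(start, start + size):
                if i != k:
                    A[i, k] = 1        # E^sen: intra-section sentence edges
            A[i, sec] = A[sec, i] = 1  # E^cross: sentence <-> its section
        start += size

    # E^sec: section nodes (plus the document supernode) fully connected
    for p in range(n, N):
        for q in range(n, N):
            if p != q:
                A[p, q] = 1
    return A
```

For a paper with two sections of 2 and 3 sentences, this yields an 8-node graph: 5 sentence nodes, 2 section nodes, and the document supernode.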
In addition to the semantic representation of sentences, we also inject positional encoding following the Transformer (Vaswani et al., 2017) to preserve sequential order information. We apply the hierarchical position embedding of Ruan et al. (2022) to model sentence positions in accordance with our hierarchical graph. Specifically, the position of each sentence s_i is represented in two parts: its section index p^sec_i, and its sentence index within the section p^sen_i. Formally, the hierarchical position embedding (HPE) of sentence s_i is calculated as:

HPE(s_i) = PE(p^sec_i) + PE(p^sen_i),    (1)

where PE(·) refers to the position encoding function in (Vaswani et al., 2017):

PE(pos, 2i) = sin(pos / 10000^{2i/d}),    (2)
PE(pos, 2i + 1) = cos(pos / 10000^{2i/d}).    (3)

Overall, we obtain the initial sentence node representations H^0_sen = {h^0_1, h^0_2, ..., h^0_n}, with each vector h^0_i ∈ R^d defined as:

h^0_i = x_i + HPE(s_i),    (4)

where d is the dimension of the node embedding.
The initial section node representation h^0_j for the j-th section is the mean of its connected sentence embeddings, and the document node representation h^0_doc ∈ R^d is the mean of all section node embeddings.
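A minimal sketch of the hierarchical position embedding, assuming (as one common formulation) that the HPE sums the sinusoidal encodings of the section index and the within-section sentence index:

```python
import numpy as np

def positional_encoding(pos, d):
    """Sinusoidal position encoding from Vaswani et al. (2017):
    even dims use sin, odd dims use cos."""
    pe = np.zeros(d)
    i = np.arange(d // 2)
    pe[0::2] = np.sin(pos / 10000 ** (2 * i / d))
    pe[1::2] = np.cos(pos / 10000 ** (2 * i / d))
    return pe

def hierarchical_position_embedding(sec_idx, sen_idx, d):
    # sum of the section-level and within-section sentence-level encodings
    return positional_encoding(sec_idx, d) + positional_encoding(sen_idx, d)
```

The resulting vector is added to the BERT sentence embedding x_i to form the initial node representation h^0_i.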

Graph Contrastive Module
After constructing the hierarchical graph with adjacency matrix A and node representations H^0_sen ∈ R^{n×d}, we apply a graph contrastive learning (GCL) module to capture global context information. Motivated by the principle that a good summary sentence should be more semantically similar to the source document than unqualified sentences (Radev et al., 2004; Zhong et al., 2020), our GCL module updates sentence representations using a Graph Attention Network (Veličković et al., 2017) with a contrastive objective to learn global theme-aware sentence representations. Note that the module could be added to any extractive summarization system.
Graph Attention Network Given a constructed graph G = (V, E) with node representations H and adjacency matrix A, a GAT layer updates a node v_i with representation h_i by:

z_{i,j} = LeakyReLU(W_a [W_in h_i ∥ W_in h_j]),
α_{i,j} = exp(z_{i,j}) / Σ_{k∈N_i} exp(z_{i,k}),
h'_i = σ( Σ_{j∈N_i} α_{i,j} W_v h_j ),    (5)

where N_i denotes the 1-hop neighbors of node v_i, α_{i,j} denotes the attention weight between nodes h_i and h_j, W_in, W_a, W_v are trainable weight matrices, and ∥ denotes the concatenation operation.
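The single-head update above can be sketched in plain NumPy; self-loops, multi-head attention, and exact weight shapes are simplified, and all names are illustrative:

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gat_layer(H, A, W_in, W_a, W_v):
    """Single-head graph attention update in the style of
    Velickovic et al. (2017). H: (N, d) features; A: (N, N) 0/1 adjacency."""
    N = H.shape[0]
    out = np.zeros((N, W_v.shape[0]))
    for i in range(N):
        nbrs = np.where(A[i] == 1)[0]
        # attention logit from the concatenated transformed node pair
        logits = np.array([
            leaky_relu(W_a @ np.concatenate([W_in @ H[i], W_in @ H[j]]))
            for j in nbrs
        ])
        alpha = softmax(logits)
        # attention-weighted aggregation of neighbor values
        out[i] = sum(a * (W_v @ H[j]) for a, j in zip(alpha, nbrs))
    return out
```

In practice a node usually attends over itself as well, which here amounts to adding self-loops to A before calling the layer.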
The above single-head graph attention is further extended to multi-head attention, where T independent attention mechanisms are conducted and their outputs are concatenated as:

h'_i = ∥_{t=1}^{T} σ( Σ_{j∈N_i} α^t_{i,j} W^t_v h_j ).    (6)

Contrastive Loss Contrastive learning aims to learn effective representations by pulling semantically close neighbors together and pushing apart non-neighbors (Marelli et al., 2014), and recent work has demonstrated its effectiveness for high-order representation learning (Chen et al., 2020; Gao et al., 2021). We therefore optimize our GCL module with a contrastive objective: the goal is to learn theme-aware sentence embeddings by pulling semantically salient sentences toward the document representation and pushing apart less salient ones. The contrastive loss is formally defined as:

L_c = - Σ_{i: y_i=1} log [ exp(sim(h'_i, h'_D) / τ) / Σ_{j=1}^{n} exp(sim(h'_j, h'_D) / τ) ],    (7)

where sim(·, ·) denotes cosine similarity, h'_D is the document node embedding, h'_i is the updated representation of sentence s_i, and τ is the temperature factor.
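A sketch of this objective, assuming an InfoNCE-style formulation in which ground-truth summary sentences serve as positives against the document node (the paper's exact loss may differ in detail):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def graph_contrastive_loss(H, h_doc, labels, tau=0.1):
    """InfoNCE-style objective: summary-worthy sentences (labels == 1)
    are pulled toward the document node, the rest are pushed away.
    H: (n, d) updated sentence embeddings; h_doc: (d,) document embedding."""
    sims = np.array([cosine(h, h_doc) / tau for h in H])
    log_denom = np.log(np.exp(sims).sum())
    positives = np.where(np.asarray(labels) == 1)[0]
    # average negative log-likelihood of the salient sentences
    return float(np.mean([log_denom - sims[i] for i in positives]))
```

The loss decreases as positive sentences become more similar to the document node relative to the other sentences, which is exactly the "pull salient, push unimportant" behavior described above.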
After passing through the GCL module, the learned global theme-aware sentence embeddings H^c_sen = {h^c_1, h^c_2, ..., h^c_n} ∈ R^{n×d} are passed to the hierarchical graph layer module.

Hierarchical Graph Layer
To exploit the section structure of academic papers, CHANGES then updates sentence and section node representations with hierarchical graph layers in an iterative manner.
The hierarchical graph layer first updates sentence embeddings with the local neighbor sentences within the same section via GAT-based intra-section message passing, then updates section nodes from sentence nodes for cross-level information aggregation to exploit the hierarchical structure of academic papers. Next, inter-section message passing allows global context information interaction. Finally, the sentence nodes are updated based on their corresponding section nodes, fusing both local and global context information.
Formally, each iteration contains four update processes: intra-section message passing, sentence-to-section aggregation, inter-section message passing, and finally section-to-sentence aggregation. For the l-th iteration, the process can be represented as:

H'_sen = GAT(H^l_sen, E^sen),    (intra-section)
H'_sec = GAT({H'_sen, H^l_sec}, E^cross),    (sentence-to-section)
H^{l+1}_sec = GAT(H'_sec, E^sec),    (inter-section)
H^{l+1}_sen = MLP([H'_sen ∥ H^{l+1}_sec]),    (section-to-sentence)

where H'_sen, H'_sec denote the intermediate representations of sentence and section nodes, H^{l+1}_sen, H^{l+1}_sec denote the updated sentence and section node representations, and [H'_sen ∥ H^{l+1}_sec] denotes the concatenation of the intermediate sentence node representation and its corresponding updated section node representation.
In this way, CHANGES updates and learns hierarchy-aware sentence embeddings through the hierarchical graph layers.
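One iteration of the four update steps can be sketched as follows; for brevity, a neighbor-mean aggregation stands in for the GAT layers, so this illustrates the information flow rather than the exact model:

```python
import numpy as np

def hierarchical_layer(H_sen, section_of):
    """One hierarchical-layer iteration with mean aggregation in place of
    GAT. H_sen: (n, d) sentence features; section_of[i] = section of s_i."""
    H_sen = np.asarray(H_sen, dtype=float)
    section_of = np.asarray(section_of)
    n, m = len(H_sen), section_of.max() + 1
    # 1) intra-section message passing over same-section sentences
    H_sen_mid = np.stack([
        H_sen[section_of == section_of[i]].mean(axis=0) for i in range(n)
    ])
    # 2) sentence-to-section aggregation
    H_sec_mid = np.stack([
        H_sen_mid[section_of == j].mean(axis=0) for j in range(m)
    ])
    # 3) inter-section message passing across all section nodes
    H_sec_new = np.tile(H_sec_mid.mean(axis=0), (m, 1))
    # 4) section-to-sentence: fuse each sentence with its section state
    H_sen_new = np.concatenate([H_sen_mid, H_sec_new[section_of]], axis=1)
    return H_sen_new, H_sec_new
```

After the final step, each sentence representation carries both its local (intra-section) context and the global (inter-section) context of its section.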

Optimization
After passing through L hierarchical graph layers, we obtain the final sentence node representations H^L_sen = {h^L_1, h^L_2, ..., h^L_n} ∈ R^{n×d}. We then add a multi-layer perceptron (MLP) followed by a sigmoid activation function to produce the confidence score for extracting each sentence into the summary. Formally, the predicted confidence score ŷ_i for extracting sentence s_i in section sec_j as a summary sentence is:

ŷ_i = σ( W_o2 ReLU( W_o1 [h^L_i ∥ h^L_sec_j] ) ),    (10)

where σ is the sigmoid function, W_o1, W_o2 are trainable parameters, and [h^L_i ∥ h^L_sec_j] denotes the concatenation of the sentence embedding and its corresponding section embedding. During inference, we select the k sentences with the highest predicted confidence scores as the extractive summary of the input long document.
Since the extractive ground-truth labels for long documents are highly imbalanced, we optimize the hierarchical graph layers using a weighted cross-entropy loss following (Xiao and Carenini, 2019):

L_ce = - (1/N) Σ_{d=1}^{N} Σ_{i=1}^{N_d} ( η · y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ),    (11)

where N denotes the number of training instances, N_d denotes the number of sentences in document d, η = #negative / #positive denotes the ratio of negative to positive sentences in the document, and y_i represents the ground truth of sentence i.
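A sketch of the weighted loss and the top-k inference step, assuming the predicted probabilities ŷ are already computed; function names are illustrative:

```python
import numpy as np

def weighted_bce(y_true, y_pred):
    """Weighted cross entropy with positive weight eta = #negative/#positive,
    compensating for the label imbalance of long-document extraction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    eta = (y_true == 0).sum() / max((y_true == 1).sum(), 1)
    return float(-np.mean(eta * y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

def select_top_k(scores, k):
    # inference: keep the k highest-confidence sentences, in document order
    return sorted(int(i) for i in np.argsort(scores)[-k:])
```

Because η upweights the rare positive class, a confident miss on a summary sentence is penalized more heavily than one on a non-summary sentence.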
Training Loss Overall, we train CHANGES in an end-to-end manner, optimizing the graph contrastive module and the hierarchical graph layers simultaneously.
The overall training loss of CHANGES is:

L = L_ce + λ L_c,    (12)

where L_ce is the weighted cross-entropy loss in Equation 11, λ is a re-scaling hyperparameter, and L_c denotes the contrastive loss in Equation 7.

Experiment Setup
Dataset To validate the effectiveness of CHANGES, we conduct extensive experiments on two benchmark datasets: arXiv and PubMed (Cohan et al., 2018). The arXiv dataset contains papers from scientific domains, while the PubMed dataset contains scientific papers from the biomedical domain. These two benchmark datasets are widely adopted in long document summarization research, and we use the original train, validation, and test splits as in (Cohan et al., 2018). Detailed dataset statistics are shown in Table 1.
Evaluation Following the common setting, we use ROUGE F-scores (Lin, 2004) as the automatic evaluation metrics. Specifically, we report ROUGE-1/2 scores to measure summary informativeness and ROUGE-L scores to measure summary fluency. Following prior work (Liu and Lapata, 2019; Nallapati et al., 2016b), we also construct extractive ground-truth labels (ORACLE) for training by greedily optimizing the ROUGE score against the gold-reference abstracts.
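The greedy ORACLE construction can be sketched as follows, with unigram-overlap F1 as a crude stand-in for ROUGE (the actual pipeline uses ROUGE proper):

```python
from collections import Counter

def unigram_f1(candidate_tokens, reference_tokens):
    """Crude ROUGE stand-in: unigram overlap F1."""
    c, r = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

def greedy_oracle(sentences, abstract, max_sents=5):
    """Greedily add the sentence that most improves overlap with the gold
    abstract; the selected indices become the positive (y=1) labels."""
    ref = abstract.split()
    selected, best = [], 0.0
    while len(selected) < max_sents:
        gains = []
        for i, s in enumerate(sentences):
            if i in selected:
                continue
            cand = " ".join(sentences[j] for j in selected + [i]).split()
            gains.append((unigram_f1(cand, ref), i))
        score, i = max(gains)
        if score <= best:   # stop when no sentence improves the score
            break
        best, selected = score, selected + [i]
    return sorted(selected)
```

Stopping when no sentence improves the score keeps the oracle summaries compact, which is also why the resulting labels are so imbalanced for long documents.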

Implementation Details
We use the publicly released BERT-base (Devlin et al., 2018) as the sentence encoder. The BERT encoder is only used to generate initial sentence embeddings and is not updated during training, to improve model efficiency. We adopt the Graph Attention Network (Veličković et al., 2017) implementation with 8 attention heads and 2 stacked layers for graph message passing. The hidden size of our model is set to 2048.
Our model is trained with the Adam optimizer (Loshchilov and Hutter, 2017) with a learning rate of 0.0001 and a dropout rate of 0.1. We train on a single RTX A6000 GPU for 10 epochs, validating after each epoch using ROUGE-1 F-score and employing early stopping with a patience of 3 epochs to select the best model. We searched the training loss re-scaling factor λ over the range 0 to 1 with a step size of 0.1 and found 0.5 to perform best.

Experiment Results
Table 2 shows the performance comparison of CHANGES and all baseline methods on the PubMed and arXiv datasets. The first block includes the extractive ground truth ORACLE, the position-based sentence selection method LEAD, and other unsupervised baseline approaches. The second block covers state-of-the-art supervised extractive neural baselines, and the third block covers supervised abstractive baselines.
According to the results, HIPORANK (Dong et al., 2020) achieves state-of-the-art performance among graph-based unsupervised methods. Compared to PACSUM (Zheng and Lapata, 2019), its only difference is that HIPORANK incorporates section structural information into the degree centrality calculation. The performance gain demonstrates the significance of capturing the hierarchical structure of academic papers when modeling cross-sentence relations.
Interestingly, the LEAD approach performs far better when summarizing short news articles like CNN/DailyMail (Hermann et al., 2015) and New York Times (Sandhaus, 2008) than when summarizing academic papers, as shown in Table 2. This indicates that the distribution of ground-truth sentences in academic papers is more even; in other words, academic papers have less positional bias than news articles.
We also notice that neural extractive models tend to outperform neural abstractive methods in general, possibly because the extended context is more challenging for generative models during decoding. ExtSum-LG (Xiao and Carenini, 2019) is a strong extractive baseline that uses section information by incorporating both the global context of the whole document and the local context within the current topic. We argue that CHANGES better models the complex sentence structural information with its hierarchical graph than the LSTM-minus mechanism in ExtSum-LG.
According to the experimental results, CHANGES significantly outperforms all baseline approaches in terms of ROUGE F1 scores on both the PubMed and arXiv datasets. The performance improvements demonstrate the usefulness of the global theme-aware representations from the graph contrastive learning module and of the hierarchical graph structure for identifying salient sentences.

Ablation Study
We first analyze the influence of the different components of CHANGES in Table 3. The second row, 'w/o Contra', means we remove the GCL module and do not update the theme-aware sentence embeddings. The third row, 'w/o Hierarchical', denotes that we only use the theme-aware sentence embeddings for prediction, without the hierarchical graph layers. As shown in the table, removing either component causes a significant performance drop, which indicates that modeling sequential order information, semantic information, and hierarchical structural information are all necessary for academic paper summarization.
Interestingly, the theme-aware sentence embeddings and the hierarchy structure-aware sentence embeddings are almost equally critical to sentence salience modeling.The finding indicates the importance of modeling cross-sentence relations from both semantic and discourse structural perspectives.

Performance Analysis
We also analyze the sensitivity of CHANGES to the section structure and length of academic papers. As shown in Figure 3, we observe a downward performance trend as the number of sections increases, likely because a complex section structure hinders inter-section sentence interactions. Model performance on the arXiv dataset is more stable than on the PubMed dataset, although documents in arXiv are relatively longer. We notice the same trend in Figure 4: model performance is also more stable on arXiv across different document lengths. We argue this may imply that our model is better suited to longer documents with richer discourse structures.
Regarding document length, we see a steady performance gain over the benchmark baseline ExtSum-LG on both datasets, as shown in Figure 4. We also see that as document length increases, the performance gap between CHANGES and the extractive performance ceiling ORACLE becomes smaller. This finding further verifies that CHANGES is especially effective for modeling long academic papers.
Conclusion

In this paper, we propose CHANGES, a contrastive hierarchical graph-based model for scientific paper extractive summarization. CHANGES first learns global theme-aware sentence representations via a graph contrastive learning module. Moreover, CHANGES incorporates the sentence-section hierarchical structure by separating intra-section and inter-section message passing and aggregating both global and local information for effective sentence embeddings. Automatic evaluation on the PubMed and arXiv benchmark datasets proves the effectiveness of CHANGES and the importance of capturing both semantic and discourse structure information in modeling scientific papers.
Despite the strong zero-shot performance of large language models like ChatGPT on various downstream tasks, long document modeling remains a challenging problem in the LLM era. Transformer-based GPT-like systems still suffer from the computational complexity of attention and would benefit from effective and efficient long document modeling.

Limitations
In spite of the strong performance of CHANGES, its design still has the following limitations. First, CHANGES only exploits the sentence-section-document hierarchical structure of academic papers. We believe model performance could be further improved by incorporating document hierarchy at different granularities, such as dependency parse trees and Rhetorical Structure Theory trees; we leave this for future work. In addition, we only focus on single academic paper summarization in this work. Academic papers generally contain a large amount of domain knowledge, so introducing domain knowledge from peer papers or citation networks should further boost model performance.

Figure 2 :
Figure 2: An illustration of a hierarchical graph for a long input document with rich discourse structures.

Figure 3 :
Figure 3: ROUGE-1,2 performance of CHANGES for test papers with different section numbers.

Figure 4 :
Figure 4: ROUGE-1 performance of ExtSum-LG, CHANGES, ORACLE for test papers with different lengths.

Table 1 :
Statistics of the PubMed and arXiv datasets.

Table 2 :
Results on the PubMed and arXiv datasets. We keep the same train/validation/test split in all experiments and report ROUGE scores from the original papers if available, or scores from (Xiao and Carenini, 2019) otherwise.

Table 3 :
Ablation study results of removing components of CHANGES on PubMed and arXiv datasets.