Discourse-Aware Unsupervised Summarization for Long Scientific Documents

We propose an unsupervised graph-based ranking model for extractive summarization of long scientific documents. Our method assumes a two-level hierarchical graph representation of the source document, and exploits asymmetrical positional cues to determine sentence importance. Results on the PubMed and arXiv datasets show that our approach outperforms strong unsupervised baselines by wide margins in automatic metrics and human evaluation. In addition, it achieves performance comparable to many state-of-the-art supervised approaches which are trained on hundreds of thousands of examples. These results suggest that patterns in the discourse structure are a strong signal for determining importance in scientific articles.


Introduction
Single document summarization aims at shortening a text and preserving the most important ideas of the source document. While abstractive strategies generate summaries with novel words, extractive strategies select sentences from the source to form a summary (Nenkova et al., 2011). Despite recent advances in abstractive summarization, extractive models are still attractive in cases where faithfully preserving the original text is the priority. For example, legal arguments can hinge on the exact wording of a contract (Farzindar and Lapalme, 2004), and ensuring the factual correctness of a summary can be critical in the health or scientific domains, which is a known weakness of current abstractive methods (Kryściński et al., 2019).
Supervised neural-based models have been the dominant paradigm in recent extractive systems, at least for short news summarization (Nallapati et al., 2017; Dong et al., 2018; Zhou et al., 2018; Liu and Lapata, 2019; Narayan et al., 2018b; Zhang et al., 2019b). These models usually employ the encoder-decoder structure and have achieved promising performance on news datasets such as CNN/DailyMail (Hermann et al., 2015) and NYT (Sandhaus, 2008).

Introduction: although anxiety and depression are often related and coexist in pd patients, recent research suggests that anxiety rather than depression is the most prominent and prevalent mood disorder in pd.
Related Work: furthermore, since previous work, albeit limited, has focused on the influence of symptom laterality on anxiety and cognition, we also explored this relationship.
Methodology: this study is the first to directly compare cognition between pd patients with and without anxiety.
Result: the findings confirmed our hypothesis that anxiety negatively influences attentional set-shifting and working memory in pd.
Result: moreover, anxiety has been suggested to play a key role in freezing of gait (fog), which is also related to attentional set-shifting.
Future work: future research should examine the link between anxiety, set-shifting, and fog, in order to determine whether treating anxiety might be a potential therapy for improving fog.

Table 1: Example of a PubMed article's summary produced by our model HIPORANK. The hierarchical and directed graph combined with discourse-aware edge weighting allow HIPORANK to generate summaries that cover topics from different sections of the scientific article.
However, these models cannot easily be adapted to out-of-domain data with greater length and fewer training examples, such as scientific article summarization (Xiao and Carenini, 2019), due to two significant limitations. First, they require large numbers of domain-specific training pairs of source documents and gold-standard summaries, which are often not available or feasible to create (Zheng and Lapata, 2019). Second, the typical setup of a token-level encoder-decoder with an attention mechanism does not scale well to longer documents (Shao et al., 2017), as the number of attention computations is quadratic in the number of tokens in the input document.
We instead explore unsupervised approaches to these challenges in long document summarization. We show that a simple unsupervised graph-based ranking model, combined with principled modelling of discourse information as an inductive bias, can be remarkably effective at selecting important sentences from long scientific documents.
For the choice of unsupervised graph-based ranking model, we follow the paradigm of LexRank (Erkan and Radev, 2004) and PACSUM (Zheng and Lapata, 2019). In these methods, sentences are nodes and weighted edges represent the degree of similarity between sentences. Summary generation is formulated as a node selection problem, in which nodes (i.e., sentences) that are semantically similar to other nodes are chosen to be included in the final summary. In other words, they determine node importance by defining a notion of centrality in the graph.
In addition, we augment the document graph with directionality and hierarchy to reflect the rich discourse structure of long scientific documents. In particular, our method relies on two insights about the discourse structure of long scientific documents. The first is that important information typically occurs at the start and end of sections; i.e., they tend to appear near section boundaries (Baxendale, 1958;Lin and Hovy, 1997;Teufel, 1997). We implement this using an asymmetric edge weighting function in a directed graph which considers the distance of a sentence to a boundary. The second is that most sentences across section boundaries are unlikely to interact significantly with each other (Xiao and Carenini, 2019). We implement this insight by injecting hierarchies into our model, introducing section-level representations as graph nodes in addition to sentence nodes. By doing so, we convert a flat graph into a hierarchical non-fully-connected graph, which has two advantages: 1) reduced computational cost and 2) pruning of distracting weak connections between sentences across different sections.
We call our approach Hierarchical and Positional Ranking model (HIPORANK) and evaluate it on summarizing long scientific articles from PubMed and arXiv (Cohan et al., 2018). Empirical results show that our method significantly improves performance over previous unsupervised models (Zheng and Lapata, 2019;Erkan and Radev, 2004) in both automatic and human evaluation. In addition, our simple unsupervised approach achieves performance comparable to many expensive state-of-the-art supervised neural models that are trained on hundreds of thousands of examples of long document pairs (Xiao and Carenini, 2019;Subramanian et al., 2019). This suggests that patterns in the discourse structure are highly useful for determining sentence importance in long scientific articles, and that explicitly building in biases inspired by this structure is a viable strategy for building summarization systems.

Extractive Summarization of Long Scientific Papers
Despite the success of deep neural-based models on news summarization, these approaches typically face challenges when applied to long documents such as scientific articles. Furthermore, these approaches are often blind to the topical information resulting from the structured sections in scientific articles (Xiao and Carenini, 2019). Two recent neural supervised models address these issues. Subramanian et al. (2019) used the introduction section as a proxy for the whole document, while Xiao and Carenini (2019) divided articles into sections and used non-auto-regressive approaches to model global and local information. Besides neural approaches, most previous scientific article summarization systems employ traditional supervised machine learning algorithms with surface features as input (Xiao and Carenini, 2019). Surface features such as sentence position, sentence and document length, keyphrase score, and fine-grained rhetorical categories are often combined with Naive Bayes (Teufel and Moens, 2002), CRFs and SVMs (Liakata et al., 2013), or LSTMs and MLPs (Collins et al., 2017) for extractive summarization over long scientific articles. To the best of our knowledge, the only unsupervised extractive summarization approach for long scientific documents relies on citation networks (Qazvinian and Radev, 2008; Cohan and Goharian, 2015), extracting citation-contexts from citing articles and ranking these sentences to form the final summary. Our proposed method differs from their setting in that we perform single document summarization based on the long source article alone.

Figure 1: Example of a hierarchical document graph constructed by our approach on a toy document that contains two sections {T_1, T_2}, each containing three sentences, for a total of six sentences {s_1, ..., s_6}. Each double-headed arrow represents two edges with opposite directions. The solid and dashed arrows indicate intra-section and inter-section connections respectively. When compared to the flat fully-connected graph of traditional methods, our use of hierarchy effectively reduces the number of edges from 60 to 24 in this example.

Method
Our proposed method combines simple graph-based ranking algorithms with a two-level hierarchical model of the rich discourse structure of long scientific documents (Teufel, 1997; Xiao and Carenini, 2019). We incorporate this discourse information into the graph as inductive biases, through the construction of a directed hierarchical graph for document representation (Figure 1 and Section 3.2) and through the asymmetric weighting of edges with boundary functions (Section 3.3).

Graph-based Ranking Algorithm
Graph-based ranking algorithms for summarization represent a document as a graph G = (V, E), where V is the set of vertices that represent sentences or other textual units in the document, and E is the set of edges that represent interactions between sentences. The directed edge e_ij from node v_i to node v_j is typically weighted by w_ij = f(sim(v_i, v_j)), where sim is a measure of similarity between two nodes (e.g., cosine similarity between their distributed representations), and f can be an additional weighting function. These algorithms select the most salient sentences from V based on the assumption that sentences that are similar to a greater number of other sentences capture more important content and are therefore more informative.
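To make the centrality notion concrete, here is a minimal sketch (our own illustration, not the paper's code) of a flat, undirected variant: nodes are sentence embeddings, edge weights are cosine similarities, and a sentence's importance is the sum of its edge weights.

```python
import numpy as np

def centrality_scores(X):
    """Rank sentences by degree centrality in a flat similarity graph.

    X: (n, d) array of sentence embeddings. Each sentence is a node;
    the weight of edge (i, j) is the cosine similarity between
    sentences i and j. A sentence's importance is the sum of its
    edge weights (a simplified, undirected variant for illustration).
    """
    # Row-normalise so the dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    S = (X / norms) @ (X / norms).T
    np.fill_diagonal(S, 0.0)  # no self-loops
    return S.sum(axis=1)      # one importance score per sentence
```

Sentences similar to many others accumulate a large score and are ranked first; an off-topic sentence contributes and receives little weight.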

Hierarchical Document Graph Creation
To create a hierarchical document graph, we first split a document into its sections, then into sentences. To create the hierarchy, we allow two levels of connections in our hierarchical graph: intra-sectional connections and inter-sectional connections, as shown in Figure 1.
Intra-sectional connections aim to model the local importance of a sentence within its section. They implement the idea that a sentence that is similar to a greater number of other sentences in the same topic/section should be more important. This is realized in a fully-connected subgraph for an arbitrary section I, where we allow sentence-sentence edges between all sentence nodes within the same section.
Inter-sectional connections aim to model the global importance of a sentence with respect to other topics/sections in the document, as a sentence that is similar to a greater number of other topics is deemed more important. However, calculating sentence-sentence connections across different sections is computationally expensive and may also suffer from performance degradation due to weak edges between sentences that are unrelated as a result of being from different sections (Mihalcea and Tarau, 2004). To address these issues, we introduce section nodes on top of sentence nodes to form a hierarchical graph. For inter-sectional connections, we only allow section-sentence edges for modeling global information. This choice makes our approach more computationally efficient while greatly limiting the number of irrelevant inter-section edges that arise from the fact that sections in scientific documents typically have independent topics (Xiao and Carenini, 2019). In contrast, traditional graph-based ranking algorithms use a flat, fully-connected document graph with no sections.
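The two-level topology can be sketched as follows (a hypothetical illustration of the graph construction; the node naming is ours, not the paper's). For the toy document of Figure 1, two sections of three sentences each, it yields the 24 directed edges mentioned in the caption.

```python
def hierarchical_edges(section_sizes):
    """Enumerate directed edges of the two-level document graph.

    section_sizes: list of sentence counts per section. Sentence nodes
    are ('s', sec, i); section nodes are ('T', sec). Within a section,
    sentence nodes are fully connected; across sections, only
    section<->sentence edges are allowed. (Topology only; edge weights
    are assigned separately.)
    """
    edges = []
    for sec, n in enumerate(section_sizes):
        # Intra-sectional: fully-connected sentence-sentence edges.
        for i in range(n):
            for j in range(n):
                if i != j:
                    edges.append((('s', sec, i), ('s', sec, j)))
        # Inter-sectional: edges between this section node and every
        # sentence of every *other* section, in both directions.
        for other, m in enumerate(section_sizes):
            if other == sec:
                continue
            for i in range(m):
                edges.append((('T', sec), ('s', other, i)))
                edges.append((('s', other, i), ('T', sec)))
    return edges
```

Pruning cross-section sentence-sentence pairs is what shrinks the edge set from a quadratic number of sentence pairs to the intra-section cliques plus a thin layer of section-sentence links.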

Asymmetric Edge Weighting by Boundary functions
To calculate the weight of an edge, we first measure the similarity between a sentence-sentence pair, sim(v_i^I, v_j^I), or a section-sentence pair, sim(v^J, v_i^I). While our method is agnostic to the measure of similarity, we use cosine similarity with different vector representations in our experiments, averaging a section's sentence representations to obtain its own.
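A minimal sketch of the similarity machinery just described, assuming plain cosine similarity over NumPy vectors and mean-pooled section representations (function names are ours):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def section_representation(sentence_vecs):
    """A section's vector as the mean of its sentences' vectors,
    the averaging scheme described above."""
    return np.mean(sentence_vecs, axis=0)
```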
While the similarities of two graph nodes are symmetric, one may be more salient than the other when considering their discourse structures (Baxendale, 1958;Teufel, 1997). Based on these discourse hypotheses of long scientific documents, we capture this asymmetry by making our hierarchical graph directed and inject asymmetric edge weighting over intra-section and inter-section connections.
Asymmetric edge weighting over sentences Our asymmetric edge weighting is based on the hypothesis that important sentences are near the boundaries (start or end) of a text (Baxendale, 1958). We reflect this hypothesis by defining a sentence boundary function d_b over sentences v_i^I in section I such that sentences closer to the section's boundaries are more important:

d_b(v_i^I) = min(x_i^I, α · (n_I + 1 − x_i^I))    (1)

where n_I is the number of sentences in section I and x_i^I represents sentence i's position in section I. α ∈ R+ is a hyperparameter that controls the relative importance of the start or end of a section or document.
The sentence boundary function allows us to incorporate directionality in our edges and to weight edges differently depending on whether they are incident to a more important or less important sentence in the same section. Concretely, we define the weight w_ji^I for intra-section edges (incoming edges for i) as:

w_ji^I = λ2 · sim(v_i^I, v_j^I) if d_b(v_i^I) ≤ d_b(v_j^I), and λ1 · sim(v_i^I, v_j^I) otherwise    (2)

where λ1 < λ2 such that an edge e_ji incident to i is weighted more if i is closer to the text boundary than j. Edges with a weight below a certain threshold β can be pruned (i.e., set to 0).
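A sketch of the boundary function and the asymmetric intra-section weight, under our reading that a smaller d_b value means a sentence lies closer to a section boundary (and is thus more important); the default λ1, λ2, and α values are illustrative only:

```python
def boundary_fn(pos, n, alpha=1.0):
    """Sentence boundary function d_b (smaller = closer to a boundary).

    pos: 1-based position of the sentence within its section of n
    sentences. alpha trades off proximity to the start boundary
    against proximity to the end boundary.
    """
    return min(pos, alpha * (n + 1 - pos))

def intra_weight(sim_ij, db_i, db_j, lam1=0.0, lam2=1.0):
    """Asymmetric weight of the intra-section edge e_ji incoming to i.

    The edge gets the larger multiplier lam2 when i is at least as
    close to a boundary as j (lam1 < lam2); with lam1 = 0 the reverse
    edge is effectively pruned.
    """
    return lam2 * sim_ij if db_i <= db_j else lam1 * sim_ij
```

With λ1 = 0 (the paper's chosen setting), weight flows only towards boundary-adjacent sentences, which is what makes the graph directed.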

Asymmetric edge weighting over sections
Similarly, to reflect the hierarchy hypothesis over long scientific documents proposed by Teufel (1997), we also define a section boundary function d_b to reflect that sections near a document's boundaries are more important:

d_b(v^I) = min(x^I, α · (N + 1 − x^I))    (3)

where N is the number of sections in the document and x^I represents section I's position in the document. This section boundary function allows us to inject asymmetric edge weighting w_i^{JI} into inter-section edges:

w_i^{JI} = λ2 · sim(v^J, v_i^I) if d_b(v^I) ≤ d_b(v^J), and λ1 · sim(v^J, v_i^I) otherwise    (4)

where λ1 < λ2 such that an edge e_i^{JI} incident to i ∈ I is weighted more if section I is closer to the text boundary than section J.

Importance Calculation
We compute the overall importance of sentence v_i^I as the weighted sum of its inter-section and intra-section centrality scores:

importance(v_i^I) = μ1 · Σ_{v^J ∈ D} w_i^{JI} + Σ_{v_j^I ∈ I, j ≠ i} w_ji^I    (5)

where I is the set of sentences neighbouring v_i^I and D is the set of neighbouring sections in the hierarchical document graph; μ1 is a weighting factor for inter-section centrality.
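The importance computation can be sketched as a sum over incoming edge weights (a vectorised illustration of ours; the matrix layout is an assumption, not the paper's code):

```python
import numpy as np

def sentence_importance(W_intra, W_inter, mu1=0.5):
    """Overall importance per sentence in the hierarchical graph.

    W_intra[j, i]: weight of the intra-section edge e_ji incoming to
    sentence i. W_inter[J, i]: weight of the section->sentence edge
    from section J incoming to sentence i. mu1 scales the global
    (inter-section) contribution against the local one.
    """
    return mu1 * W_inter.sum(axis=0) + W_intra.sum(axis=0)
```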

Summary Generation
Lastly, we generate a summary by greedily extracting sentences with the highest importance scores until a predefined word limit L is passed. Most graph-based ranking algorithms recompute importance after each sentence is extracted in order to prevent content overlap. However, we find that the asymmetric edge scoring functions in (2) and (4) naturally prevent redundancy, because similar sentences have different boundary positional scores. Our method thus successfully extracts diverse sentences without recomputing importance.
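A sketch of the greedy extraction step, assuming the extracted sentences are restored to document order for the final summary (a common convention; the paper does not spell this out):

```python
def greedy_summary(sentences, scores, word_limit=200):
    """Greedily pick sentences by descending importance score until
    the word budget is passed, then restore document order.
    No rescoring between picks, per the method described above.
    """
    by_score = sorted(range(len(sentences)), key=lambda i: -scores[i])
    picked, words = [], 0
    for i in by_score:
        picked.append(i)
        words += len(sentences[i].split())
        if words >= word_limit:
            break
    return [sentences[i] for i in sorted(picked)]
```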

Experimental Setup
This section describes the datasets, the hyperparameter choices, the baseline models, and the evaluation metrics used in the experiments.

Datasets
Our experiments are conducted on PubMed and arXiv (Cohan et al., 2018), two large-scale datasets of long and structured scientific articles with abstracts as summaries. The average source article length is four to seven times longer than popular news benchmarks (Table 2), making them ideal candidates to test our method.

Implementation Details
Our model's hyperparameters for testing are chosen via ablation studies on the validation sets. The test results are reported with the following settings: λ1 = 0.0, λ2 = 1.0, α = 1.0, with μ1 = 0.5 for PubMed and μ1 = 1.0 for arXiv. During tuning, we fix λ2 to 1; the choices of λ1 ∈ {−0.2, 0, 0.5} represent whether the edge between a less boundary-important sentence and a more boundary-important sentence is 1) negatively weighted, 2) pruned, or 3) down-weighted. α ∈ {0, 0.5, 0.8, 1.0, 1.2} controls the relative importance of the start or end of a section or document, and μ1 ∈ {0.5, 1.0, 1.5} controls how much we weigh intra-section sentence importance against inter-section sectional importance.
For each dataset, we experimented with different pretrained distributional sentence representation models. The dimension of sentence representations is model-dependent (details in Section 6.2). We used the publicly released BERT model (Devlin et al., 2019; https://github.com/huggingface/transformers), the PACSUM BERT model (Zheng and Lapata, 2019; https://github.com/mswellhao/PACSUM), SentBERT and SentRoBERTa (Reimers and Gurevych, 2019; https://github.com/UKPLab/sentence-transformers), and BioMed word2vec representations (Moen and Ananiadou, 2013; http://bio.nlplab.org/word-vectors). A section's representation is calculated as the average of its sentences' representations. The similarity between sentences or sections is defined to be the cosine similarity between their distributed representations.

Baselines
We compare our approach with previous unsupervised and supervised models in extractive summarization. In addition, we also compare it with recent neural abstractive approaches for completeness.

Evaluation Methods
We evaluate our method with automatic evaluation metrics: ROUGE F1 scores (Lin, 2004). ROUGE-1 and ROUGE-2 compute unigram and bigram overlaps between system summaries and reference summaries, while ROUGE-L computes the longest common subsequence of the two.
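For illustration, a simplified ROUGE-N F1 can be computed as follows (a sketch only; the official ROUGE toolkit additionally handles stemming, stopword options, and multiple references):

```python
from collections import Counter

def rouge_n_f1(system, reference, n=1):
    """Simplified ROUGE-N F1: n-gram overlap between a system summary
    and a single reference summary, both given as whitespace-tokenised
    strings."""
    def ngrams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    sys_ng, ref_ng = ngrams(system), ngrams(reference)
    if not sys_ng or not ref_ng:
        return 0.0
    overlap = sum((sys_ng & ref_ng).values())  # clipped n-gram matches
    p = overlap / sum(sys_ng.values())
    r = overlap / sum(ref_ng.values())
    return 2 * p * r / (p + r) if p + r else 0.0
```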
In addition, we design a human evaluation experiment (details in Section 5.2) to compare our model with the best unsupervised summarization model, PACSUM (Zheng and Lapata, 2019). As far as we know, we are the first to perform human evaluation on the PubMed and arXiv datasets (Cohan et al., 2018). Human evaluation over long scientific articles requires annotators to comprehend a full domain-specific long article and compare multiple summaries for quality evaluation. Due to the challenging nature of the task, previous papers have skipped it and relied purely on automatic evaluations to judge system performance.

Automatic Evaluation Results
Tables 3 and 4 summarize our automatic evaluation results on the PubMed and arXiv test sets with the best hyperparameters, as described in Section 4.2. The first blocks in Tables 3 and 4 include the lead and the oracle baselines. The second and third blocks present the results of supervised abstractive and supervised extractive models, respectively. ROUGE-2 oracle summaries are used as gold-standard summaries for training the supervised extractive models, which likely contributes to their better ROUGE-2 scores.
The last blocks compare previous unsupervised models with our approach. Our model outperforms all other unsupervised approaches by wide margins in terms of ROUGE-1, 2, and L F1 scores on both the PubMed and arXiv datasets. We also show that PACSUM is biased towards selecting sentences that appear at the beginning of a document, while our method selects sentences in every section and near the article boundaries, similar to the oracle (Figure 2); we see a similar trend on arXiv (plots with more details can be found in the appendix). This overlap with gold-standard summaries suggests our use of discourse structure and hierarchy plays a significant role in our method's performance. Interestingly, despite access to only the validation set for hyperparameter tuning, our method achieves scores that are competitive with supervised models that require hundreds of thousands of training examples, outperforming almost all abstractive and extractive models on ROUGE-L. This suggests that our discourse-aware unsupervised model is surprisingly effective at selecting salient sentences in long scientific documents, and that it should perhaps serve as a strong baseline for assessing the merits of supervised approaches that aim to learn content beyond discourse.

Human Evaluation
We asked the human judges to read the reference summary (abstract) and presented extracted sentences from different summarization systems in a random and anonymized order. The judges were asked to evaluate each system summary sentence according to two criteria: 1) content coverage (whether the presented sentence contains content from the abstract); and 2) importance (whether the presented sentence is important for a goal-oriented reader even if it is not in the abstract (Lin and Hovy, 1997)). Table 5 presents the human evaluation results. HIPORANK is significantly better than PACSUM in both content coverage and importance (p = 0.002 and p = 0.007, respectively, with Mann-Whitney U tests). We also measured inter-rater reliability using Fleiss' κ (46.56 for content coverage and 41.37 for importance). These results support the claim that our method's use of hierarchy and discourse structure improves summarization quality.
Ablation Studies

Component-wise Analysis
Table 6 presents the ablation study assessing the relative contributions of the boundary function and the hierarchical information. We keep all hyperparameters unchanged with respect to the best setting in Section 4.2 and vary either the positional function or the hierarchical structure. We also found that the improvements from each component are stable across all the hyperparameters we tested (more details in the appendix). The first block of Table 6 reports the ablation results with different positional functions: no positional function (Erkan and Radev, 2004; Mihalcea and Tarau, 2004), the lead bias function (Zheng and Lapata, 2019), and our proposed boundary function. Comparing the lead bias function against no positional function, we can see that using the wrong positional function hurts the model's performance. Our boundary positional function significantly outperforms both the lead and no positional functions.
The second block of Table 6 reports the results with or without the hierarchical structure. We observe that adding the hierarchical information results in a huge performance improvement.

Effect of Embeddings
To disentangle the effect of sentence representation, we show PubMed test set results of our best model with different sentence embeddings in Table 7. While pretrained transformer models fine-tuned on sentence similarity improve performance, HIPORANK still consistently outperforms previous state-of-the-art unsupervised models (Table 3) even with random embeddings. These results once again suggest that our method's improvement can indeed be attributed to the use of hierarchy and discourse structure, rather than to the choice of representations. To further inspect our model's stability across different hyperparameter choices, we conducted a fine-grained analysis across all hyperparameter settings, as described below.

Stability of Hyperparameters
Stability w.r.t. Discourse Structure To evaluate the impact and stability of discourse-structure-informed edge weighting (Section 3.3), we first compared our boundary positional function (Eqns. (1) and (3)) to PACSUM's lead positional function, as well as to the standard undirected approach, over different hyperparameter settings. Figure 3 (a) shows that our method consistently performed better on the PubMed validation set, across the different hyperparameters and embedding models outlined in Section 4.2.
Stability w.r.t. Hierarchy We then evaluated the effect of adding hierarchy (Section 3.2) on top of our boundary positional function. In addition to decreasing the computational cost, Figure 3 (b) shows that incorporating hierarchy further improved ROUGE-L consistently across the different hyperparameters and embedding models we tested.
Application to other genres While our work here is focused on long scientific document summarization, we believe that our approach is promising for other genres of text, provided that the right discourse-aware biases are given to the model. Indeed, one version of our model with our proposed boundary function can be seen as a generalization of PACSUM, which achieves state-of-the-art performance on unsupervised summarization of news by exploiting the well known lead bias of news text (Zheng and Lapata, 2019;Grenander et al., 2019). We leave such explorations of adapting HIPORANK to other genres to future work.

Conclusion
We presented an unsupervised graph-based model for long scientific document summarization. The proposed approach augments the measure of sentence centrality by inserting directionality and hierarchy in the graph with boundary positional functions and hierarchical topic information grounded in discourse structure. Our simple unsupervised approach with rich discourse modelling outperforms previous unsupervised graph-based summarization models by wide margins and achieves comparable performance to state-of-the-art supervised neural models. This makes our model a lightweight but strong baseline for assessing the performance of expensive supervised approaches for long scientific document summarization.

A.1 Different Hierarchical Structure
Besides our proposed hierarchical model (Figure 4 (c), hierarchy-add), we also proposed and experimented with another hierarchical graph that introduces section-section connections (Figure 4 (b), hierarchy-multiply). In this hierarchical setting, we multiply a sentence's sectional importance with its sentence importance (Eqn. (2)) to form the final centrality score. Our empirical results indicate that the hierarchy-multiply model always outperforms no-hierarchy models (Figure 4 (a)) but underperforms hierarchy-add. Nevertheless, Table 8 shows that adding any hierarchical structure results in performance improvements by wide margins when compared to the no-hierarchy model.

Figure 5 shows the sentence positions in the source document for extractive summaries generated by different models on the arXiv validation set. We can again see that PACSUM is biased towards selecting sentences that appear at the beginning of a document, while our method selects sentences in every section and near the article boundaries, similar to the oracle.