Considering Nested Tree Structure in Sentence Extractive Summarization with Pre-trained Transformer

Sentence extractive summarization shortens a document by selecting sentences for a summary while preserving its important contents. However, constructing a coherent and informative summary is difficult using a pre-trained BERT-based encoder since it is not explicitly trained for representing the information of sentences in a document. We propose a nested tree-based extractive summarization model on RoBERTa (NeRoBERTa), where nested tree structures consist of syntactic and discourse trees in a given document. Experimental results on the CNN/DailyMail dataset showed that NeRoBERTa outperforms baseline models in ROUGE. Human evaluation results also showed that NeRoBERTa achieves significantly better scores than the baselines in terms of coherence and yields comparable scores to the state-of-the-art models.


Introduction
Document summarization is the task of creating a concise summary from a given document while keeping the original content. In general, sentence extraction methods, which select sentences in a document to create its summary, have the advantage of truthfulness compared with abstractive methods (Cao et al., 2018) and of fluency compared with word extraction methods (Xu et al., 2020).
Neural networks have achieved great success in sentence extraction-based document summarization (Cheng and Lapata, 2016; Zhou et al., 2018). Recently, Liu and Lapata (2019) proposed BERTSUM, which utilizes BERT (Devlin et al., 2019) for sentence representations to create a summary. Although the use of BERT resulted in significant performance improvement, this method decides the selection for each sentence independently. Xu et al. (2020) proposed DISCOBERT by considering inter-sentence information through discourse graphs to construct a coherent summary. Although they achieved remarkable scores in ROUGE, it was still difficult to construct a coherent summary compared to BERTSUM in human evaluation. Zhong et al. (2020) attempted to change the paradigm by formulating summary-level extraction with a RoBERTa encoder and achieved state-of-the-art results on the CNN/DailyMail dataset.
In spite of the successful results of the above BERT-related methods, their sentence representations have room for improvement. As has been reported, "[CLS]", a pre-defined token for indicating sentence representations in BERT, is insufficient to express sentence information. The problem also holds for RoBERTa because its pre-training step lacks next sentence prediction. Therefore, to further improve summarization performance, we need to consider how to represent sentences in a BERT-related model and how to capture relationships between such sentence representations. This is key to creating a coherent and informative summary with sentence extraction methods.
To tackle this problem, we propose a nested tree-based extractive summarization model on RoBERTa (NeRoBERTa). NeRoBERTa can extract coherent sentences for a summary of a given document by utilizing nested tree structures 1 of two different trees, syntactic and discourse dependency trees (Zhao and Huang, 2017). Figure 1 shows the proposed NeRoBERTa to select sentences from a given document. Different from the previous works that focused on inter-sentence information using discourse graphs (Ishigaki et al., 2019; Xu et al., 2020), NeRoBERTa considers both intra- and inter-sentence information (syntactic and discourse graphs) together as a nested tree. The nested tree is encoded as a vector space representation through a graph attention network (Veličković et al., 2018) on a BERT-based encoder. In this tree, we can explicitly represent sentence information at "root" words for each syntactic dependency tree without relying only on "[CLS]" tokens.
1 Kikuchi et al. (2014) considered the nested tree structure in the traditional non-neural tree-trimming method. Their method extracted words by tracking their parent words and sentences to construct a summary for a given document.
This representation is useful to extract informative and coherent sentences in that it can capture keywords in a sentence for considering textual coherence to other sentences. Furthermore, based on the representation, we can also capture interactions between sentences through discourse dependency trees, succeeding in extracting coherent sentences. It is also possible to consider even long-distance relationships as higher-order dependency relationships in this structure, such as relationships between children and their ancestors. Thus, NeRoBERTa considers textual coherence through both syntactic and discourse trees to capture long-distance interactions between sentences.
Experimental results on the CNN/DailyMail dataset showed that our NeRoBERTa outperforms RoBERTa-based strong baselines in ROUGE. Unlike the previous work (Xu et al., 2020), NeRoBERTa successfully constructs a coherent summary and is comparable to the state-of-the-art methods in human evaluation.

Nested Tree Structure
In this section, we describe how we construct two different types of graphs for a nested tree structure: a discourse graph and a syntactic graph.
We obtain discourse dependency relationships between sentences in a document through an RST parser. A given document can be parsed into a tree format with the RST parser, where each leaf node is an elementary discourse unit (EDU), a text span in the document. Each text span is one of two types, nucleus or satellite. While the nucleus spans contain semantically salient information, the satellite spans support and modify the nucleus ones.
We use the recent state-of-the-art RST parser 2 (Kobayashi et al., 2020) to build an RST discourse tree (RST-DT) for all documents and convert it to an Inter-Sentential RST-DT (ISRST-DT). The ISRST-DT is first converted into a dependency-based discourse tree (ISDEP-DT) using the method of Hirao et al. (2013). Then, parent-child dependency relationships for each sentence can be formed. We construct a directed graph for the discourse dependencies (Ishigaki et al., 2019).
A dependency parser is used to build up the syntactic dependency relationships between words (Manning et al., 2014). We construct an undirected graph for the syntactic dependencies by following the previous settings (Marcheggiani and Titov, 2017).
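The two graph types described in this section can be sketched as edge sets built from head indices. This is only an illustration under stated assumptions: the function names, the use of -1 to mark a root, and the parent-to-child direction of the discourse edges are hypothetical choices, not details given in the paper.

```python
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def discourse_edges(heads: List[int]) -> Set[Edge]:
    """Directed edges (parent -> child) between sentences.

    heads[k] is the index of the parent sentence of sentence k,
    or -1 if sentence k is the discourse root (assumed convention).
    """
    return {(head, child) for child, head in enumerate(heads) if head >= 0}


def syntactic_edges(heads: List[int]) -> Set[Edge]:
    """Undirected edges between words, stored as both directions.

    heads[i] is the index of the head word of word i,
    or -1 if word i is the syntactic root (assumed convention).
    """
    edges: Set[Edge] = set()
    for child, head in enumerate(heads):
        if head >= 0:
            edges.add((head, child))
            edges.add((child, head))
    return edges


# Toy document with three sentences: sentence 0 is the discourse root.
sentence_heads = [-1, 0, 1]
# Toy sentence "the cat sleeps": "sleeps" (index 2) is the syntactic root.
word_heads = [1, 2, -1]

d_edges = discourse_edges(sentence_heads)
s_edges = syntactic_edges(word_heads)
```

Storing an undirected edge as two directed pairs is one common way to feed such a graph to a message-passing layer that expects directed edge lists.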

Our Model
Ishigaki et al. (2019) consider dependency information through hierarchical attention modules (Kamigaito et al., 2018) trained in supervised attention for dependency heads (Kamigaito et al., 2017). Unlike the previous work, our model uses constructed graph information through graph encoder layers that directly focus on the relationships between nodes defined by edges in the graph. We explain the details of our model in this section.
Let $w_i$ be the $i$-th token in a document $D = \{w_1, w_2, \dots, w_n\}$. Our model predicts $p(1 \mid D, k)$, the probability of the $k$-th sentence in $D$ being kept in a summary, through the following modules.

Pre-trained Document Encoder
We append "[CLS]" and "[SEP]" tokens between sentences to encode a whole document (Liu and Lapata, 2019). Then, BERT is used to build up a representation $h_i$ for each token $w_i$ as follows: $\{h_1, h_2, \dots, h_n\} = \mathrm{BERT}(w_1, w_2, \dots, w_n)$. Instead of BERT, we consider RoBERTa as well. However, RoBERTa cannot be directly used in place of BERT for sentence-level extraction because RoBERTa does not consider the two types of tokens for the segment boundaries. To address this issue, we use randomly initialized segment embeddings, $W_{type} \in \mathbb{R}^{2 \times 768}$, instead of the original embeddings to keep the same condition as BERT. The shape $2 \times 768$ follows the pre-trained segment embedding weights of the original BERT, which are associated with its next sentence prediction step. Then, the encoded hidden states, $\{h_1, h_2, \dots, h_n\}$, are fed into our graph encoders.
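The input formatting described above can be sketched as follows. The alternating 0/1 segment ids follow the interval-segment convention commonly associated with BERTSUM and are an assumption here, as are the function name and the literal "[CLS]"/"[SEP]" strings (a real RoBERTa tokenizer uses different special tokens).

```python
from typing import List, Tuple


def format_document(
    sentences: List[List[str]],
) -> Tuple[List[str], List[int], List[int]]:
    """Wrap each sentence in [CLS] ... [SEP] and assign segment ids.

    Returns the flat token list, one segment id per token (alternating
    0/1 by sentence, an assumed convention), and the position of each
    sentence's [CLS] token.
    """
    tokens: List[str] = []
    segment_ids: List[int] = []
    cls_positions: List[int] = []
    for k, sentence in enumerate(sentences):
        cls_positions.append(len(tokens))
        piece = ["[CLS]"] + sentence + ["[SEP]"]
        tokens.extend(piece)
        segment_ids.extend([k % 2] * len(piece))
    return tokens, segment_ids, cls_positions


doc = [["the", "cat", "sleeps"], ["it", "purrs"]]
tokens, segment_ids, cls_positions = format_document(doc)
```

With only two segment types, a 2-row embedding table such as $W_{type}$ is enough to distinguish neighboring sentences.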

Graph Encoders
Graph Notation: Let $V_d$ and $V_s$ be nodes for sentences and words, and $E_d$ and $E_s$ be edges between the nodes in $V_d$ and $V_s$, respectively. We denote the constructed discourse and syntactic graphs as $G_d = (V_d, E_d)$ and $G_s = (V_s, E_s)$. We append undirected edges between "[CLS]" and "root" tokens in each sentence to $E_s$ because the parent of a "root" token would be a sentence representation.
GAT Networks: We use Graph Attention Networks (GAT) (Veličković et al., 2018) to encode each graph $G$ on the hidden states of BERT as follows: [equations not recoverable from the source text] where $F_i$ indicates feed-forward networks stacked $i$ times. $N$ is layer normalization. $W_n$ and $W_a$ are learnable weights. $L$ and $T$ denote a non-linear activation function, LeakyReLU, and a hyperbolic tangent, respectively. $\alpha_{i,j}$ indicates normalized attention coefficients through a softmax function. $\Vert$ indicates concatenation, and $n_i$ represents the nodes connected to node $i$ in graph $G$. ReLU is an activation function. $M$ is a learnable weight. After $h_i$ is fed into the graph encoder, we obtain $h^{G}_i$, which contains either syntactic or discourse graph information based on all tokens.
The syntactic and discourse graphs are independently encoded. Then, they are concatenated as $h^{root}_k = \mathrm{ReLU}(W(h^{G_s}_{r(k)} \Vert h^{G_d}_{r(k)}))$, where $r(k)$ indicates the position of a root in the $k$-th sentence. For the final representations to predict labels, we use $h^{root}_k$ to represent the $k$-th sentence.
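To make the role of the attention coefficients over connected nodes concrete, here is a minimal single-head sketch in the style of the standard GAT formulation of Veličković et al. (2018). It is not the authors' exact layer: the linear projections, feed-forward stacks, and layer normalization are omitted, and all names and toy values are hypothetical.

```python
import math
from typing import Dict, List


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def attention_weights(
    h: List[List[float]], neighbors: Dict[int, List[int]], a: List[float]
) -> Dict[int, Dict[int, float]]:
    """alpha[i][j]: softmax over neighbors j of LeakyReLU(a . [h_i || h_j])."""
    alpha: Dict[int, Dict[int, float]] = {}
    for i, nbrs in neighbors.items():
        if not nbrs:
            continue
        scores = [
            leaky_relu(sum(w * x for w, x in zip(a, h[i] + h[j]))) for j in nbrs
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return alpha


def aggregate(
    h: List[List[float]],
    neighbors: Dict[int, List[int]],
    alpha: Dict[int, Dict[int, float]],
) -> List[List[float]]:
    """Replace each node vector by the attention-weighted sum of its neighbors."""
    out: List[List[float]] = []
    for i in range(len(h)):
        nbrs = neighbors.get(i, [])
        if not nbrs:
            out.append(list(h[i]))  # isolated node: keep its vector
            continue
        out.append(
            [sum(alpha[i][j] * h[j][d] for j in nbrs) for d in range(len(h[i]))]
        )
    return out


# Toy graph: node 0 is connected to nodes 1 and 2 (undirected).
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
a = [0.5, 0.5, 0.5, 0.5]  # hypothetical attention vector over [h_i || h_j]

alpha = attention_weights(h, neighbors, a)
out = aggregate(h, neighbors, alpha)
```

In a library setting, a layer such as PyTorch Geometric's GAT convolution would typically replace this hand-written loop.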

Objective Function & Inference
We define $p(1 \mid D, k) = \sigma(W M(h^{root}_k) + b)$, where $M$ is a two-stacked multi-head attention, $\sigma$ is a sigmoid function, and $W$ and $b$ are weight parameters (Liu and Lapata, 2019). Let $y_i \in \{1, 0\}$ be an oracle label and $Y = \{y_1, y_2, \dots, y_n\}$ be its set for a document. We use $-\sum_{y_k \in Y} \log p(y_k \mid D, k)$ as our objective function. In the inference step, we score the $k$-th sentence with $p(1 \mid D, k)$ and sort the sentences in descending order. Then, we keep the top $m$ sentences as a summary, where $m$ is the number of sentences to be extracted.
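The objective and the inference step can be illustrated with a small sketch. It assumes the per-sentence probabilities have already been computed; the function names are hypothetical, and returning the selected indices in document order is an added convenience rather than something stated in the text.

```python
import math
from typing import List


def negative_log_likelihood(probs: List[float], labels: List[int]) -> float:
    """Sum over sentences of -log p(y_k | D, k), where probs[k] = p(1 | D, k)."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        math.log(p + eps) if y == 1 else math.log(1.0 - p + eps)
        for p, y in zip(probs, labels)
    )


def select_top_m(probs: List[float], m: int) -> List[int]:
    """Rank sentences by p(1 | D, k) in descending order and keep the top m."""
    ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    return sorted(ranked[:m])  # document order (an assumption of this sketch)


probs = [0.9, 0.2, 0.7, 0.4]
labels = [1, 0, 1, 0]
loss = negative_log_likelihood(probs, labels)
summary_ids = select_top_m(probs, m=2)
```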

Experimental Settings
Dataset: We used the non-anonymized CNN/DailyMail dataset (Hermann et al., 2015). Based on the standard split, we divided the dataset into 287,226, 13,368 and 11,490 articles for training, validation, and test datasets, respectively.
Parameter Settings: We used PyTorch with PyTorch Geometric (Fey and Lenssen, 2019) to build the entire architecture with graph encoders. The "bert-base-uncased" and "roberta-base" models in transformers 3 were used to encode up to 768 tokens of each tokenized document. The best model was selected based on the lowest "loss" score on the validation dataset. A greedy search was used to construct the oracle summary by maximizing the sum of ROUGE-1-F and ROUGE-2-F against the gold summary.
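The greedy oracle construction can be sketched as below. A simplified n-gram F1 stands in for the real ROUGE-1-F and ROUGE-2-F scores, and stopping once no sentence improves the score and capping the oracle at three sentences are assumptions of this sketch, not details given in the text.

```python
from collections import Counter
from typing import List, Optional


def ngram_f1(candidate: List[str], reference: List[str], n: int) -> float:
    """Simplified n-gram F1, a stand-in for ROUGE-n-F."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(
    sentences: List[List[str]], gold: List[str], max_sents: int = 3
) -> List[int]:
    """Greedily add the sentence that most improves F1(1-gram) + F1(2-gram)."""
    selected: List[int] = []
    best = 0.0
    while len(selected) < max_sents:
        best_k: Optional[int] = None
        for k in range(len(sentences)):
            if k in selected:
                continue
            cand = [t for idx in sorted(selected + [k]) for t in sentences[idx]]
            score = ngram_f1(cand, gold, 1) + ngram_f1(cand, gold, 2)
            if score > best:
                best, best_k = score, k
        if best_k is None:  # no remaining sentence improves the score
            break
        selected.append(best_k)
    return sorted(selected)


sentences = [
    "barcelona is happy with enrique".split(),
    "the weather was fine".split(),
    "enrique took charge last summer".split(),
]
gold = "barcelona is happy with enrique who took charge last summer".split()
oracle = greedy_oracle(sentences, gold)
```

Sentences in the resulting index list would receive the oracle label 1 and all others 0.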
For the syntactic graph encoder, we stacked GAT Networks. To track $n$-order dependency information, we simply added $n$-order nodes and edges to $G_d$ and $G_s$. The number of attention heads was set to 6 in each graph encoder. To represent each word vector, we used the vector of its first sub-word. We employed the conventional approach of selecting the top 3 sentences to construct a summary (Liu and Lapata, 2019). Trigram blocking was used to reduce redundancy and to improve informativeness for all models (Paulus et al., 2018).
Compared Methods: We compared our proposed methods with some baselines. The proposed methods are as follows:
NeRoBERTa considers our nested tree structure for both syntactic and discourse information.
SynRoBERTa and DiRoBERTa independently consider only either syntactic or discourse tree structure, respectively.
The baselines, which include state-of-the-art models, are as follows:
BERTSUM introduces a method for learning a sentence boundary in a BERT-based model for the document summarization task (Liu and Lapata, 2019).
DISCOBERT constructs a summary based on EDU-level extraction, incorporating discourse and coreference information (Xu et al., 2020).
MatchSum attempts to shift the paradigm from sentence-level to summary-level extraction during the extractive document summarization task (Zhong et al., 2020).
RoBERTa encodes input documents using a "roberta-base" model.
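The top-3 selection with trigram blocking mentioned in the parameter settings can be sketched as follows: sentences are visited in descending score order, and a sentence is skipped if it shares any trigram with the sentences already selected. The function names and the toy scores are hypothetical.

```python
from typing import List, Set, Tuple

Trigram = Tuple[str, ...]


def trigrams(tokens: List[str]) -> Set[Trigram]:
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(
    sentences: List[List[str]], scores: List[float], m: int = 3
) -> List[int]:
    """Pick up to m sentences by score, skipping any that repeat a trigram."""
    ranked = sorted(range(len(sentences)), key=lambda k: scores[k], reverse=True)
    chosen: List[int] = []
    seen: Set[Trigram] = set()
    for k in ranked:
        tri = trigrams(sentences[k])
        if tri & seen:  # overlaps with an already selected sentence
            continue
        chosen.append(k)
        seen |= tri
        if len(chosen) == m:
            break
    return chosen


sentences = [
    "barcelona are very happy with enrique".split(),
    "the club are very happy with him".split(),
    "enrique took charge last summer".split(),
    "suarez opened the scoring".split(),
]
scores = [0.9, 0.8, 0.7, 0.6]  # hypothetical model scores p(1 | D, k)
picked = select_with_trigram_blocking(sentences, scores)
```

Here the second sentence is blocked because it repeats "are very happy" from the first, so the fourth sentence fills the remaining slot.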

Automatic Evaluation
We utilized ROUGE metrics for the evaluation. The experimental results on the CNN/DailyMail dataset are shown in Table 1. The first block contains Lead-3 and Oracle scores. The second block includes BERT-based previous studies including state-of-the-art models. The last block includes scores for our models and for re-implemented BERTSUM.
Our strong baseline RoBERTa outperformed BERTSUM. The gain might be from using a bigger dataset with the dynamic masking pattern applied in the pre-trained RoBERTa. SynRoBERTa and DiRoBERTa show that considering syntactic or discourse information was beneficial. NeRoBERTa ($n_s = \{1, 2\}$, $n_d = \{1\}$) (in bold), which considers syntactic and discourse information simultaneously, further improved the performance. It outperformed RoBERTa by a clear margin, specifically, 0.31 points in the R-1-F score.
As can be seen in Figure 2, RoBERTa can improve the prediction loss compared with BERTSUM. SynRoBERTa ($n_s = \{1, 2\}$), which explicitly incorporates keyword information through syntactic information, can further improve the performance of RoBERTa. This shows that considering keyword information through syntactic structures is beneficial for constructing sentence representations that account for textual coherence to other sentences.

Human Evaluation and Analysis
Human evaluation was conducted on 100 randomly sampled documents from the test dataset. Amazon Mechanical Turk was used for the experiments, and human evaluators graded scores from 1 to 5 (5 is the best) in terms of four evaluation criteria. 5 Because summaries from DISCOBERT were worse than ones from BERTSUM in their human evaluation (Xu et al., 2020), we evaluated only summaries from RoBERTa, NeRoBERTa ($n_s = \{1, 2\}$, $n_d = \{1\}$), and MatchSum. Table 2 shows the results. Coh, Infor, Read, and Redun indicate coherence, informativeness, readability, and redundancy, respectively. As we expected, the proposed NeRoBERTa, which considers a nested tree structure, could capture coherence better than our strong baseline, RoBERTa. In addition, NeRoBERTa was comparable to the current state-of-the-art model, MatchSum. The informativeness score for MatchSum was lower than those for RoBERTa and NeRoBERTa.
Table 3 shows example extracted sentences from a document and their discourse graph. In this example, the discourse information alone was not enough in that S3 and S10 have the same discourse information, while S3 is more similar to the third sentence in the gold summary. RoBERTa and DiRoBERTa constructed the same summary including S10. On the other hand, NeRoBERTa could extract S3, which is coherent with S4 and S5, sharing the important keywords "enrique" and "bartomeu". This is because our GAT network for syntactic information can capture keywords in the sentence to consider textual coherence to other sentences. Although NeRoBERTa constructed a summary with three sentences, MatchSum extracted only two sentences, S4 and S5. In this case, MatchSum might be less informative than NeRoBERTa.

Table 2: Human evaluation results. † indicates that the improvement with NeRoBERTa from RoBERTa was statistically significant. 4

S1 Barcelona club president josep maria bartomeu has insisted that the la liga leaders have no plans to replace luis enrique and they're 'very happy' with him.
S3 Despite speculation this season that enrique will be replaced in the summer, bartomeu refuted these claims and says he's impressed with how the manager has performed.
S4 Luis enrique only took charge at the club last summer and has impressed during his tenure.
S5 Barcelona president josep maria bartemou says the club are 'very happy' with enrique's performance.
S10 Enrique's side comfortably dispatched of champions league chasing valencia on saturday, with goals from luis suarez and lionel messi.
S11 luis suarez opened the scoring for barcelona [...] flying Valencia
Gold Barcelona president josep bartomeu says the club are happy with enrique. barca are currently top of la liga and closing in on the league title. enrique's future at the club has been speculated over the season. click here for all the latest barcelona news.
Table 3: [...] and MatchSum models. Arrows indicate the discourse graphs. The sentences in red are selected by all models. The sentence in blue is selected by NeRoBERTa and the sentence in purple is selected by RoBERTa and DiRoBERTa. S1 is the first sentence of the document. Gold denotes the gold summary.

Conclusion
In this paper, we proposed NeRoBERTa, which incorporates syntactic and discourse information as a nested tree structure to create an informative and coherent summary. The experimental results on the CNN/DailyMail dataset showed that our method improves the performance over the baseline methods both in the automatic and human evaluations.