A Simple and Strong Baseline for End-to-End Neural RST-style Discourse Parsing

To promote and further develop RST-style discourse parsing models, we need a strong baseline that can serve as a reference for reporting reliable experimental results. This paper explores a strong baseline by integrating existing simple parsing strategies, top-down and bottom-up, with various transformer-based pre-trained language models. The experimental results obtained from two benchmark datasets demonstrate that parsing performance relies more strongly on the pre-trained language models than on the parsing strategies. In particular, the bottom-up parser achieves large performance gains over the current best parser when employing DeBERTa. Through our analysis of intra- and multi-sentential parsing and of nuclearity prediction, we further reveal that language models with a span-masking scheme especially boost parsing performance.


Introduction
Rhetorical Structure Theory (RST) (Mann and Thompson, 1987) is one of the most influential theories for representing the discourse structure behind a document. According to the theory, a document is represented as a recursive constituent tree that indicates the relations between text spans consisting of a single elementary discourse unit (EDU) or contiguous EDUs. The label of a nonterminal node describes the nuclearity status, either nucleus or satellite, of the text span, and the edge indicates the rhetorical relation between the text spans (Figure 1).
RST-style discourse parsing (hereafter RST parsing) is a fundamental task in NLP and plays an essential role in several downstream tasks, such as text summarization (Liu and Chen, 2019), question answering (Gao et al., 2020), and sentiment analysis (Bhatia et al., 2015). In most cases, the performance of an RST parsing method has been evaluated on the largest English treebank, the RST discourse treebank (RST-DT) (Carlson et al., 2002), as the benchmark dataset. The evaluation measures used include the micro-averaged F1 score of unlabeled spans, that of nuclearity-labeled spans, that of rhetorical relation-labeled spans, and that of fully labeled spans, based on standard Parseval (Morey et al., 2017), when using gold EDU segmentation.
There are two major strategies for RST parsing: top-down and bottom-up. The former builds RST trees by recursively splitting a larger text span consisting of EDUs into smaller ones. The latter builds trees by merging two adjacent text spans. Non-neural parsers with classical handcrafted features prefer the bottom-up strategy (duVerle and Prendinger, 2009; Feng and Hirst, 2012; Wang et al., 2017). On the other hand, recent neural parsers prefer the top-down strategy (Kobayashi et al., 2020; Koto et al., 2021; Nguyen et al., 2021; Zhang et al., 2021), while a few parsers employ the bottom-up strategy. With advances in neural network models, such neural parsers obtain significant gains over the non-neural parsers.
Several techniques have been proposed to boost RST parsing performance: Nguyen et al. (2021) and Shi et al. (2020) introduced beam search, and Koto et al. (2021) exploited dynamic oracles to improve their parsing algorithms. Kobayashi et al. (2020) used a model ensemble over multiple runs, Zhang et al. (2021) introduced adversarial training, and Kobayashi et al. (2021) exploited silver data to improve parameter optimization.
Furthermore, pre-trained language models are playing an important role in improving parsing performance. Shi et al. (2020), Nguyen et al. (2021), and Zhang et al. (2021) employed XLNet to obtain better vector representations for arbitrary text spans consisting of EDU(s). As a result, the current best top-down parser, Zhang et al. (2021) with XLNet, achieved a fully labeled span F1 score of 53.8 with standard Parseval, a gain of 4-5 points over the best non-neural parser (Wang et al., 2017). However, it is still unclear which of the models' various factors, such as parsing strategies, pre-trained language models, and model ensembles, contributed the most to the improvement. Therefore, we need a simple but strong baseline for different parsing strategies along with a pre-trained language model to clarify the effects of the parsing strategies and pre-trained models. The baseline will contribute to building more reliable experiments for revealing the effectiveness of newly proposed methods. This paper aims to build strong baselines for RST parsing, based on two simple open-source parsers, top-down (Kobayashi et al., 2020) and bottom-up, employing transformer-based pre-trained language models, without incorporating any mechanisms of our own. The experimental results on RST-DT (Carlson et al., 2002) and the Instructional Discourse Treebank (Instr-DT) (Subba and Di Eugenio, 2009) with various pre-trained language models demonstrate that parsing performance relies more strongly on the pre-trained language models than on the parsing strategies. While the current trend favors top-down parsers, a bottom-up parser with DeBERTa (He et al., 2021), one of the current state-of-the-art pre-trained language models, achieved the best scores, which are higher than those of the current state-of-the-art parsers.
Further, our analysis of intra- and multi-sentential parsing and of nuclearity prediction revealed that pre-trained language models with a span-masking scheme improve parsing performance more than those with a token-masking scheme. We will release our code at https://github.com/nttcslab-nlp/RSTParser_EMNLP22.

Related Work
Early studies on RST parsing were based on non-neural supervised learning methods with handcrafted features. As parsing strategies, bottom-up greedy algorithms (duVerle and Prendinger, 2009; Feng and Hirst, 2012), shift-reduce (Wang et al., 2017), CRFs (Feng and Hirst, 2014), and CKY-like parsing algorithms (Joty et al., 2013, 2015) have been employed. In particular, Wang et al.'s (2017) shift-reduce parser, based on SVMs, achieved the best results among the non-neural statistical models on the RST-DT. The method first builds nuclearity-labeled RST trees and then assigns relation labels between two adjacent spans consisting of a single EDU or multiple EDUs.
Inspired by the success of neural networks in many NLP tasks, several early neural network-based models were proposed for RST parsing (Ji and Eisenstein, 2014; Li et al., 2014, 2016). However, as reported by Morey et al. (2017), while some neural approaches outperformed classical approaches, it was not by a large margin.
Recent end-to-end neural RST parsing models with sophisticated language models, such as GloVe and ELMo, achieved better performance. They used vector representations of text spans based on LSTMs whose inputs are word embeddings from the language models. Yu et al. (2018) proposed a bottom-up parser, based on the shift-reduce algorithm, that leverages the information from their neural dependency parsing model within a sentence for RST parsing. The method outperformed traditional non-neural methods and obtained a remarkable relation-labeled span F1 score of 49.2. As another approach, a top-down neural parser based on a sequence-to-sequence (seq2seq) framework was proposed for use only at the sentence level. The method parses a tree in a depth-first manner with a pointer-generator network. Zhang et al. (2020) extended the method and applied it to document-level RST parsing. Kobayashi et al. (2020) proposed another top-down RST parsing method, based on a minimal span-based approach, that splits a span into smaller ones recursively and exploits multiple granularity levels in a document. They also demonstrated the impact of a model ensemble. Koto et al. (2021) extended Kobayashi et al.'s (2020) parser by introducing dynamic oracles as well as a new penalty for the segmentation loss, based on the current tree depth and the number of EDUs in the input. The latter two methods also outperformed traditional non-neural methods. More recently, neural RST parsing models with transformer-based pre-trained language models, such as SpanBERT and XLNet, have been proposed. These studies reported that performance was boosted by XLNet, yielding the following current best scores: 76.3, 65.5, 55.6, and 53.8 for the unlabeled, nuclearity-, relation-, and fully labeled span F1 scores, respectively.
As another approach, Kobayashi et al. (2021) proposed a pre-training and fine-tuning framework for RST parsing. They obtained silver data from automatically parsed large-scale data and used it to pre-train their models. Then, they fine-tuned the models with gold data. Table 1 summarizes the previous best non-neural parser and recent end-to-end neural RST parsers whose performance can be considered state-of-the-art. We can see that the RST parsing models with transformer-based language models outperformed the other models regardless of the parsing strategy. The performance improvements are remarkable, especially for the relation-labeled and fully labeled span F1 scores.

End-to-end Neural RST Parsing
We employed the span-based parser (Kobayashi et al., 2020; Koto et al., 2021) for the top-down parsing strategy and the shift-reduce transition parser, an end-to-end variant of Wang et al.'s (2017) parser, for the bottom-up parsing strategy. These parsers were chosen for their simple architecture and their openly available code. Overviews of the parsers are shown in Figs. 2 and 3. Both parsers basically consist of simple Feed-Forward Networks (FFNs) and BERT-based embeddings. In this study, we used a two-layer perceptron with the GELU activation function and a dropout layer.

Vector Representations for Text Spans
Before describing the parsing models, we explain how to obtain a vector representation for an arbitrary text span using BERT-based language models. Our procedure for obtaining the vector representation is a simplified variant of Guz and Carenini's (2020) method.
First, we transform a document into a sequence of subwords, {t_1, t_2, ..., t_n}. Then, we obtain a vector representation for each subword in the sequence, {w_1, w_2, ..., w_n}, using a language model. The vector representation of a text span u_{i:j}, consisting of the i-th through the j-th EDU, is obtained by averaging the vectors of both edge subwords, u_{i:j} = (w_{b(i)} + w_{e(j)})/2, where b(i) returns the index of the leftmost subword in the i-th EDU and e(j) returns that of the rightmost subword in the j-th EDU. A document longer than the maximum input length of BERT (512 subwords) is embedded with sliding windows with a 30-subword overlap.
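As a concrete illustration, the edge-subword averaging above can be sketched as follows. This is a minimal sketch rather than the authors' implementation; the names `edu_start` and `edu_end` (our stand-ins for b(i) and e(j)) are hypothetical.

```python
# Sketch of u_{i:j} = (w_{b(i)} + w_{e(j)}) / 2 over plain Python lists.
# w:          one vector per subword, produced by a language model
# edu_start:  index of the leftmost subword of each EDU  (plays b)
# edu_end:    index of the rightmost subword of each EDU (plays e)
from typing import List

def span_vector(w: List[List[float]], edu_start: List[int],
                edu_end: List[int], i: int, j: int) -> List[float]:
    """Average the leftmost subword vector of EDU i and the
    rightmost subword vector of EDU j (EDUs are 0-indexed here)."""
    left = w[edu_start[i]]    # w_{b(i)}
    right = w[edu_end[j]]     # w_{e(j)}
    return [(a + b) / 2 for a, b in zip(left, right)]
```

In practice the subword vectors would come from a transformer encoder applied with the sliding windows described above; the averaging step itself is unchanged.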

Top-down Parsing
Since the minimal span-based parser does not require any additional module such as a decoder, unlike the pointer-network-based top-down parsers, it is suitable for comparison with a bottom-up shift-reduce parser, which consists of three simple FFNs, as we describe in Section 3.3. This top-down parser splits each span into smaller ones recursively until the span becomes a single EDU. We modified the code (https://github.com/nttcslab-nlp/Top-Down-RST-Parser) to utilize transformer-based embeddings and simplified it by excluding organizational features that represent sentence and paragraph boundary information. Following Koto et al. (2021), we introduced a biaffine layer for span splitting and a loss penalty. For each position k in a span consisting of the i-th through the j-th EDU, a scoring function, s_split(i, j, k), is defined as follows:

s_split(i, j, k) = h_{i:k}^T W h_{k+1:j} + v_left · h_{i:k} + v_right · h_{k+1:j},  (1)

where W is a weight matrix and v_left, v_right are weight vectors corresponding to the left and right spans, respectively. Here, h_{i:k} and h_{k+1:j} are defined as follows:

h_{i:k} = FFN_left(u_{i:k}),  (2)
h_{k+1:j} = FFN_right(u_{k+1:j}).  (3)

Then, the span is split at the position k̂ that maximizes Eq. (1):

k̂ = argmax_{i ≤ k < j} s_split(i, j, k).  (4)

When splitting a span at position k, the score of the nuclearity status and relation labels for the two spans is defined as follows:

s_label(i, k, j, ℓ) = h_{i:k}^T W_ℓ h_{k+1:j} + v_left^ℓ · h_{i:k} + v_right^ℓ · h_{k+1:j},  (5)

where W_ℓ is a weight matrix for a specific label ℓ, and v_left^ℓ and v_right^ℓ are weight vectors corresponding to the left and right spans for the label, respectively. While the correct split position in the training data is used for k̂ at training time, the position predicted with Eq. (4) is used at test time.
Then, the label ℓ̂ that maximizes Eq. (5) is assigned to the spans:

ℓ̂ = argmax_{ℓ ∈ L} s_label(i, k̂, j, ℓ),

where L denotes the set of valid nuclearity status combinations, {N-S, S-N, N-N}, when predicting the nuclearity and the set of relation labels, {Elaboration, Condition, ...}, when predicting the relation. Note that the weight parameters W_ℓ and the FFNs for nuclearity and relation labeling are learned separately.
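The recursive top-down procedure can be sketched as follows. This is a hedged simplification, not the released parser: `split_score` and `label` stand in for the biaffine scorers described above, and the tuple-based tree format is our own.

```python
# Sketch of top-down parsing: recursively split span (i, j) at the
# position k with the highest split score, label the split, and recurse
# on both halves until each span is a single EDU.
def parse_top_down(i, j, split_score, label):
    if i == j:                        # a single EDU: stop recursing
        return ("edu", i)
    # choose the split position maximizing the (biaffine) split score
    k = max(range(i, j), key=lambda k: split_score(i, j, k))
    nuc_rel = label(i, k, j)          # nuclearity/relation for this split
    return ("node", nuc_rel,
            parse_top_down(i, k, split_score, label),
            parse_top_down(k + 1, j, split_score, label))
```

With gold split positions substituted for the argmax, the same recursion yields the training-time behavior described above.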

Bottom-up Parsing
Formally, in shift-reduce parsing, a parsing state is denoted as a tuple (S, Q), where S is a stack and Q is a queue that contains incoming EDUs. Each element in S can be a completed constituent or a terminal. At each step, the parser chooses one of the following actions with a neural classifier and updates the state:
• SHIFT: pop the first EDU off the queue and push it onto the stack.
• REDUCE: pop two elements from the stack and push a new constituent that has the popped subtrees as its children onto the stack as a single composite item.
In the REDUCE action, nuclearity status and relation labels are predicted by different neural classifiers. That is, RST trees are built in three stages: first, unlabeled trees are built, and then nuclearity status and relation labels are assigned independently. Note that previous shift-reduce parsers were based on a two-stage approach: they first build nuclearity-labeled RST trees and then assign relation labels to the trees. To fairly compare the top-down and bottom-up approaches, we employed the three-stage approach in both top-down and bottom-up parsing. Our experiments demonstrated that there is no significant difference between the performances of the two-stage and three-stage approaches.
Our bottom-up parser has three classifiers, FFN_act, FFN_nuc, and FFN_rel, for predicting the action, nuclearity, and rhetorical relation, respectively. They differ only in their output dimension, which corresponds to the number of classes: the output dimension of the action classifier is two (shift or reduce), that of the nuclearity classifier is three (N-S, N-N, S-N), and that of the relation classifier is the number of rhetorical relations in the dataset.
The classifiers score their classes as follows:

s_act = FFN_act(Concat(u_{s0}, u_{s1}, u_{q0})),
s_nuc = FFN_nuc(Concat(u_{s0}, u_{s1}, u_{q0})),
s_rel = FFN_rel(Concat(u_{s0}, u_{s1}, u_{q0})),

where the function Concat concatenates the vectors received as arguments, u_{s0} is the vector representation of the text span stored in the first position of the stack S, u_{s1} is that in the second position of S, and u_{q0} is that in the first position of the queue Q. The weights of each FFN and the language model used for span embeddings are trained by optimizing the cross-entropy loss of s_act, s_nuc, and s_rel.
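The three-stage shift-reduce loop can be sketched as follows. This is our own simplification, not the released parser: `act`, `nuc`, and `rel` stand in for FFN_act, FFN_nuc, and FFN_rel applied to the concatenated span vectors, and the tuple-based tree format is hypothetical.

```python
# Sketch of bottom-up shift-reduce parsing with separate classifiers for
# action, nuclearity, and relation. SHIFT moves an EDU from the queue to
# the stack; REDUCE merges the top two stack items into a labeled node.
from collections import deque

def parse_bottom_up(edus, act, nuc, rel):
    stack, queue = [], deque(edus)
    while queue or len(stack) > 1:
        # REDUCE is only legal with >= 2 stack items; SHIFT needs a queue.
        if len(stack) >= 2 and (not queue or act(stack, queue) == "reduce"):
            right, left = stack.pop(), stack.pop()
            stack.append(("node", nuc(left, right), rel(left, right),
                          left, right))
        else:
            stack.append(("edu", queue.popleft()))
    return stack[0]           # the completed constituent tree
```

In the real parser the `act`/`nuc`/`rel` decisions are made by FFNs over Concat(u_{s0}, u_{s1}, u_{q0}); here they are plain callables so the control flow is visible.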

Pre-trained Language Models
Since most transformer-based pre-trained language models originated from BERT, we employed BERT and four of its variants as language models to obtain vector representations for text spans. Table 2 shows the size of the dataset and the tasks used for their pre-training.

BERT: is trained with two tasks: (1) a masked language model (MLM), where 15% of the tokens in the training data are randomly masked and the model is trained to predict the masked tokens, and (2) next sentence prediction (NSP), where the model is trained to correctly predict the sentence following a given sentence. BERT is trained on BookCorpus and English Wikipedia.

RoBERTa: is trained longer, with larger batches, over more data and longer sequences. It also removes NSP and dynamically changes the masking pattern applied to the training data. In addition to the dataset used for training BERT, its training data include CC-News, OpenWebText, and Stories.

XLNet: is a generalized autoregressive pre-trained language model, known as a permuted language model (PLM). It is trained by maximizing the expected likelihood over all permutations of the factorization order of the input text to approximately consider bidirectional context. In addition to the dataset used for training BERT, its training data include Giga5, ClueWeb, and Common Crawl.
SpanBERT: is trained with (1) a masked language model over random spans (contiguous tokens) and (2) span boundary token prediction within the masked span. Unlike the original BERT, SpanBERT does not include the NSP task. The dataset used for training is the same as that for BERT.

DeBERTa: is a modified variant of RoBERTa. It uses disentangled attention and an enhanced mask decoder. During training, it masks spans randomly, as SpanBERT does. While its training dataset is a subset of that used for RoBERTa, it performs consistently better on various NLP tasks.

Datasets
We used the RST-DT and Instr-DT to evaluate the performance of the parsers. RST-DT contains 385 documents selected from the Wall Street Journal. It is officially divided into 347 documents as the training dataset and 38 documents as the test dataset. The number of rhetorical relation labels utilized in the dataset is 18. Since there is no official development dataset, we used 40 documents of the training dataset as the development dataset by following a previous study (Heilman and Sagae, 2015). Instr-DT contains 176 documents of the home-repair instruction domain. The number of rhetorical relation labels in the dataset is 39. Since there are no official development and test datasets, we used 126, 25, and 25 documents for training, development, and test datasets, respectively. We used gold EDU segmentation for both datasets by following conventional studies.

Evaluation Metrics
As in previous studies, we transformed the RST trees into right-heavy binary trees (Sagae and Lavie, 2005) and evaluated the system results with micro-averaged F1 scores of Span, Nuclearity, Relation, and Full, based on Standard-Parseval (Morey et al., 2017). Span, Nuclearity, Relation, and Full were used to evaluate unlabeled, nuclearity-, relation-, and fully labeled tree structures, respectively. Since neural models rely heavily on their initialization, we report average scores and standard deviations over three runs with different seeds.
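A toy version of micro-averaged F1 over labeled spans can be sketched as follows. This is an illustrative simplification of Standard-Parseval-style scoring, assuming each tree has already been reduced to a set of (left, right, label) tuples; the real metric additionally fixes the span extraction and labeling conventions.

```python
# Micro-averaged F1 over labeled spans: counts are pooled over all trees
# in the test set before computing precision/recall, rather than
# averaging per-tree scores.
def micro_f1(gold_trees, pred_trees):
    matched = gold_total = pred_total = 0
    for gold, pred in zip(gold_trees, pred_trees):
        matched += len(gold & pred)   # spans with identical boundaries+label
        gold_total += len(gold)
        pred_total += len(pred)
    p = matched / pred_total
    r = matched / gold_total
    return 2 * p * r / (p + r) if p + r else 0.0
```

Note that with gold EDU segmentation and binary trees, gold and predicted trees contain the same number of spans, so precision and recall coincide.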

Training Configurations
We implemented all models with PyTorch (Paszke et al., 2019) and PyTorch Lightning (Falcon et al., 2019) and used language models from Transformers (Wolf et al., 2020). We used base models, such as BERT-base-cased and RoBERTa-base, for all the experiments. The dimension of the hidden layers in the FFNs was set to 512, and the dropout rate was set to 0.2. We employed span-based batches rather than document-based batches; the mini-batch size is 5 spans/actions. We optimized all models with the AdamW optimizer (Loshchilov and Hutter, 2017). We used a learning rate of 1e-5 for the language models and 1e-5/2e-4 for the other parameters, such as the FFN and biaffine layers. We scheduled the learning rate with linear warm-up, which increases the learning rate linearly during the first epoch and then decreases it linearly to 0 until the final epoch. We trained each model for up to 20 epochs and applied early stopping with a patience of 5 by monitoring the fully labeled span F1 score on the development dataset. Details of other hyperparameters are in Appendix A.


Results

Table 3 shows the results with different pre-trained language models. The scores on RST-DT are better than those on Instr-DT. This is attributed to the sizes of the datasets: Instr-DT is significantly smaller than RST-DT while its number of rhetorical relations is larger. In fact, the standard deviations on Instr-DT are larger than those on RST-DT. However, the tendencies of the experimental results on both datasets are similar.
The results of paired bootstrap resampling tests between the top-down and bottom-up parsers while fixing the language model show that significant differences are found only in Span and Nuc. for XLNet on RST-DT and in Rel. and Full for SpanBERT on Instr-DT.
On the other hand, the performance of the parsers varies widely depending on the language model when fixing the parsing strategy. To investigate the significant differences among the parsers, we performed multiple comparison tests based on paired bootstrap resampling while controlling the false discovery rate (Benjamini and Hochberg, 1995). (In Table 3, ♠ indicates significantly better than BERT, XLNet, and SpanBERT; ♥ indicates significantly better than BERT, RoBERTa, and SpanBERT; ♦ indicates significantly better than BERT and SpanBERT; ♣ indicates significantly better than BERT; a further marker indicates significantly better than any model except DeBERTa.)

Table 4: Results for intra-sentential parsing with various language models (RST-Parseval).

The bottom-up parser with DeBERTa outperformed the current best parser (Zhang et al., 2021) in Span, Nuc., Rel., and Full by 1.5, 2.5, 1.7, and 1.6 points, respectively. Furthermore, most of the parsers yield performance comparable to the current state-of-the-art parsers. We believe that these results have a significant impact on the RST parsing community. Since we built our baseline parsers on a simple architecture, as described in Section 3, more reliable experiments can be conducted on top of them to reveal the effectiveness of newly proposed methods, without any concern regarding the choice of pre-trained language models or parsing strategies.
While the evaluation results demonstrate that we successfully built baseline parsers, they also raise the following questions: (1) Why did DeBERTa, trained with a half-size dataset (85GB), outperform RoBERTa, trained with the most extensive dataset (161GB)? (2) Why did SpanBERT consistently outperform BERT with significant differences, even though they are trained with the same dataset (16GB)? It is well known that pre-trained language models trained with larger-scale datasets boost performance (Kaplan et al., 2020); however, the above results do not necessarily agree with this finding.
We believe that these results may be due to the span-masking scheme, a feature common to SpanBERT and DeBERTa. With the span-masking scheme, randomly generated spans consisting of a sequence of tokens with lengths of up to 5 (for SpanBERT) or 3 (for DeBERTa) are masked, and the language models are trained to predict the span boundary tokens in the mask. That is, the span-masking scheme is considered more context-sensitive than the token-masking scheme. Thus, pre-trained language models with a span-masking scheme are suitable for obtaining vector representations for long text spans consisting of EDUs.
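For intuition, span masking can be illustrated as follows. This is a deliberately simplified, hypothetical sketch: real SpanBERT samples span lengths from a geometric distribution and masks at the whole-word level, which we omit here.

```python
# Toy span masking: repeatedly sample a contiguous span of up to
# `max_len` tokens and mask it, until roughly `ratio` of the tokens
# have been covered. Spans may overlap in this simplified version.
import random

def span_mask(tokens, max_len=3, ratio=0.15, seed=0):
    rng = random.Random(seed)
    masked = list(tokens)
    budget = max(1, int(len(tokens) * ratio))   # tokens left to mask
    while budget > 0:
        length = rng.randint(1, min(max_len, budget))
        start = rng.randrange(0, len(tokens) - length + 1)
        for pos in range(start, start + length):
            masked[pos] = "[MASK]"
        budget -= length
    return masked
```

The key contrast with token masking is that entire contiguous stretches are hidden, forcing the model to reconstruct multi-token content from its boundaries, which is the property we conjecture helps span representations.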
Table 5: Results for multi-sentential parsing with various language models (RST-Parseval).

To discuss the impact of the span-masking scheme in more detail, we evaluated our parsers in terms of intra- and multi-sentential parsing performance. Tables 4 and 5 show the results. From Table 4, we can see that the tendency of the results is quite different from that in Table 3; the differences among the parsers are smaller than those in Table 3. In particular, the differences among the four methods other than BERT are within 1 point for Full on RST-DT. Other noteworthy points are that the scores of BERT are close to those of the other methods and that DeBERTa often did not achieve the best scores. In contrast, Table 5 emphasizes the effectiveness of DeBERTa and SpanBERT: DeBERTa outperformed the other methods by large margins, and the differences between SpanBERT and BERT became larger. To obtain better results in multi-sentential parsing, we need good representations for longer text spans over sentences. Thus, we believe that the span-masking scheme helps generate better representations for longer text spans. The results also reveal that there is much more room for improvement than in intra-sentential parsing. Finally, we show another piece of evidence for the effectiveness of the span-masking scheme in Table 6, which shows the performance of nuclearity prediction among N-S, S-N, and N-N. N-N relations originally occur in n-ary (n > 2) trees in many cases. Therefore, we need good representations for longer text spans to detect N-N relations accurately. From the table, we can see that DeBERTa achieved the best scores for N-N with large gains: over 2 points on RST-DT and over 3 points on Instr-DT. Furthermore, SpanBERT is sometimes comparable to XLNet and RoBERTa in this table.

These results again indicate that the span-masking scheme is effective in obtaining good representations for longer text spans. The impact of the span-masking scheme may lead to novel research perspectives toward RST parsing-specific language models.

Conclusion
This paper explored ways to build strong baselines for RST parsing, based on existing top-down and bottom-up parsers, while varying the use of five transformer-based pre-trained language models: BERT, RoBERTa, XLNet, SpanBERT, and DeBERTa. We employed a span-based model as the top-down parser and a shift-reduce model as the bottom-up parser. The experimental results obtained from RST-DT and Instr-DT revealed that the language models, except for BERT, boost RST parsing performance under both strategies. The DeBERTa-based bottom-up parser achieved the best scores, in particular a fully labeled span F1 score of 55.4 on RST-DT. Furthermore, our experimental results implied that language models with a span-masking scheme, such as SpanBERT and DeBERTa, are suitable for RST parsing, since they would generate better representations for long text spans than those with a token-masking scheme.

Limitations
As shown in the experimental results, our approach would not perform well with insufficient training data. For example, the performance obtained on Instr-DT was inferior to that on RST-DT in Rel. and Full. These results were caused by the small amount of training data and the large number of rhetorical relations. Since the annotation costs for RST are considerable, how to obtain enough high-quality data is a significant issue for building RST parsers for new domains and languages. Furthermore, since our parsers rely on large-scale pre-trained language models, they do not perform well for languages for which such models are not available, such as low-resource languages. In future work, we should improve domain and language portability.


A Hyperparameters

Table 7 shows the hyperparameters utilized in our experiments.

B Evaluation in Intra- and Multi-sentential Parsing
Because human annotators sometimes build RST trees while disregarding sentence boundaries, some RST trees have span boundaries that disagree with the span boundaries of sentences. Figure 4 shows an example. In the gold tree, the subtree consisting of e_3 and e_4 straddles the boundary between s_1 and s_2. Parsers also sometimes disregard sentence boundaries when building RST trees. In the predicted tree, the subtree consisting of e_11 and e_12 straddles the boundary between s_4 and s_5. Therefore, we make a best effort to find sentences in the parse trees. When evaluating a predicted tree in terms of intra-sentential parsing, we extract subtrees whose root nodes correspond to sentences. In the example, we extract the subtrees corresponding to s_3, s_4, and s_5 from the gold tree and to s_1, s_2, and s_3 from the predicted tree. However, in this case, s_1 and s_2 are ignored even though they form valid subtrees in the predicted tree. We therefore give the gold tree a unary tree whose leaf node is a sentence for s_1 and s_2. Similarly, we give the predicted tree a unary tree for s_4 and s_5 (see the middle row in Figure 4). As a result, the leaf nodes of a gold RST tree do not necessarily have a one-to-one correspondence with those of a predicted tree. Thus, we apply RST-Parseval to evaluate predicted trees in terms of intra-sentential parsing.
When evaluating a predicted tree in terms of multi-sentential parsing, we replace subtrees corresponding to sentences with leaf nodes. In the example, the subtrees dominating e_7 to e_9, e_10 to e_11, and e_12 to e_13 in the gold tree are replaced with the leaf nodes s_3, s_4, and s_5, respectively. Since the gold RST tree does not have valid subtrees dominating e_1 to e_3 and e_4 to e_6, we do not replace them with s_1 and s_2; that is, subtrees that cannot be converted into leaf nodes as sentences are left as they are. Similarly, the subtrees dominating e_1 to e_3 and e_4 to e_6 in the predicted tree are replaced with the leaf nodes s_1 and s_2, respectively. We also do not replace e_10 to e_11 and e_12 to e_13 with s_4 and s_5 (see the bottom row in Figure 4). This transformation may break the one-to-one correspondence between the leaf nodes of the gold and predicted RST trees. Thus, we also apply RST-Parseval to evaluate predicted trees in terms of multi-sentential parsing.

Figure 4: Example of decomposing an RST tree into intra- and multi-sentential trees.