Top-down Discourse Parsing via Sequence Labelling

We introduce a top-down approach to discourse parsing that is conceptually simpler than its predecessors (Kobayashi et al., 2020; Zhang et al., 2020). By framing the task as a sequence labelling problem where the goal is to iteratively segment a document into individual discourse units, we are able to eliminate the decoder and reduce the search space for splitting points. We explore both traditional recurrent models and modern pre-trained transformer models for the task, and additionally introduce a novel dynamic oracle for top-down parsing. Based on the Full metric, our proposed LSTM model sets a new state-of-the-art for RST parsing.


Introduction
Discourse analysis involves modelling the structure of text in a document. It provides a systematic way to understand how texts are segmented hierarchically into discourse units, and the relationships between them. Unlike syntactic parsing, which models the relationships between words in a sentence, discourse parsing operates at the document level, and aims to explain the flow of writing. Studies have found that discourse parsing is beneficial for downstream NLP tasks including document-level sentiment analysis (Bhatia et al., 2015) and abstractive summarization (Koto et al., 2019).
Rhetorical Structure Theory (RST; Mann and Thompson (1988)) is one of the most widely used discourse theories in NLP (Hernault et al., 2010; Feng and Hirst, 2014; Ji and Eisenstein, 2014; Li et al., 2016). RST organizes text spans into a tree, where the leaves represent the basic units of discourse, known as elementary discourse units (EDUs). EDUs are typically clauses of a sentence. 1 Non-terminal nodes in the tree represent discourse unit relations.

1 Code and trained models: https://github.com/fajri91/NeuralRST-TopDown

[Figure 1: An example discourse tree, from the RST Discourse Treebank (elab = elaboration). EDU-1: Roy E. Parrott, the company's president and chief operating officer since Sept. 1, was named to its board. EDU-2: The appointment increased the number of directors to 10, EDU-3: three of whom are company employees. EDU-4: Simpson is an auto parts maker.]
In Figure 1, we present an example RST tree with four EDUs spanning two sentences. In this discourse tree, EDUs are hierarchically connected with arrows and the discourse label elab. The direction of an arrow indicates the nuclearity of a relation, wherein a "satellite" points to its "nucleus". The satellite unit is a supporting unit for the nucleus unit and contains less prominent information. It is standard practice for the RST tree to be trained and evaluated in right-heavy binarized form, resulting in three forms of binary nuclearity relationship between EDUs: Nucleus-Satellite, Satellite-Nucleus, and Nucleus-Nucleus. In this work, eighteen coarse-grained relations are considered as discourse labels, consistent with earlier work. 2

Work on RST parsing has been dominated by the bottom-up paradigm (Hernault et al., 2010; Feng and Hirst, 2014; Ji and Eisenstein, 2014; Braud et al., 2017; Morey et al., 2017). These methods produce very competitive benchmarks, but in practice are not straightforward (e.g. transition-based parsers with action prediction steps). Furthermore, bottom-up parsing limits tree construction to local information, and macro context such as global structure/topic is prone to being under-utilized. As a result, there has recently been a move towards top-down approaches (Kobayashi et al., 2020; Zhang et al., 2020).

[Figure 2: Comparison of our top-down models with Zhang et al. (2020) and Kobayashi et al. (2020).]
The general idea behind top-down parsing is to find a splitting point in each iteration of tree construction. In Figure 2, we illustrate how our architecture differs from Zhang et al. (2020) and Kobayashi et al. (2020). First, Zhang et al. (2020) utilize a four-level encoder comprising three Bi-GRUs and one CNN layer. The splitting mechanism is applied through a decoder, a stack, and bi-affine attention mechanisms. Kobayashi et al. (2020) use gold paragraph and sentence boundaries to aggregate a representation for each unit, and generate the tree based on these granularities. Two Bi-LSTMs are used, with splitting points determined by exhaustively calculating the bi-affine score of each possible split. The use of paragraph boundaries can explicitly lower the difficulty of the task, as 77% of paragraphs in the English RST Discourse Treebank ("RST-DT") are actually text spans (Carlson et al., 2001). These boundaries are closely related to gold span boundaries in evaluation.
In this paper, we propose a conceptually simpler top-down approach for RST parsing. The core idea is to frame the problem as a sequence labelling task, where the goal is to iteratively find a segmentation boundary to split a sequence of discourse units into two sub-sequences of discourse units. This way, we are able to simplify the architecture, in eliminating the decoder as well as reducing the search space for splitting points. Specifically, we use an LSTM (Hochreiter and Schmidhuber, 1997) or pre-trained BERT (Devlin et al., 2019) as the segmenter, enhanced in a number of key ways.
Our primary contributions are as follows: (1) we propose a novel top-down approach to RST parsing based on sequence labelling; (2) we explore both traditional sequence models such as LSTMs and also modern pre-trained encoders such as BERT; (3) we demonstrate that adding a weighting mechanism during the splitting of EDU sequences improves performance; and (4) we propose a novel dynamic oracle for training top-down discourse parsers.

Related Work
Previous work on RST parsing has been dominated by bottom-up approaches (Hernault et al., 2010;Joty et al., 2013;Li et al., 2016;Braud et al., 2017;Wang et al., 2017). For example, Ji and Eisenstein (2014) introduce DPLP, a transition-based parser based on an SVM with representation learning, combined with some heuristic features. Braud et al. (2016) propose joint text segment representation learning for predicting RST discourse trees using a hierarchical Bi-LSTM. Elsewhere,  showed that implicit syntax features extracted from a dependency parser (Dozat and Manning, 2017) are highly effective for discourse parsing.
Top-down parsing is well established for constituency parsing and language modelling (Johnson, 1995; Roark and Johnson, 1999; Roark, 2001; Frost et al., 2007), but relatively new to discourse parsing. Lin et al. (2019) propose a unified framework based on pointer networks for sentence-level discourse parsing, while  employ hierarchical pointer network parsers. Morey et al. (2017) found that most previous studies on RST discourse tree parsing were incorrectly benchmarked, e.g. one study uses macro-averaging while another uses micro-averaging. 3 They also advocate for evaluation based on micro-averaged F-1 scores over labelled attachment decisions (à la the original Parseval).
Pre-trained language models (Radford et al., 2018;Devlin et al., 2019) have been shown to benefit a multitude of NLP tasks, including discourse analysis. For example, BERT models have been used for classifying discourse markers (Sileo et al.,   2019) and discourse relations (Nie et al., 2019;Shi and Demberg, 2019). To the best of our knowledge, however, pre-trained models have not been applied in the generation of full discourse trees, which we address here by experimenting with BERT for topdown RST parsing.

Top-down RST Parsing
We frame RST parsing as a sequence labelling task, where given a sequence of input EDUs, the goal is to find a segmentation boundary to split the sequence into two sub-sequences. This is realized by training a sequence labelling model to predict a binary label for each EDU, and select the EDU with the highest probability to be the segmentation point. After the sequence is segmented, we repeat the same process for the two sub-sequences in a divide-and-conquer fashion, until all sequences are segmented into individual units, producing the binary RST tree (e.g. Figure 1).
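The divide-and-conquer procedure above can be sketched as follows. The `score_split` function is a hypothetical stand-in for the trained segmenter, returning one probability per candidate split EDU in the span; the paper maintains a queue of spans, while this sketch uses equivalent recursion over index spans:

```python
def build_tree(edus, score_split):
    """Top-down parse: repeatedly split an EDU index span at the
    highest-probability boundary until every span is a single EDU.

    score_split(m, n) returns one score per EDU in [m, n-1]; splitting
    at EDU k separates the span into [m, k] and [k+1, n].
    Returns the binary tree as nested (left, right) tuples of EDU indices.
    """
    def parse(m, n):
        if m == n:
            return m                       # leaf: a single EDU
        probs = score_split(m, n)          # one score per candidate split
        k = m + max(range(n - m), key=lambda i: probs[i])
        return (parse(m, k), parse(k + 1, n))

    return parse(0, len(edus) - 1)
```

With a scorer that always prefers the leftmost boundary, a 3-EDU document yields a fully right-branching tree, mirroring the right-heavy binarization convention.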

LSTM Model
As illustrated in Figure 3, our LSTM parser consists of two main blocks: an encoder and a segmenter. For the encoder, we follow  in using two LSTMs (Bi-LSTM 1 and Bi-LSTM 2) to produce EDU encodings by processing: (1) x_i, the concatenation of word embedding w_i and POS tag embedding p_i; and (2) syntax embedding s_i, the output of the MLP layer of the bi-affine dependency parser (Dozat and Manning, 2017). Similar to , we then take the average of the output states of both LSTMs over the EDU, and concatenate it with an EDU type embedding t_{E_j} (which distinguishes the last EDU in a paragraph from other EDUs) to produce the final encoding, where E_j is an EDU, p is the number of words in E_j, and ⊕ denotes the concatenation operator. t_{E_j} is effectively an implicit paragraph boundary feature, and provides a fair benchmark against previous models. In Section 4.3, we also show results without paragraph boundary features. As each EDU is processed independently, we use another LSTM (Bi-LSTM 3) to capture inter-EDU relationships and obtain a contextualized representation h_{E_j}, where q is the number of EDUs in the document. Note that h_{E_j} is the final encoder output (see Figure 3) and is only computed once per document.
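The encoder equations themselves are missing from the extracted text. A plausible reconstruction consistent with the surrounding prose is given below; the per-word output states h^{(1)}_i and h^{(2)}_i of Bi-LSTM 1 and Bi-LSTM 2 are our own notation:

```latex
\bar{h}_{E_j} = \left(\frac{1}{p}\sum_{i=1}^{p} h^{(1)}_{i}\right)
  \oplus \left(\frac{1}{p}\sum_{i=1}^{p} h^{(2)}_{i}\right)
  \oplus t_{E_j}
\qquad
h_{E_1},\dots,h_{E_q} = \mathrm{Bi\text{-}LSTM}_3\!\left(\bar{h}_{E_1},\dots,\bar{h}_{E_q}\right)
```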
The second part is the segmenter. We frame segmentation as a sequence labelling problem with y_{E_j} ∈ {0, 1}, where 1 denotes the splitting point, and 0 a non-splitting point. For each EDU sequence there is exactly one EDU labelled 1, and we start from the full EDU sequence (the whole document) and iteratively perform segmentation until we are left with individual EDUs. We use a queue to store the two EDU sub-sequences resulting from each segmentation step. In total, there are q − 1 iterations of segmentation (recall that q is the total number of EDUs in the document).
As segmentation is done iteratively in a divide-and-conquer fashion, h_{E_j} serves as the input to the segmenter, which takes a (sub)sequence of EDUs and predicts the segmentation position, where m and n are the starting and ending indices of the EDU sequence, 4 and ỹ_{E_j} gives the probability of a segmentation. In preliminary experiments we found it important to have this additional Bi-LSTM (Bi-LSTM 4) for predicting the segmentation point of an EDU sub-sequence.

Transformer Model
Adapting BERT to discourse parsing is not trivial due to the limited number of input tokens it takes (typically 512 tokens), which is often too short for documents. Moreover, BERT is designed to encode sentences (and only two at maximum), where in our case we want to encode sequences of EDUs that span multiple sentences.
In our case, EDU truncation is not an option (since that would produce an incomplete RST tree), and the average number of words per document in our data is 521 (741 word pieces after BERT tokenization), which is much larger than the 512 limit. We therefore break the document into a number of partial documents, each consisting of multiple sentences that fit within the 512-token limit. This way, we allow the model to capture fine-grained word-to-word relationships across (most) EDUs. Each partial document is then processed using the trick of Liu and Lapata (2019), where an alternating even/odd segment embedding encodes all the EDUs in a document.
We illustrate this approach in Figure 4. First, all EDUs are formatted to start with [CLS] and end with [SEP], and words are tokenized using WordPiece. If the document has more than 512 tokens, we break it into multiple partial documents based on EDU boundaries, and pad accordingly (e.g. in Figure 4 we break the example document of 3 EDUs into 2 partial documents), and process each partial document independently with BERT.
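The chunking step can be sketched as below. The `edu_token_counts` input and the greedy packing strategy are our own simplifications of the procedure described above (breaking only at EDU boundaries so each partial document fits the encoder's limit):

```python
def make_partial_documents(edu_token_counts, limit=512):
    """Break a document into partial documents along EDU boundaries so
    that each part fits the encoder's token limit (512 for bert-base).

    edu_token_counts holds the word-piece count of each EDU, including
    its [CLS] and [SEP] markers. Returns a list of parts, each a list
    of EDU indices. A greedy sketch, not the authors' exact procedure.
    """
    parts, current, size = [], [], 0
    for idx, n_tokens in enumerate(edu_token_counts):
        # Start a new partial document when the next EDU would overflow.
        if current and size + n_tokens > limit:
            parts.append(current)
            current, size = [], 0
        current.append(idx)
        size += n_tokens
    if current:
        parts.append(current)
    return parts
```

Each part is then padded and encoded independently, as in Figure 4.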
We also experimented with an alternative: first encoding each EDU independently with BERT, and then using a second inter-EDU transformer to capture the relationships between EDUs. Preliminary experiments, however, suggested that this approach produces sub-optimal performance.
In Figure 4 each token is assigned three kinds of embeddings: (1) word, (2) segment, and (3) position. The input vector is computed by summing these three embeddings, and is fed into BERT (initialized with bert-base). The output of BERT gives us a contextualized embedding for each token, and we use the [CLS] embedding as the encoding for each EDU (g_{E_j}).

4 In the first iteration, m = 1 and n = q (the number of EDUs in the document).

[Figure 4: Architecture of the transformer model. In practice, one row of input can have more than two EDUs.]
Unlike the LSTM model, we do not incorporate syntax embeddings into the transformer model, as we found no empirical benefit (see Section 4.3). This observation is in line with other studies (e.g. Jawahar et al. (2019)) that have found BERT to implicitly encode syntactic knowledge.
For the segmenter we use a second transformer (initialized with random weights) to capture the inter-EDU relationships for sub-sequences of EDUs during iterative segmentation, where ỹ_{E_j} gives the probability of a segmentation, and h_{E_j} is the concatenation of the BERT output (g_{E_j}) and the EDU type embedding (t_{E_j}).

Nuclearity and Discourse Relation Prediction
In Figure 5, we give an example of the iterative segmentation process to construct the RST tree. In each iteration, we pop a sequence from the queue (initialized with the original sequence of EDUs in  the document) and compute the segmentation label for each EDU using an LSTM (Section 3.1) or transformer (Section 3.2). After the sequence is segmented (using the ground truth label during training, or the highest-probability label at test time), we push to the queue the two sub-sequences (if they contain at least two EDUs) and repeat this process until the queue is empty.
In addition to segmentation, we also need to predict the nuclearity relationship (3 classes) and the discourse label (18 classes) for the segmented pairs. To that end, we average the EDU encodings of the two segments, and feed them to an MLP layer to predict the nuclearity and discourse labels, where ind is the index of the segmentation point (given by the ground truth during training, or the argmax of the segmentation probabilities ỹ_{E_j} at test time), and z_{nuc+dis} gives the joint probability distribution over the nuclearity and discourse classes. 5
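As a small illustration of the joint prediction target, a (nuclearity, relation) pair can be flattened into a single class index for one softmax. The label inventories below are truncated for brevity, and this particular encoding scheme is our assumption (the model may restrict the output space to attested nuclearity-relation combinations):

```python
NUCLEARITY = ["NS", "SN", "NN"]                      # 3 nuclearity classes
RELATIONS = ["elab", "attr", "list", "back", "cause"]  # 5 of the 18 relations

def joint_label(nuc, rel):
    """Encode a (nuclearity, relation) pair as a single joint class index."""
    return NUCLEARITY.index(nuc) * len(RELATIONS) + RELATIONS.index(rel)

def decode_label(idx):
    """Invert joint_label: recover the (nuclearity, relation) pair."""
    return NUCLEARITY[idx // len(RELATIONS)], RELATIONS[idx % len(RELATIONS)]
```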

Segmentation Loss with Penalty
One drawback of the top-down approach is that segmentation errors incurred closer to the root can be detrimental, as the error propagates to all of the resulting sub-trees. To address this, we explore scaling the segmentation loss based on either the current tree depth or the number of EDUs in the input sequence. Preliminary experiments found that both approaches work, but that the latter is marginally better, so we present results using the latter.
Formally, the modified segmentation loss for an example (document) is given as follows, where y_{E_i} ∈ {0, 1} is the ground-truth segmentation label, L(E_{m:n}) is the cross-entropy loss for an EDU sequence, S is the set of all EDU sequences (based on ground-truth segmentation), and β is a scaling hyper-parameter.
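The loss equation itself is missing from the extracted text. A plausible reconstruction, assuming the penalty scales linearly with the number of EDUs in the sequence (the exact scaling function is our assumption), is:

```latex
\mathcal{L}_{\mathrm{seg}} = \sum_{E_{m:n} \in S} \bigl(1 + \beta\,(n - m)\bigr)\,\mathcal{L}(E_{m:n})
```

Under this form, a segmentation error on a long span near the root incurs a proportionally larger loss than one on a short span deep in the tree.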
To summarize, the total training loss of our model is a weighted combination of the segmentation loss (L_seg) and the nuclearity-discourse prediction loss (L_nuc+dis), with weights λ_1 and λ_2 respectively.

Dynamic Oracle
The standard training regimen for discourse parsing creates an exposure bias: the model only observes ground-truth decisions during training, and so may struggle to recover when it makes a mistake at test time. Goldberg and Nivre (2012) propose a dynamic oracle for transition-based dependency parsing to tackle this. The idea is to allow the model to use its own predictions during training (instead of ground-truth actions), and to introduce a dynamic oracle that finds the next best/optimal action sequence. It does so by comparing the current state of the constructed tree with the gold-standard tree, and aims to minimize the deviation. As the model is exposed to prediction errors at training time, it has a better chance of recovering from them at test time.
We explore a similar idea, and propose a dynamic oracle for our top-down discourse parser. A crucial question to ask when designing a dynamic oracle is: how can we compare the current state to the gold tree to obtain the next best series of actions when an error occurs during training? In transition-based parsing, Goldberg and Nivre (2012) compute a cost/loss for each transition by counting the gold arcs that are no longer reachable based on the action taken (e.g. SHIFT, REDUCE). We apply similar reasoning when finding the next best segmentation sequence in our dynamic oracle, which we illustrate below with an example.

Algorithm 1 (excerpt; the function header and queue initialization, lines 1-6, are not recoverable from the extracted text):

     7: while queue is not empty do
     8:     E_m:n = queue.pop()
     9:     id_gold, r_gold = match(E_m:n, O, R)
    10:     id_pred = predictSplit(E_m:n)
    11:     r_pred1 = predictLabel(E_m:n, id_gold)  # for loss
    12:     r_pred2 = predictLabel(E_m:n, id_pred)  # ignored
    13:     if random() > α then
    14:         L, R = separate(E_m:n, id_gold)
    15:     else
    16:         L, R = separate(E_m:n, id_pred)
    17:     end if
    18:     queue.push(L) if len(L) > 1
    19:     queue.push(R) if len(R) > 1
    20: end while
    21: end function
Say we have a document with 4 EDUs (E_{1:4}), and the gold tree given in Figure 6 (left). The correct sequence of segmentation is given by O_{1:4} = [2, 1, 3, −], which means we should first split at E_2 (creating E_{1:2} and E_{3:4}), then at E_1 (creating E_1, E_2, E_{3:4}), and lastly at E_3, producing E_1, E_2, E_3, E_4 as the leaves with the gold tree structure. We give the last EDU E_4 a '−' label (i.e. O_4 = '−') because no segmentation is needed for the last EDU.
Suppose the model predicts the first segmentation at E_3, producing E_{1:3} and E_4. What is the best way to segment E_{1:3} to produce a tree that is as close as possible to the gold tree? The canonical segmentation order O_{1:3} is [2, 1, −] (the label of the last EDU is replaced by '−'), from which we can see that the next best segmentation is to split at E_2 to create E_{1:2} and E_3. Creating the canonical segmentation order O, and following it as much as possible, ensures the sub-tree that we create for E_{1:3} mimics the structure of the gold tree.
The dynamic oracle labels nuclearity-discourse relations following the same idea. We introduce R, a list of gold nuclearity-discourse relations. For our example, R_{1:4} = [r_2, r_1, r_3, −] (based on the gold tree; see Figure 6 (left)). If the model first segments at E_3, creating E_{1:3} and E_4, then when we segment at E_2 (the next best segmentation choice) we follow R and label the nuclearity-discourse relation with r_1. As before, following the original label list R ensures we keep the nuclearity-discourse relations as faithful as possible (Figure 6 (right bottom)).
The dynamic oracle of our top-down parser is arguably quicker than that of a transition-based parser, as we do not need to accumulate cost for every transition taken. Instead, the dynamic oracle simply follows the gold segmentation order O to preserve as many subtrees as possible when an error occurs. We present pseudocode for the proposed dynamic oracle in Algorithm 1.
The probability of using the ground truth segmentation or predicted segmentation during training is controlled by the hyper-parameter α ∈ [0, 1] (see Algorithm 1). Intuitively, this hyper-parameter allows the model to alternate between exploring its (possibly erroneous) segmentation or learning from the ground truth segmentation. The oracle reverts to its static variant when α = 0.
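A minimal sketch of the oracle's match step is given below, assuming O holds per-EDU gold split ranks (our reading of the worked example above: the EDU with the smallest rank inside the current span is the next-best split) and R holds the gold nuclearity-discourse label attached to each split:

```python
def oracle_match(span, O, R):
    """Dynamic-oracle lookup for the EDU span [m, n) of split candidates.

    O[i] is the rank at which E_i is split in the gold derivation (the
    last EDU carries no rank; use float("inf") as its sentinel), and
    R[i] is the gold nuclearity-discourse label for that split.
    Returns the next-best split index and its gold label.
    """
    m, n = span
    best = min(range(m, n), key=lambda i: O[i])  # smallest gold rank wins
    return best, R[best]
```

For the 4-EDU example (0-indexed), O = [2, 1, 3, inf]: on the full span the oracle picks E_2 (index 1) first, and on the erroneous span E_{1:3} it still picks E_2, preserving as much gold sub-tree structure as possible.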

Data
We use the English RST Discourse Treebank (Carlson et al., 2001) for our experiments, consistent with recent studies (Ji and Eisenstein, 2014; Li et al., 2014; Feng and Hirst, 2014). The dataset is based on the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993), with 347 documents for training and the remaining 38 documents for testing. We use the same development set as , which consists of 35 documents selected from the training set. We also use the same 18 discourse labels. Stanford CoreNLP is used for POS tagging. 6

Model Configurations
We experiment with two segmentation models, LSTM (Section 3.1) and transformer (Section 3.2), both implemented in the PyTorch framework. 7 As EDUs are provided in the dataset, no automatic EDU segmentation is required in our experiments. For the LSTM model, the dimensionality of the Bi-LSTMs in the encoder is 256, while that of the segmenter (Bi-LSTM 4) is 128 (Figure 3). The embedding dimensions of words, POS tags, EDU type, and syntax features are 200, 200, 100, and 1,200 respectively, and we initialize word embeddings with GloVe (Pennington et al., 2014). 8 For hyper-parameters, we use the following: batch size = 4, gradient accumulation = 2, learning rate = 0.001, dropout probability = 0.5, and optimizer = Adam (with an epsilon of 1e-6). The loss scaling hyper-parameters (Equation (2)) are tuned on the development set, and set to λ_1 = 1.0 and λ_2 = 1.0.
For the transformer model, the document length limit is set to 512 tokens, and longer documents are broken into smaller partial documents. As before, we truncate each EDU to the first 50 words. We initialize the transformer in the encoder with bert-base, and the transformer in the segmenter with random weights (Figure 4). The transformer segmenter has 2 layers with 8 heads and a feed-forward hidden size of 2048. The training hyper-parameters are: initial learning rate = 5e-5, maximum epochs = 250, warm-up = 2000 steps, and dropout = 0.2. For the λ hyper-parameters, we use the same configuration as for the LSTM model.

6 https://stanfordnlp.github.io/CoreNLP
7 We use the Huggingface framework for the transformer models.
8 https://nlp.stanford.edu/projects/glove
We tuned the segmentation loss penalty hyperparameter β (Section 3.4) and the dynamic oracle hyper-parameter α (Section 3.5) based on the development set. Both the LSTM and transformer models use the same β = 0.35 and α = 0.65. We activate the dynamic oracle after training for 50 epochs for both models.
For evaluation, we use the standard metrics introduced by Marcu (2000): Span, Nuclearity, Relation, and Full. We report micro-averaged F-1 scores on labelled attachment decisions (original Parseval), following the recommendation of Morey et al. (2017). We also present an evaluation with the RST-Parseval procedure in Appendix A.
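Micro-averaged F-1 over labelled attachment decisions can be sketched as follows; this is a simplification for illustration, not the official evaluation script:

```python
def micro_f1(pred_spans, gold_spans):
    """Micro-averaged F-1 over labelled spans (original Parseval style).

    Each span is a (start, end, label) tuple; a prediction counts as a
    true positive only on an exact match of boundaries and label.
    """
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0   # precision
    r = tp / len(gold) if gold else 0.0   # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```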

Results
We first perform a feature addition study over our models to find the best model configuration; results are presented in Table 1. Note that these results are computed over the development set, based on a static oracle.
For the vanilla models, the transformer performs much better than the LSTM. Adding syntax features (+Syntax) improves both models, although it is more beneficial for the LSTM. A similar trend is observed when we modify the segmentation loss to penalize the model more heavily for segmentation errors made on input sequences with more EDUs (+Penalty; Section 3.4): the transformer model sees an improvement of +0.8 while the LSTM model improves by +1.2. Lastly, when we combine both syntax features and the segmentation penalty, the LSTM model again shows an appreciable improvement, while the transformer model drops marginally in performance. 9 Given these results, we use both syntax features and the segmentation penalty for the LSTM model, but only the segmentation penalty for the transformer model in the remainder of our experiments.
We next benchmark our models against state-of-the-art RST parsers over the test set, as presented in Table 2 (original Parseval) and Table 5 (RST-Parseval, as an additional result). Except , all bottom-up results are from Morey et al. (2017). We present the labelled attachment decision performance for  by running the authors' code for three runs and taking the average. 10 We also present the reported scores for the other top-down RST parsers (Zhang et al., 2020; Kobayashi et al., 2020). 11 Human performance in Table 2 and Table 5 is the score of human agreement reported by Joty et al. (2015) and Morey et al. (2017).

Overall, in Table 2 our top-down models (LSTM and transformer) outperform all bottom-up and top-down baselines across all metrics. As we saw in the feature addition study, the LSTM model outperforms the transformer model, even though the transformer uses pre-trained BERT. We hypothesize that this may be because BERT is trained over shorter texts (paragraphs or sentence pairs), while our documents are considerably longer. Also, due to memory constraints, we break long documents into partial documents (Section 3.2), limiting fine-grained word-to-word attention to only nearby EDUs.

10 https://github.com/yunan4nlp/NNDisParser
11 Neither Zhang et al. (2020) nor Kobayashi et al. (2020) released their code, so we were unable to rerun their models.

[Table 2: Results over the test set. All metrics (S: Span, N: Nuclearity, R: Relation, F: Full) are averaged over three runs. "*" denotes reported performance. "†" and "‡" denote that the model uses sentence and paragraph boundary features, respectively. In this evaluation, Kobayashi et al. (2020) do not report original Parseval results.]

[Table 3: Impact of the dynamic oracle over documents of differing length. Scores (micro-averaged F-1 on labelled attachment decisions) are averaged over three runs on the test set.]
In Table 2, we also present results for our model without paragraph features, and compare against other models that do not use paragraph boundary features. 12 First, we observe that our best model substantially outperforms all models with paragraph boundary features in terms of the Full metric. Compared to Zhang et al. (2020), our models (without this feature) achieve improvements of +0.1, +1.9, +3.2, and +3.1 for Span, Nuclearity, Relation, and Full respectively.

Analysis
In Table 3 we present the impact of the dynamic oracle over documents of differing length for the LSTM model. Generally, we found that the static model performs better over shorter documents, while the dynamic oracle is more effective over longer documents. For instance, for documents with 50-100 EDUs, the dynamic oracle improves the Span, Nuclearity, and Relation metrics substantially. We also observe that the longer the document, the more difficult tree prediction becomes, as confirmed by the decreasing trend across all metrics for longer documents in Table 3.
In total, our best model correctly predicts 1,698 of the 2,308 spans in the original Parseval trees, and correctly predicts 1,517 segmentation points (pairs). We further analyze these pairs by presenting confusion matrices for nuclearity and relation prediction in Figure 7 and Figure 8, respectively. In Figure 8 we present an analysis over the top-7 relations, plus an other relation that represents the remaining 11 classes. As with nuclearity prediction, the relation class distribution is imbalanced, with elab accounting for 37% of the examples. Some relations are related to elab (see Table 4 for examples), such as back, cause, and list, for which we see some false positives. This results in the low precision of elab (74%). Unlike elab, the relation attr is also a major class (representing 14% of the training data), but its precision and recall are substantially higher, at 94% and 96% respectively, suggesting it is less ambiguous. For other, recall is 45%, and most errors are classified as elab (31%).

Conclusion
We introduce a top-down approach to RST parsing via sequence labelling. Our model is conceptually simpler than previous top-down discourse parsers, and can leverage pre-trained language models such as BERT. We additionally propose a dynamic oracle for our top-down parser, and demonstrate that our best model achieves a new state-of-the-art for RST parsing.
A Evaluation with RST-Parseval Procedure

[Table 5: Results over the test set calculated using micro-averaged F-1 on RST-Parseval. All metrics (S: Span, N: Nuclearity, R: Relation, F: Full) are averaged over three runs. "*" denotes reported performance. "†" and "‡" denote that the model uses sentence and paragraph boundary features, respectively. The table body (Method, S, N, R, F columns) and the note on Zhang et al. (2020) are truncated in the extracted text.]

[Table 9: Running time of the static models during training; caption truncated in the extracted text.]