W-RST: Towards a Weighted RST-style Discourse Framework

Aiming for a better integration of data-driven and linguistically-inspired approaches, we explore whether RST nuclearity, which assigns a binary importance assessment to text segments, can be replaced by automatically generated, real-valued scores, in what we call a Weighted-RST framework. In particular, we find that weighted discourse trees derived from auxiliary tasks can benefit key NLP downstream applications when compared to nuclearity-centered approaches. We further show that real-valued importance distributions partially and interestingly align with the assessments and the uncertainty of human annotators.


Introduction
Ideally, research in Natural Language Processing (NLP) should balance and integrate findings from machine learning approaches with insights and theories from linguistics. With the enormous success of data-driven approaches over the last decades, this balance has arguably shifted excessively, with linguistic theories playing a less and less critical role. Even more importantly, few attempts have been made to improve such theories in light of recent empirical results.
In the context of discourse, two main theories have emerged in the past: the Rhetorical Structure Theory (RST) (Mann and Thompson, 1988) and the Penn Discourse Treebank (PDTB) framework (Prasad et al., 2008). In this paper, we focus on RST, exploring whether the underlying theory can be refined in a data-driven manner.
In general, RST postulates a complete discourse tree for a given document. To obtain this formal representation as a projective constituency tree, a given document is first separated into so-called Elementary Discourse Units (EDUs), representing clause-like sentence fragments of the input document. Afterwards, the discourse tree is built by hierarchically aggregating EDUs into larger constituents, annotated with an importance indicator (in RST called nuclearity) and a relation holding between siblings in the aggregation. The nuclearity attribute in RST thereby assigns each sub-tree either a nucleus attribute, indicating central importance of the sub-tree in the context of the document, or a satellite attribute, categorizing the sub-tree as of peripheral importance. The relation attribute further characterizes the connection between sub-trees (e.g. Elaboration, Cause, Contradiction).
One central requirement of the RST discourse theory, as for all linguistic theories, is that a trained human should be able to specify and interpret the discourse representations. While this is a clear advantage when trying to generate explainable outcomes, it also introduces problematic, human-centered simplifications; the most radical of which is arguably the nuclearity attribute, indicating the importance among siblings.
Intuitively, such a coarse (binary) importance assessment cannot represent nuanced differences in sub-tree importance, which can potentially be critical for downstream tasks. For instance, the importance of two nuclei siblings is rather intuitive to interpret. However, having siblings annotated as "nucleus-satellite" or "satellite-nucleus" leaves open the question of how much more important the nucleus sub-tree is compared to the satellite, as shown in Figure 1. In general, it is unclear (and unlikely) that the actual importance distributions between siblings with the same nuclearity attribution are consistent.
Based on this observation, we investigate the potential of replacing the binary nuclearity assessment postulated by RST with automatically generated, real-valued importance scores in a new, Weighted-RST framework. In contrast with previous work that has assumed RST and developed computational models of discourse by simply applying machine learning methods to RST annotated treebanks (Ji and Eisenstein, 2014; Feng and Hirst, 2014; Joty et al., 2015; Li et al., 2016; Wang et al., 2017; Yu et al., 2018), we rely on very recent empirical studies showing that weighted "silver-standard" discourse trees can be inferred from auxiliary tasks such as sentiment analysis (Huber and Carenini, 2020b) and summarization (Xiao et al., 2021).

Figure 1: Document wsj_0639 from the RST-DT corpus with inconsistent importance differences between N-S attributions. (The top-level satellite is clearly more central to the overall context than the lower-level satellite; however, both are similarly assigned the satellite attribution by at least one annotator.) Top relation: Annotator 1: N-S, Annotator 2: N-N.
In our evaluation, we assess both computational benefits and linguistic insights. In particular, we find that automatically generated, weighted discourse trees can benefit key NLP downstream tasks. We further show that real-valued importance scores (at least partially) align with human annotations and can interestingly also capture uncertainty in human annotators, implying some alignment of the importance distributions with linguistic ambiguity.

Related Work
First introduced by Mann and Thompson (1988), the Rhetorical Structure Theory (RST) has been one of the primary guiding theories for discourse analysis (Carlson et al., 2002; Subba and Di Eugenio, 2009; Zeldes, 2017; Gessler et al., 2019), discourse parsing (Ji and Eisenstein, 2014; Feng and Hirst, 2014; Joty et al., 2015; Li et al., 2016; Wang et al., 2017; Yu et al., 2018), and text planning (Torrance, 2015; Gatt and Krahmer, 2018; Guz and Carenini, 2020). The RST framework thereby comprehensively describes the organization of a document, guided by the author's communicative goals, encompassing three components: (1) A projective constituency tree structure, often referred to as the tree span. (2) A nuclearity attribute, assigned to every internal node of the discourse tree, encoding the relative importance between the node's sub-trees, with a nucleus expressing primary importance and a satellite signifying a supplementary sub-tree. (3) A relation attribute for every internal node, describing the relationship between the sub-trees of the node (e.g., Contrast, Evidence, Contradiction).
Arguably, the weakest aspect of an RST representation is the nuclearity assessment, which draws an overly coarse distinction between primary and secondary importance of sub-trees. However, despite its binary assignment of importance, and even though the nuclearity attribute is only one of three components of an RST tree, it has major implications for many downstream tasks, as already shown early on by Marcu (1999), using the nuclearity attribute as the key signal in extractive summarization. Further work in sentiment analysis (Bhatia et al., 2015) also showed the importance of nuclearity for the task by first converting the constituency tree into a dependency tree (more aligned with the nuclearity attribute) and then using that tree to predict sentiment more accurately. Both of these results indicate that nuclearity, even in the coarse RST version, already contains valuable information. Hence, we believe that this coarse-grained classification is reasonable when manually annotating discourse, but see it as a major opportunity for improvement if a more fine-grained assessment could be correctly assigned. We therefore explore the potential of assigning a weighted nuclearity attribute in this paper.
While plenty of studies have highlighted the important role of discourse for real-world downstream tasks, including summarization (Gerani et al., 2014; Xu et al., 2020; Xiao et al., 2020), sentiment analysis (Bhatia et al., 2015; Hogenboom et al., 2015; Nejat et al., 2017) and text classification (Ji and Smith, 2017), more critical to our approach is very recent work exploring this connection in the opposite direction. In Huber and Carenini (2020b), we exploit sentiment-related information to generate "silver-standard" nuclearity-annotated discourse trees, showing their potential on the domain-transfer discourse parsing task. Crucially for our purposes, this approach internally generates real-valued importance-weights for trees.
For the task of extractive summarization, we follow our intuition given in Xiao et al. (2020) and Xiao et al. (2021), exploiting the connection between summarization and discourse. In particular, in Xiao et al. (2021), we demonstrate that the self-attention matrix learned during the training of a transformer-based summarizer captures valid aspects of constituency and dependency discourse trees.

Figure 2: Three phases of our approach to generate weighted RST-style discourse trees. Left and center steps are described in section 3, the right component is described in section 4. † = as in Huber and Carenini (2020b), ‡ = as in Marcu (1999), * = the sentiment prediction component is a linear combination, mapping the aggregated embedding to the sentiment output, previously learned on the training portion of the dataset.
To summarize, building on our previous work on creating discourse trees through distant supervision, we take a first step towards generating weighted discourse trees from the sentiment analysis and summarization tasks.

W-RST Treebank Generation
Given the intuition from above, we combine information from machine learning approaches with insights from linguistics, replacing the human-centered nuclearity assignment with real-valued weights obtained from the sentiment analysis and summarization tasks. An overview of the process to generate weighted RST-style discourse trees is shown in Figure 2, containing the training phase (left) and the W-RST discourse inference phase (center) described here. The W-RST discourse evaluation (right) is covered in section 4.

Weighted Trees from Sentiment
To generate weighted discourse trees from sentiment, we slightly modify the publicly available code presented in Huber and Carenini (2020b) by removing the nuclearity discretization component.
An overview of our method is shown in Figure 2 (top), while a detailed view is presented in the left and center parts of Figure 3. We start by training a Multiple-Instance Learning (MIL) model based on Angelidis and Lapata (2018) on a corpus with document-level sentiment gold-labels, internally annotating each input-unit (in our case EDUs) with a sentiment- and an attention-score. After the MIL model is trained (center), a tuple (s_i, a_i) containing a sentiment score s_i and an attention score a_i is extracted for each EDU i. Based on these tuples representing leaf nodes, the CKY algorithm (Jurafsky and Martin, 2014) is applied to find the tree structure that best aligns with the overall document sentiment, through a bottom-up aggregation approach defined as:

s_p = (s_l · a_l + s_r · a_r) / (a_l + a_r),    a_p = (a_l + a_r) / 2

with nodes l and r as the left and right child-nodes of p, respectively. The attention scores (a_l, a_r) are here interpreted as the importance weights for the respective sub-trees (w_l = a_l / (a_l + a_r) and w_r = a_r / (a_l + a_r)), resulting in a complete, normalized and weighted discourse structure as required for W-RST. We call the discourse treebank generated with this approach W-RST-Sent.
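The bottom-up step described above can be sketched as follows. This is a minimal illustration, not the released code: the function name `merge` and the parent-attention update (mean of the children) are our own assumptions.

```python
# Illustrative sketch of one bottom-up CKY aggregation step over
# (sentiment, attention) leaf tuples.

def merge(left, right):
    """Combine two (sentiment, attention) child tuples into a parent tuple,
    returning the normalized importance weights (w_l, w_r) as well."""
    s_l, a_l = left
    s_r, a_r = right
    w_l = a_l / (a_l + a_r)          # normalized importance of left child
    w_r = a_r / (a_l + a_r)          # normalized importance of right child
    s_p = w_l * s_l + w_r * s_r      # attention-weighted parent sentiment
    a_p = (a_l + a_r) / 2.0          # assumed parent attention update
    return (s_p, a_p), (w_l, w_r)

# Two EDUs, the second carrying three times the attention of the first.
(parent, (w_l, w_r)) = merge((0.2, 1.0), (0.8, 3.0))
print(w_l, w_r)               # 0.25 0.75
print(round(parent[0], 2))    # 0.65
```

Repeating this merge over all CKY candidate splits, and keeping the tree whose root sentiment best matches the document gold-label, yields the weighted structure.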

Weighted Trees from Summarization
In order to derive weighted discourse trees from a summarization model, we follow Xiao et al. (2021), generating weighted discourse trees from the self-attention matrices of a transformer-based summarization model. An overview of our method is shown in Figure 2 (bottom), while a detailed view is presented in the left and center parts of Figure 4.
We start by training a transformer-based extractive summarization model (left), consisting of three components: (1) a pre-trained BERT EDU encoder generating EDU embeddings, (2) a standard transformer architecture as proposed in Vaswani et al. (2017), and (3) a final classifier, mapping the outputs of the transformer to a probability score for each EDU, indicating whether the EDU should be part of the extractive summary.
With the trained transformer model, we then extract the self-attention matrix A and build a discourse tree in bottom-up fashion (as shown in the center of Figure 4). Specifically, the self-attention matrix A reflects the relationships between units in the document, where entry A ij measures how much the i-th EDU relies on the j-th EDU. Given this information, we generate an unlabeled constituency tree using the CKY algorithm (Jurafsky and Martin, 2014), optimizing the overall tree score, as previously done in Xiao et al. (2021).
In terms of weight-assignment, given a sub-tree spanning EDUs i to j, split into child-constituents at EDU k, max(A_{i:k, (k+1):j}), representing the maximal attention value that any EDU in the left constituent pays to an EDU in the right child-constituent, reflects how much the left sub-tree relies on the right sub-tree, while max(A_{(k+1):j, i:k}) defines how much the right sub-tree depends on the left. We define the importance-weights of the left (w_l) and right (w_r) sub-trees as:

w_l = max(A_{(k+1):j, i:k}) / (max(A_{i:k, (k+1):j}) + max(A_{(k+1):j, i:k})),    w_r = 1 − w_l

In this way, the importance scores of the two sub-trees represent a real-valued distribution. In combination with the unlabeled structure computation, we generate a weighted discourse tree for each document. We call the discourse treebank generated with the summarization downstream information W-RST-Summ.
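This weight computation can be sketched as follows; the reading that a sub-tree is weighted by how strongly the *other* side attends to it is our interpretation of the text, and the function name is illustrative.

```python
import numpy as np

def split_weights(A, i, k, j):
    """Importance weights for splitting the EDU span i..j (inclusive)
    into a left span i..k and a right span k+1..j.

    A[x, y] is assumed to measure how much EDU x relies on EDU y, so a
    sub-tree is weighted by how much the other side depends on it.
    """
    left_on_right = A[i:k + 1, k + 1:j + 1].max()   # left relies on right
    right_on_left = A[k + 1:j + 1, i:k + 1].max()   # right relies on left
    w_l = right_on_left / (left_on_right + right_on_left)
    return w_l, 1.0 - w_l

# Toy 4-EDU self-attention matrix; EDUs 2..3 attend strongly to EDU 0.
A = np.array([[0.0, 0.1, 0.2, 0.1],
              [0.1, 0.0, 0.1, 0.3],
              [0.6, 0.2, 0.0, 0.1],
              [0.2, 0.1, 0.1, 0.0]])
w_l, w_r = split_weights(A, 0, 1, 3)
print(round(w_l, 3), round(w_r, 3))  # 0.667 0.333
```

Here the right span depends on the left (max attention 0.6) twice as strongly as the reverse (0.3), so the left sub-tree receives two thirds of the weight.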

W-RST Discourse Evaluation
To assess the potential of W-RST, we consider two evaluation scenarios (Figure 2, right): (1) Apply weighted discourse trees to the tasks of sentiment analysis and summarization and (2) analyze the weight alignment with human annotations.

Weight-based Discourse Applications
In this evaluation scenario, we address the question of whether W-RST trees can support downstream tasks better than traditional RST trees with nuclearity. Specifically, we leverage the discourse trees learned from sentiment for the sentiment analysis task itself and, similarly, rely on the discourse trees learned from summarization to benefit the summarization task.

Sentiment Analysis
In order to predict the sentiment of a document in W-RST-Sent based on its weighted discourse tree, we need to introduce an additional source of information to be aggregated according to such a tree. Here, we choose word embeddings, which are commonly used as an initial transformation in many models tackling the sentiment prediction task (Kim, 2014; Tai et al., 2015; Yang et al., 2016; Adhikari et al., 2019; Huber and Carenini, 2020a). To avoid introducing additional confounding factors through sophisticated tree aggregation approaches (e.g. Tree-LSTMs (Tai et al., 2015)), we select a simple method, aiming to directly compare the inferred tree-structures and allowing us to better assess the performance differences originating from the weight/nuclearity attribution (see right step in Figure 3). More specifically, we start by computing the average word-embedding for each leaf node leaf_i (here containing a single EDU) in the discourse tree.
c_i = (1 / |leaf_i|) · Σ_{j=1}^{|leaf_i|} Emb(word_i^j)

with |leaf_i| as the number of words in leaf_i, Emb(·) being the embedding lookup, and word_i^j representing word j within leaf_i. Subsequently, we aggregate constituents, starting from the leaf nodes (with leaf_i as embedding constituent c_i), according to the weights of the discourse tree. For any two sibling constituents c_l and c_r with parent sub-tree c_p in the binary tree, we compute c_p = c_l · w_l + c_r · w_r, with w_l and w_r as the real-valued weight-distribution extracted from the inferred discourse tree and c_p, c_l and c_r as dense encodings. We aggregate the complete document in bottom-up fashion, eventually reaching a root node embedding containing a tree-weighted average of the leaf-nodes. Given the root-node embedding representing a complete document, a simple Multilayer Perceptron (MLP), trained on the original training portion of the MIL model, is used to predict the sentiment of the document.
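The leaf averaging and the weighted bottom-up aggregation can be sketched as follows; the nested-tuple tree encoding and the function names are illustrative, not from the original implementation.

```python
import numpy as np

def leaf_embedding(word_vectors):
    """Average word-embedding of one leaf (EDU): the mean of Emb(word)."""
    return np.mean(word_vectors, axis=0)

def aggregate(node):
    """Bottom-up weighted aggregation: leaves are embedding vectors;
    internal nodes are tuples (left, right, w_l, w_r) with w_l + w_r = 1."""
    if isinstance(node, np.ndarray):
        return node
    left, right, w_l, w_r = node
    return w_l * aggregate(left) + w_r * aggregate(right)

# Two-leaf example: c_p = 0.7 * c_l + 0.3 * c_r
c_l = leaf_embedding(np.array([[1.0, 0.0], [1.0, 0.0]]))  # two identical words
c_r = np.array([0.0, 1.0])
root = aggregate((c_l, c_r, 0.7, 0.3))
print(root)  # [0.7 0.3]
```

The root vector is then fed to the sentiment MLP; using a plain weighted sum keeps the comparison focused on the tree weights rather than the aggregation model.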

Summarization
In the evaluation step of the summarization model (right of Figure 4), we use the weighted discourse tree of a document in W-RST-Summ to predict its extractive summary by applying an adaptation of the unsupervised summarization method by Marcu (1999).
We choose this straightforward algorithm over more elaborate and hyper-parameter-heavy approaches to avoid confounding factors, since our aim is to evaluate solely the potential of the weighted discourse trees compared to standard RST-style annotations. In the original algorithm, a summary is computed based on the nuclearity attribute by recursively computing the importance scores for all units as:

S_n(u, N) = d_N, if u ∈ Prom(N);    S_n(u, N) = max_{C(N)} S_n(u, C(N)), otherwise

where C(N) ranges over the children of N, and Prom(N) is the promotion set of node N, which is defined in bottom-up fashion as follows: (1) Prom of a leaf node is the leaf node itself. (2) Prom of an internal node is the union of the promotion sets of its nucleus children. Furthermore, d_N represents the level of a node N, computed as the distance from the level of the lowest leaf-node. This way, units in the promotion set originating from nodes that are higher up in the discourse tree are amplified in their importance compared to those from lower levels.
As for the W-RST-Summ discourse trees with real-valued importance-weights, we adapt Marcu's algorithm by replacing the binary promotion-set membership with the real-valued importance scores, accumulating the weighted levels along the path from the root to each unit:

S_w(u) = Σ_{N ∈ anc(u)} d_N · w_N(u)

with anc(u) as the ancestors of unit u and w_N(u) as the weight of the child of N on the path from N to u. Once S_n or S_w are computed, the top-k units of the highest promotion set or with the highest importance scores, respectively, are selected into the final summary.
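Both scoring schemes can be sketched as follows. This is our own illustrative reading, not the authors' released implementation: sub-tree height stands in for the level d_N, the tuple-based tree encoding is invented, and the weighted recursion is one plausible adaptation of the promotion-set idea.

```python
# Tree nodes: ("leaf", edu_id) or ("node", left, right, tag), where tag is
# a nuclearity string ("NS", "SN", "NN") for Marcu-style trees, or a
# weight pair (w_l, w_r) for W-RST trees.

def height(t):
    return 0 if t[0] == "leaf" else 1 + max(height(t[1]), height(t[2]))

def prom_scores(t, scores=None, d=None):
    """Marcu-style: a unit's score is the level of the highest ancestor
    whose promotion set (all-nucleus path down to the unit) contains it."""
    if scores is None:
        scores, d = {}, height(t)
    if t[0] == "leaf":
        scores[t[1]] = d
        return scores
    _, left, right, tag = t
    # A nucleus child inherits the carried level; a satellite restarts
    # at its own sub-tree level.
    prom_scores(left, scores, d if tag in ("NS", "NN") else height(left))
    prom_scores(right, scores, d if tag in ("SN", "NN") else height(right))
    return scores

def weighted_scores(t, scores=None, acc=0.0):
    """W-RST variant: each unit accumulates the levels of its ancestors,
    discounted by the weights along its path (one plausible adaptation)."""
    if scores is None:
        scores = {}
    if t[0] == "leaf":
        scores[t[1]] = acc
        return scores
    _, left, right, (w_l, w_r) = t
    d = height(t)
    weighted_scores(left, scores, acc + d * w_l)
    weighted_scores(right, scores, acc + d * w_r)
    return scores

# Three-EDU example: EDU 0 sits on an all-nucleus path from the root.
nuc_tree = ("node", ("node", ("leaf", 0), ("leaf", 1), "NS"), ("leaf", 2), "NS")
print(prom_scores(nuc_tree))  # {0: 2, 1: 0, 2: 0}
```

Selecting the top-k EDU ids by score then yields the extractive summary in either setting.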

Nuclearity-attributed Baselines
Figure 5: Three phases of our approach. Left: Generation of W-RST-Sent/Summ discourse trees. Right: Linguistic evaluation.

To test whether the W-RST trees effectively support the downstream tasks, we need to generate traditional RST trees with nuclearity to compare against. However, moving from weighted discourse trees to coarse nuclearity requires the introduction of a threshold. More specifically, while "nucleus-satellite" and "satellite-nucleus" assignments can be naturally generated depending on the distinct weights, in order to assign the third "nucleus-nucleus" class, frequently appearing in RST-style treebanks, we need to specify how close two weights have to be for such a configuration to apply. Formally, we set a threshold t as follows:

If |w_l − w_r| < t → nucleus-nucleus
Else, if w_l > w_r → nucleus-satellite
Else (w_l ≤ w_r) → satellite-nucleus

This way, RST-style treebanks with nuclearity attributions can be generated from W-RST-Sent and W-RST-Summ and used for the sentiment analysis and summarization downstream tasks. For the nuclearity-attributed baseline of the sentiment task, we use a similar approach as for the W-RST evaluation procedure, but assign two distinct weights w_n and w_s to the nucleus and satellite child, respectively. Since it is not clear how much more important a nucleus node is compared to a satellite using the traditional RST notation, we define the two weights based on the threshold t as:

w_n = (1 + t) / 2,    w_s = (1 − t) / 2

The intuition behind this formulation is that for a high threshold t (e.g. 0.8), the nuclearity needs to be very prominent (the difference between the normalized weights needs to exceed 0.8), making the nucleus clearly more important than the satellite, while for a small threshold (e.g. 0.1), even relatively balanced weights (for example w_l = 0.56, w_r = 0.44) will be assigned as "nucleus-satellite", making the potential difference in importance between the siblings less eminent.
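The threshold rule and the baseline weights can be sketched as follows; the function names are illustrative, and the formula w_n = (1 + t) / 2, w_s = (1 − t) / 2 is our reconstruction of the baseline weighting, chosen so that the nucleus-satellite difference equals the threshold t.

```python
def to_nuclearity(w_l, w_r, t):
    """Map a normalized weight pair back to a coarse nuclearity label."""
    if abs(w_l - w_r) < t:
        return "nucleus-nucleus"
    return "nucleus-satellite" if w_l > w_r else "satellite-nucleus"

def baseline_weights(t):
    """Fixed nucleus/satellite weights for the baseline: their difference
    equals t and they sum to 1 (assumed reconstruction)."""
    return (1 + t) / 2, (1 - t) / 2

# At t = 0.1, even fairly balanced weights become nucleus-satellite.
print(to_nuclearity(0.56, 0.44, 0.1))  # nucleus-satellite
print(baseline_weights(0.5))           # (0.75, 0.25)
```

Sweeping t from 0 to 1 then reproduces the family of nuclearity-attributed baselines compared against in the experiments.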
For the nuclearity-attributed baseline for summarization, we directly apply the original algorithm by Marcu (1999) as described in section 4.1.2. However, when using the promotion set to determine which EDUs are added to the summary, potential ties can occur. Since the discourse tree does not provide any information on how to prioritize those, we randomly select units from the candidates whenever there is a tie. This avoids exploiting any positional bias in the data (e.g. the lead bias), which would confound the results.

Weight Alignment with Human Annotation
As for our second W-RST discourse evaluation task, we investigate whether the real-valued importance-weights align with human annotations. To be able to explore this scenario, we generate weighted tree annotations for an existing discourse treebank (RST-DT (Carlson et al., 2002)). In this evaluation task we verify: (1) whether the nucleus in a gold-annotation generally receives more weight than a satellite (i.e., whether importance-weights generally favour nuclei over satellites) and, similarly, whether nucleus-nucleus relations receive more balanced weights.
(2) In accordance with Figure 1, we further explore how well the weights capture the extent to which a relation is dominated by the nucleus. Here, our intuition is that for inconsistent human nuclearity annotations the spread should generally be lower than for consistent annotations, assuming that human misalignment in the discourse annotation indicates ambivalence about the importance of sub-trees.
To test for these two properties, we use discourse documents individually annotated by two human annotators and analyze each sub-tree within the doubly-annotated documents with consistent inter-annotator structure assessment for its nuclearity assignment. For each of the 6 possible inter-annotator nuclearity assessments, consisting of 3 consistent annotation classes (namely N-N/N-N, N-S/N-S and S-N/S-N) and 3 inconsistent annotation classes (namely N-N/N-S, N-N/S-N and N-S/S-N), we explore the respective weight distribution of the documents annotated with the two W-RST tasks, sentiment analysis and summarization (see Figure 5). We compute an average spread s_c for each of the 6 inter-annotator nuclearity assessment classes c as:

s_c = (1 / |c|) · Σ_{j ∈ c} (w_l^j − w_r^j)

with w_l^j and w_r^j as the weights of the left and right child node of sub-tree j in class c, respectively.
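The spread computation amounts to a simple mean over weight differences; a minimal sketch (the function name is illustrative):

```python
def avg_spread(pairs):
    """Average spread s_c: the mean of (w_l - w_r) over all
    structure-aligned sub-trees j in one annotation class c."""
    return sum(w_l - w_r for w_l, w_r in pairs) / len(pairs)

# One N-S-leaning and one S-N-leaning sub-tree in the same class:
print(round(avg_spread([(0.7, 0.3), (0.4, 0.6)]), 6))  # 0.1
```

A positive s_c thus indicates that the class leans towards nucleus-satellite (w_l > w_r), a negative s_c towards satellite-nucleus.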

Experimental Setup
Sentiment Analysis: We follow our previous approach in Huber and Carenini (2020b) for the model training and W-RST discourse inference steps (left and center in Figure 3), using the adapted MILNet model from Angelidis and Lapata (2018) trained with a batch-size of 200 and 100 neurons in a single-layer bi-directional GRU with 20% dropout for 25 epochs. Next, discourse trees are generated using the best-performing heuristic CKY method with the stochastic exploration-exploitation trade-off from Huber and Carenini (2020b) (beam size 10, linearly decreasing τ). As word-embeddings in the W-RST discourse evaluation (right in Figure 3), we use GloVe embeddings (Pennington et al., 2014), which previous work (Tai et al., 2015; Huber and Carenini, 2020a) indicates to be suitable for aggregation in discourse processing. For training and evaluation of the sentiment analysis task, we use the 5-class Yelp'13 review dataset (Tang et al., 2015). To compare our approach against the traditional RST approach with nuclearity, we explore the impact of 11 distinct thresholds for the baseline described in §4.1.3, ranging from 0 to 1 in 0.1 intervals.

Summarization: To be consistent with RST, our summarizer extracts EDUs instead of sentences from a given document. The model is trained on the EDU-segmented CNNDM dataset containing EDU-level Oracle labels published by Xu et al. (2020). We further use a pre-trained BERT-base ("uncased") model to generate the embeddings of EDUs. The transformer used is the standard model with 6 layers and 8 heads in each layer (d = 512). We train the extractive summarizer on the training set of the CNNDM corpus (Nallapati et al., 2016) and pick the best attention head using the RST-DT dataset (Carlson et al., 2002) as the development set. We test the trees by running the summarization algorithm of Marcu (1999) on the test set of the CNNDM dataset, and select the top-6 EDUs based on the importance score to form a summary in natural order.
Regarding the baseline model using thresholds, we apply the same 11 thresholds as for the sentiment analysis task.
Weight Alignment with Human Annotation: As discussed in §4.2, this evaluation requires two parallel human generated discourse trees for every document. Luckily, in the RST-DT corpus pub-  Table 1: Statistics on consistently and inconsistently annotated samples of the 1, 354 structure-aligned subtrees generated by two distinct human annotators.
lished by Carlson et al. (2002), 53 of the 385 documents annotated with full RST-style discourse trees are doubly tagged by a second linguist. We use the 53 documents containing 1, 354 consistent structure annotations between the two analysts to evaluate the linguistic alignment of our generated W-RST documents with human discourse interpretations. Out of the 1, 354 structure-aligned subtrees, in 1, 139 cases both annotators agreed on the nuclearity attribute, while 215 times a nuclearity mismatch appeared, as shown in detail in Table 1.

Results and Analysis
The results of the experiments on the discourse applications for sentiment analysis and summarization are shown in Figure 6. The results for sentiment analysis (top) and summarization (bottom) thereby show a similar trend: with an increasing threshold, and therefore a larger number of N-N relations (shown as grey bars in the figure), the standard RST baseline (blue line) consistently improves on the respective performance measure of both tasks. However, after reaching the best performance at a threshold of 0.8 for sentiment analysis and 0.6 for summarization, the performance starts to deteriorate. This general trend seems reasonable, given that N-N relations represent a rather frequent nuclearity connection, yet classifying every connection as N-N leads to a severe loss of information. Furthermore, the performance suggests that while the N-N class is important in both cases, the optimal threshold varies depending on the task and potentially also the corpus used, making further task-specific fine-tuning steps mandatory. The weighted discourse trees following our W-RST approach, on the other hand, do not require the definition of a threshold, resulting in a single, promising performance (red line) for both tasks in Figure 6. For comparison, we apply the generated trees of a standard RST-style discourse parser (here the Two-Stage parser by Wang et al. (2017)) trained on the RST-DT dataset (Carlson et al., 2002) to both downstream tasks. The fully-supervised parser reaches an accuracy of 44.77% for sentiment analysis and an average ROUGE score of 26.28 for summarization. While the average ROUGE score of the fully-supervised parser is above the performance of our W-RST results for the summarization task, the accuracy on the sentiment analysis task is well below our approach.
We believe that these results are a direct indication of the problematic domain adaptation of fully supervised discourse parsers: when applied to a similar domain (Wall Street Journal articles vs. CNN-Daily Mail articles), they achieve superior performance compared to our distantly supervised method; however, with larger domain shifts (Wall Street Journal articles vs. Yelp customer reviews), the performance drops significantly, allowing our distantly supervised model to outperform the supervised discourse trees on the downstream task. Arguably, this indicates that although our weighted approach is still not competitive with fully-supervised models in the same domain, it is the most promising solution available for cross-domain discourse parsing. With respect to exploring the weight alignment with human annotations, we show a set of confusion matrices based on human annotation for each W-RST discourse generation task on the absolute and relative weight-spread in Tables 2 and 3, respectively. The results for the sentiment analysis task are shown at the top of both tables, while the performance for the summarization task is shown at the bottom. For instance, the top right cell of the upper confusion matrix in Table 2 shows that for 19 sub-trees in the doubly annotated subset of RST-DT, one of the annotators labelled the sub-tree with a nucleus-nucleus nuclearity attribution, while the second annotator identified it as satellite-nucleus. The average weight spread (see §4.2) for those 19 sub-trees is −0.24. Regarding Table 3, we subtract the average spread across Table 2 from every cell and rescale the values. Moving to the analysis of the results, we find the following trends in this experiment: (1) As presented in Table 2, the sentiment analysis task tends to strongly over-predict S-N (i.e., w_l << w_r), leading to negative spreads in all cells. In contrast, the summarization task is heavily skewed towards N-S assignments (i.e., w_l >> w_r), leading to exclusively positive spreads.
We believe both trends are consistent with the intrinsic properties of the tasks, given that the general structure of reviews tends to become more important towards the end of a review (leading to increased S-N assignments), while for summarization, the lead bias potentially produces the overall strong nucleus-satellite trend.
(2) To investigate the relative weight spreads for different human annotations (i.e., between cells) beyond the trends shown in Table 2, we normalize the values within each table by subtracting the average and scaling to [−1, 1]. As a result, Table 3 shows the relative weight spread for different human annotations. Apart from the general trends described for Table 2, the consistently annotated samples of the two linguists (along the diagonal of the confusion matrices) align reasonably well. The most positive weight spread is consistently found in the agreed-upon nucleus-satellite case, while the nucleus-nucleus annotation has, as expected, the lowest divergence (i.e., closest to zero) along the diagonal in Table 3. (3) Regarding the inconsistently annotated samples (shown in the triangular matrix above the diagonal), it becomes clear that in the sentiment analysis model the values for the N-N/N-S and N-N/S-N annotated samples (top row in Table 3) are relatively close to the average value. This indicates that, similar to the nucleus-nucleus case, the weights are ambivalent, with the N-N/N-S value (top center) slightly larger than the value for N-N/S-N (top right). The N-S/S-N case for the sentiment analysis model is less aligned with our intuition, showing a strongly negative weight-spread (i.e., w_l << w_r) where we would have expected a more ambivalent result with w_l ≈ w_r (this is, however, aligned with the overall trend shown in Table 2). For summarization, we see a very similar trend for the N-N/N-S and N-N/S-N annotated samples. Again, both values are close to the average, with the N-N/N-S cell showing a more positive spread than N-N/S-N. However, for summarization, the consistent satellite-nucleus annotation (bottom right cell) seems misaligned with the rest of the table, following instead the general trend for summarization described in Table 2.
All in all, the results suggest that the values in most cells are well aligned with what we would expect regarding the relative spread. Interestingly, human uncertainty appears to be reasonably captured in the weights, which seem to contain more fine-grained information about the relative importance of sibling sub-trees.

Conclusion and Future Work
We propose W-RST as a new discourse framework, where the binary nuclearity assessment postulated by RST is replaced with more expressive weights that can be automatically generated from auxiliary tasks. A series of experiments indicate that W-RST is beneficial to the two key NLP downstream tasks of sentiment analysis and summarization. Further, we show that W-RST trees interestingly align with the uncertainty of human annotations.
For the future, we plan to develop a neural discourse parser that learns to predict importance weights instead of nuclearity attributions when trained on large W-RST treebanks. In the longer term, we want to explore other aspects of RST that can be refined in light of empirical results, plan to integrate our results into state-of-the-art sentiment analysis and summarization approaches (e.g. Xu et al. (2020)), and intend to generate parallel W-RST structures in a multi-task manner to improve the generality of the discourse trees.