Adapting Unsupervised Syntactic Parsing Methodology for Discourse Dependency Parsing

One of the main bottlenecks in developing discourse dependency parsers is the lack of annotated training data. A potential solution is to utilize abundant unlabeled data with unsupervised techniques, but there has so far been little research on unsupervised discourse dependency parsing. Fortunately, unsupervised syntactic dependency parsing has been studied for decades and could potentially be adapted for discourse parsing. In this paper, we propose a simple yet effective method to adapt unsupervised syntactic dependency parsing methodology for unsupervised discourse dependency parsing. We apply the method to adapt two state-of-the-art unsupervised syntactic dependency parsing methods. Experimental results demonstrate that our adaptation is effective. Moreover, we extend the adapted methods to the semi-supervised and supervised settings and, surprisingly, find that they outperform previous methods specifically designed for supervised discourse parsing. Further analysis shows that our adaptations result in superiority not only in parsing accuracy but also in time and space efficiency.


Introduction
Discourse parsing, which aims to find how the text spans in a document relate to each other, benefits various downstream tasks, such as machine translation evaluation, summarization (Marcu, 2000; Hirao et al., 2013), sentiment analysis (Bhatia et al., 2015; Huber and Carenini, 2020) and automated essay scoring (Miltsakaki and Kukich, 2004; Burstein et al., 2013). Researchers have made impressive progress on discourse parsing from the constituency perspective, which represents discourse structures as constituency trees (Ji and Eisenstein, 2014; Feng and Hirst, 2014; Joty et al., 2015; Nishida and Nakayama, 2020). However, as demonstrated by Morey et al. (2018), discourse structure can also be formulated as a dependency structure, and the constituency perspective may lead to ambiguous parses (Morey et al., 2018). All of this suggests that dependency discourse parsing is a promising alternative approach to discourse parsing.
One of the main bottlenecks in developing discourse dependency parsing methods is the lack of annotated training data, since the labeling effort is labor-intensive and time-consuming and requires well-trained experts with linguistic knowledge (Marcu et al., 1999). This problem can be tackled by employing unsupervised and semi-supervised methods that can utilize unlabeled data. However, while unsupervised methodology has been studied for decades in syntactic dependency parsing, little attention has been paid to its counterpart in discourse dependency parsing. Considering the similarity between syntactic and discourse dependency parsing, it is natural to expect that such methodology can be adapted from the former to the latter.
In this paper, we propose a simple yet effective adaptation method that can be readily applied to different unsupervised syntactic dependency parsing approaches. Adaptation from syntactic dependency parsing to discourse dependency parsing faces two challenges. First, unlike syntactic parsing, which has a finite vocabulary, discourse parsing has an unlimited number of elementary discourse units (EDUs). This makes it difficult, if not impossible, to directly apply syntactic approaches that require enumeration of words or word categories to discourse parsing. Second, in a discourse dependency parse tree, the dependencies within a sentence or a paragraph often form a complete subtree. Syntactic parsing approaches have no counterpart to this constraint. To address these two challenges, we cluster the EDUs to produce clusters resembling Part-Of-Speech (POS) tags in syntactic parsing, and we introduce the Hierarchical Eisner algorithm, which finds the optimal parse tree conforming to the constraint.
We applied our adaptation method to two state-of-the-art unsupervised syntactic dependency parsing models: the Neural Conditional Random Field Autoencoder (NCRFAE, Li and Tu (2020)) and the Variational Variant of the Discriminative Neural Dependency Model with Valences (V-DNDMV, Han et al. (2019)). In our experiments, the adapted models perform better than the baselines on both the RST Discourse Treebank (RST-DT, Carlson et al. (2001)) and SciDTB (Yang and Li, 2018) in the unsupervised setting. When we extend the two models to the semi-supervised and supervised settings, we find that they can outperform previous methods specifically designed for supervised discourse parsing.
Further analysis indicates that the Hierarchical Eisner algorithm shows superiority not only in parsing accuracy but also in time and space efficiency. Its empirical time and space complexity is close to $O(n^2)$, with $n$ being the number of EDUs, while the unconstrained algorithm adopted by most previous work has a complexity of $O(n^3)$. The code and trained models can be found at: https://github.com/Ehaschia/DiscourseDependencyParsing.

Related Work
Unsupervised syntactic dependency parsing Unsupervised syntactic dependency parsing is the task of finding syntactic dependency relations between the words in a sentence without guidance from annotations. The most popular approaches to this task are the Dependency Model with Valences (DMV, Klein and Manning (2004)), a generative model that learns a grammar over POS tags for dependency prediction, and its extensions. Jiang et al. (2016) employ neural networks to capture the similarities between POS tags ignored by the vanilla DMV, and Han et al. (2019) further amend the former with discriminative information obtained from an additional encoding network. Besides, there are also discriminative approaches that model the conditional probability or score of the dependency tree given the sentence, such as the CRF autoencoder method proposed by Cai et al. (2017).
Discourse dependency parsing There is limited work focusing on discourse dependency parsing. Li et al. (2014) propose an algorithm to convert constituency RST trees to dependency structures. In their algorithm, each non-terminal is assigned a head EDU, which is the head EDU of its leftmost nucleus child. Then, a dependency relation is created for each non-terminal from its head to its dependent, in a procedure similar to those designed for syntactic parsing. Hirao et al. (2013) propose another method that differs from the previous one in the processing of multinuclear relations. Yoshida et al. (2014) propose a dependency parser built around a Maximum Spanning Tree decoder and train it on dependency trees converted from RST-DT. Their parser achieved better performance on the summarization task than a similar constituency-based parser. Morey et al. (2018) review RST discourse parsing from the dependency perspective. They adapt the best discourse constituency parsing models up to 2018 to the dependency task. Yang and Li (2018) construct SciDTB, a discourse dependency treebank for scientific abstracts. To the best of our knowledge, we are the first to investigate unsupervised and semi-supervised discourse dependency parsing. Kobayashi et al. (2019) propose two unsupervised methods that build unlabeled constituent discourse trees by using the CKY dynamic programming algorithm. Their methods build the optimal tree in terms of a similarity (dissimilarity) score function that is defined for merging (splitting) text spans into larger (smaller) ones. Nishida and Nakayama (2020) use Viterbi EM with a margin-based criterion to train a span-based neural unsupervised constituency discourse parser. The performance of these unsupervised methods is close to that of previous supervised parsers.

Adaptation
We propose an adaptation method that can be readily integrated with different unsupervised syntactic dependency parsing approaches. First, we cluster the elementary discourse units (EDUs) to produce clusters resembling the POS tags or words used in syntactic parsing. This is necessary because many unsupervised syntactic parsers require enumeration of words or word categories, typically in modeling multinomial distributions, as we shall see in Section 4. While EDUs, which are sequences of words, cannot be enumerated, their clusters can. During parsing, we apply the Hierarchical Eisner algorithm, a novel modified version of the classic Eisner algorithm, to produce discourse dependency parse trees that conform to the constraint that every sentence or paragraph should correspond to a complete subtree.

Clustering
Given an input document represented as an EDU sequence $x_1, x_2, \ldots, x_n$, we can use word embeddings or contextualized word embeddings to obtain the vector representation $\mathbf{x}_i$ of the $i$-th EDU $x_i$. Specifically, we use BERT (Devlin et al., 2019) to encode each word. Let $\mathbf{w}_i$ be the encoding of the $i$-th word in the document. For an EDU $x_i$ spanning from word position $b$ to $e$, we follow Toshniwal et al. (2020) and concatenate the encodings of the endpoints to form its representation: $\mathbf{x}_i = [\mathbf{w}_b; \mathbf{w}_e]$. With the representations of all EDUs from the whole training corpus obtained, we use K-Means (Lloyd, 1982) to cluster them. Let $c_i$ be the cluster label of $x_i$.
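As a concrete illustration, the clustering step can be sketched as follows. This is a minimal example assuming pre-computed BERT token encodings; the function and variable names (edu_representations, cluster_edus, word_encodings, edu_spans) are illustrative, not taken from our released code.

```python
# Minimal sketch of the EDU clustering step (illustrative, not the released implementation).
# Assumes `word_encodings` is a (num_words, hidden) array of BERT token encodings for the
# corpus and `edu_spans` lists (begin, end) word positions for each EDU.
import numpy as np
from sklearn.cluster import KMeans

def edu_representations(word_encodings, edu_spans):
    """Concatenate the encodings of an EDU's endpoint words (Toshniwal et al., 2020)."""
    reps = [np.concatenate([word_encodings[b], word_encodings[e]]) for b, e in edu_spans]
    return np.stack(reps)

def cluster_edus(word_encodings, edu_spans, n_clusters=50, seed=0):
    """Return one cluster label c_i per EDU, used in place of POS tags."""
    reps = edu_representations(word_encodings, edu_spans)
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    return kmeans.fit_predict(reps)

# Example with random stand-in encodings:
rng = np.random.default_rng(0)
word_encodings = rng.normal(size=(200, 768))        # 200 words, BERT-base hidden size
edu_spans = [(i, i + 3) for i in range(0, 196, 4)]  # toy 4-word EDUs
labels = cluster_edus(word_encodings, edu_spans, n_clusters=5)
print(labels[:10])
```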

Hierarchical Eisner Algorithm
The Eisner algorithm (Eisner, 1996) is a dynamic programming algorithm widely used to find the optimal syntactic dependency parse tree. Its basic idea is to parse the left and right dependents of a token independently and combine them at a later stage. Algorithm 1 shows the pseudo-code of the Eisner algorithm. Here $C_{i \rightarrow j}$ represents a complete span, which consists of a head token $i$ and all of its descendants on one side, and $I_{i \rightarrow j}$ represents an incomplete span, which consists of a head $i$ and part of its descendants on one side and can be extended by adding more descendants to that side.
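For reference, the classic algorithm can be sketched in a few dozen lines of Python. This is a generic first-order Eisner implementation over an arc-score matrix (scores[h, d] scores an arc from head h to dependent d, with position 0 acting as the root); it is a sketch with our own variable names, not a transcription of Algorithm 1.

```python
import numpy as np

def eisner(scores):
    """First-order Eisner algorithm over an arc-score matrix.
    scores[h, d] is the score of an arc from head h to dependent d; position 0 is the root.
    Returns heads[i] (the head index of token i), with heads[0] = -1 for the root."""
    n = scores.shape[0]
    NEG = float("-inf")
    # chart[i, j, d]: d = 1 means the head is at the left end i, d = 0 means it is at the right end j
    complete = np.full((n, n, 2), NEG)
    incomplete = np.full((n, n, 2), NEG)
    complete[np.arange(n), np.arange(n), :] = 0.0
    comp_bp = np.zeros((n, n, 2), dtype=int)
    incomp_bp = np.zeros((n, n, 2), dtype=int)

    for length in range(1, n):
        for i in range(n - length):
            j = i + length
            # build incomplete spans: add the arc i -> j (d = 1) or j -> i (d = 0)
            for k in range(i, j):
                base = complete[i, k, 1] + complete[k + 1, j, 0]
                if base + scores[j, i] > incomplete[i, j, 0]:
                    incomplete[i, j, 0] = base + scores[j, i]
                    incomp_bp[i, j, 0] = k
                if base + scores[i, j] > incomplete[i, j, 1]:
                    incomplete[i, j, 1] = base + scores[i, j]
                    incomp_bp[i, j, 1] = k
            # build complete spans by attaching a complete span to an incomplete one
            for k in range(i, j):
                if complete[i, k, 0] + incomplete[k, j, 0] > complete[i, j, 0]:
                    complete[i, j, 0] = complete[i, k, 0] + incomplete[k, j, 0]
                    comp_bp[i, j, 0] = k
            for k in range(i + 1, j + 1):
                if incomplete[i, k, 1] + complete[k, j, 1] > complete[i, j, 1]:
                    complete[i, j, 1] = incomplete[i, k, 1] + complete[k, j, 1]
                    comp_bp[i, j, 1] = k

    heads = [-1] * n

    def backtrack(i, j, d, is_complete):
        if i == j:
            return
        if is_complete:
            k = comp_bp[i, j, d]
            if d == 1:
                backtrack(i, k, 1, False)
                backtrack(k, j, 1, True)
            else:
                backtrack(i, k, 0, True)
                backtrack(k, j, 0, False)
        else:
            heads[j if d == 1 else i] = i if d == 1 else j
            k = incomp_bp[i, j, d]
            backtrack(i, k, 1, True)
            backtrack(k + 1, j, 0, True)

    backtrack(0, n - 1, 1, True)
    return heads

# toy usage: 1 root position + 4 EDUs with random arc scores
toy_scores = np.random.default_rng(0).normal(size=(5, 5))
print(eisner(toy_scores))  # a projective head list rooted at position 0
```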

            Train   Dev.   Test
RST-DT      2.6     -      3.0
SciDTB      0.12    0.14   0.14

Table 1: The percentage of dependencies violating the constraint that each sentence or paragraph corresponds to a subtree.

Discourse documents, however, demonstrate structural characteristics not taken into account by the Eisner algorithm. Specifically, a document has a hierarchical structure that divides the document into paragraphs, each paragraph into sentences, and finally each sentence into EDUs, and the discourse parse tree should be consistent with this hierarchical structure. Equivalently, in a discourse parse tree, every sentence or paragraph should be exactly covered by a complete subtree, as in Figure 1. We empirically find that this constraint is satisfied by most of the gold discourse parses in the RST Discourse Treebank (RST-DT, Carlson et al. (2001)) and SciDTB (Yang and Li, 2018) datasets (Table 1).
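Statistics like those in Table 1 can be gathered by counting, per sentence (or paragraph), the EDUs whose head lies outside the unit; the constraint allows at most one such EDU per unit. The following sketch illustrates one way to count violating arcs under hypothetical input conventions (a head-index list and a unit-id list); it is not the script used to produce Table 1.

```python
from collections import defaultdict

def violating_dependencies(heads, unit_of):
    """heads[i] = head EDU index of EDU i (-1 for the document root);
    unit_of[i] = sentence (or paragraph) id of EDU i.
    Returns the number of arcs beyond the single externally headed EDU allowed per unit;
    dividing by the total number of arcs gives a percentage as in Table 1."""
    external = defaultdict(list)  # unit id -> EDUs whose head lies outside the unit
    for dep, head in enumerate(heads):
        if head == -1:
            continue
        if unit_of[dep] != unit_of[head]:
            external[unit_of[dep]].append(dep)
    # each unit may have exactly one externally headed EDU (its local root);
    # every additional one violates the subtree constraint
    return sum(max(0, len(v) - 1) for v in external.values())

# toy example: 5 EDUs, sentence ids [0, 0, 0, 1, 1]
print(violating_dependencies(heads=[-1, 0, 1, 0, 2], unit_of=[0, 0, 0, 1, 1]))  # -> 1
```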
We therefore propose the Hierarchical Eisner algorithm, a novel modification of the Eisner algorithm that incorporates the constraint. Our new algorithm has almost the same state transition formulas as the Eisner algorithm, except for a few changes brought by the hierarchical constraint. Concretely, our algorithm finds the optimal parse tree in a bottom-up way and divides the process into three steps: intra-sentence parsing, intra-paragraph parsing, and intra-document parsing. In the intra-sentence parsing step, we run the original Eisner algorithm, except that we need not form a tree. Then, in the intra-paragraph step, we combine all intra-sentence spans in the paragraph, under the constraint that in every sentence there can be only one EDU whose head does not belong to that sentence. To achieve this, we modify the state transition equations (steps 6-9 in Algorithm 1) to prune invalid arcs, as shown in Algorithm 2, where $E$ is the set of indices of sentence end boundaries and $B$ is the set of indices of sentence begin boundaries. Figure 2 shows some cases that arise when merging across sentence spans. Case 1 is valid because the constraint is satisfied. Case 2 is invalid because the head of EDU $e_6$ cannot be $e_4$ or $e_5$, hence the constraint is violated. From these cases, we can see that for an incomplete span $I_{i \rightarrow k}$ and a complete span $C_{k \rightarrow j}$ across sentences, we only merge them when $j$ is at the end boundary of a sentence, as Algorithm 2 shows. After the intra-paragraph step, we move to the intra-document step to combine paragraph-level spans following the same procedure as in the intra-paragraph step and form the final document-level tree.
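In implementation terms, the hierarchical constraint amounts to an extra guard inside the chart loops of the Eisner sketch above. The snippet below illustrates the pruning rule with assumed names (sent_end and sent_begin play the roles of E and B); it is a sketch of the idea, not the full Hierarchical Eisner implementation.

```python
def merge_allowed(i, j, d, sent_end, sent_begin, sent_id):
    """Guard applied inside the chart loops when the candidate span (i, j) crosses a
    sentence boundary (sent_id[i] != sent_id[j]).
    d = 1: the head is at the left end i, so the right end j must close a sentence;
    d = 0: the head is at the right end j, so the left end i must open a sentence.
    Within-sentence spans are never pruned."""
    if sent_id[i] == sent_id[j]:
        return True
    return (j in sent_end) if d == 1 else (i in sent_begin)

# toy check: EDUs 0..4 in sentences [0, 0, 0, 1, 1]; sentence ends {2, 4}, begins {0, 3}
print(merge_allowed(1, 3, 1, {2, 4}, {0, 3}, [0, 0, 0, 1, 1]))  # False: 3 does not end a sentence
print(merge_allowed(1, 4, 1, {2, 4}, {0, 3}, [0, 0, 0, 1, 1]))  # True: 4 ends a sentence
```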
Our method has a lower time complexity than the original Eisner algorithm. Suppose a document has $k_p$ paragraphs, each paragraph has $k_s$ sentences, and each sentence has $k_e$ EDUs. The time complexity of the original Eisner algorithm is $O(k_p^3 k_s^3 k_e^3)$, while that of our Hierarchical Eisner algorithm is substantially lower; empirically, its time and space complexity is close to $O(n^2)$ in the total number of EDUs $n$, as shown in our analysis below.

Model
We adapt two current state-of-the-art models in unsupervised syntactic dependency parsing for discourse parsing. One is the Neural CRF Autoencoder (NCRFAE, Li and Tu (2020)) and the other is the Variational Variant of the Discriminative Neural Dependency Model with Valences (V-DNDMV, Han et al. (2019)).

Neural CRF Autoencoder
A CRF autoencoder (Ammar et al., 2014) consists of an encoder and a decoder. The encoder predicts a hidden structure, such as a discourse dependency tree in our task, from the input, and the decoder tries to reconstruct the input from the hidden structure. In a neuralized CRF autoencoder, we employ neural networks as the encoder and/or decoder. We use the widely used biaffine dependency parser (Dozat and Manning, 2017) as the encoder to compute the hidden structure distribution $P_\Phi(y|x)$, parameterized with $\Phi$. Here $y$ represents the hidden structure and $x$ is the input document. We feed the input document $x$ into a Bi-LSTM network to produce a contextual representation $r_i$ for each EDU, and then feed $r_i$ into two MLP networks to produce two continuous vectors $v_i^{head}$ and $v_i^{dep}$, representing the $i$-th EDU being used as a dependency head and as a dependent, respectively.
A biaffine function is used to compute the score matrix $s$. Each matrix element $s_{ij}$, the score for a dependency arc pointing from $x_i$ to $x_j$, is computed as follows:
$$s_{ij} = (v_i^{head})^{\top} W v_j^{dep} + b,$$
where $W$ is the parameter matrix and $b$ is the bias.
Following Dozat and Manning (2017), we formulate $P_\Phi(y|x)$ as a head selection process that selects the dependency head of each EDU independently:
$$P_\Phi(y|x) = \prod_{i=1}^{n} P(h_i|x),$$
where $h_i$ is the index of the head of EDU $x_i$ and $P(h_i|x)$ is computed by a softmax function over the scores $s_{ij}$:
$$P(h_i = j|x) = \frac{\exp(s_{ji})}{\sum_{k} \exp(s_{ki})}.$$
The decoder, parameterized with $\Lambda$, computes $P_\Lambda(\hat{x}|y)$, the probability of the reconstructed document $\hat{x}$ given the parse tree $y$. Following Cai et al. (2017) and Li and Tu (2020), we independently predict each EDU $\hat{x}_i$ from its head specified by $y$. Since EDUs cannot be enumerated, we reformulate the process as predicting the EDU cluster $\hat{c}_i$ given its dependency head's cluster $c_{h_i}$. Our decoder simply specifies a categorical distribution $P(\hat{c}_i|c_{h_i})$ for each possible EDU cluster and computes the reconstruction probability as follows:
$$P_\Lambda(\hat{x}|y) = \prod_{i=1}^{n} P(\hat{c}_i|c_{h_i}).$$
We obtain the final reconstruction distribution by cascading the encoder and decoder distributions:
$$P_{\Phi,\Lambda}(\hat{x}, y|x) = P_\Phi(y|x)\, P_\Lambda(\hat{x}|y).$$
The best parse is obtained by maximizing $P_{\Phi,\Lambda}(\hat{x}, y|x)$ with $\hat{x}$ set to the input document $x$:
$$y^{*} = \arg\max_{y}\; P_{\Phi,\Lambda}(\hat{x}=x, y|x).$$
We consider the general case of training the CRF autoencoder with a dataset $D$ containing both labelled data $L$ and unlabelled data $U$. Purely supervised or unsupervised learning can be seen as special cases of this setting. The loss function $\mathcal{L}(D)$ consists of a labelled loss $\mathcal{L}_l(L)$ and an unlabelled loss $\mathcal{L}_u(U)$:
$$\mathcal{L}(D) = \mathcal{L}_l(L) + \alpha\, \mathcal{L}_u(U),$$
where $\alpha$ is the hyperparameter weighting the importance of the two parts.
For the labelled data, where the gold parse trees $y^{*}$ are known, the labelled loss is:
$$\mathcal{L}_l(L) = -\sum_{(x, y^{*}) \in L} \log P_{\Phi,\Lambda}(\hat{x}=x, y^{*}|x).$$
For the unlabelled data, where the gold parses are unknown, the unlabelled loss marginalizes over all possible parse trees:
$$\mathcal{L}_u(U) = -\sum_{x \in U} \log \sum_{y} P_{\Phi,\Lambda}(\hat{x}=x, y|x).$$
We optimize the encoder parameters $\Phi$ and decoder parameters $\Lambda$ jointly with gradient descent methods.
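A minimal PyTorch-style sketch of the encoder's arc scoring and head-selection distribution is given below; the dimensions, module names and the exact form of the bias term are illustrative assumptions rather than the configuration used in our experiments.

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Scores s[i, j] for an arc from EDU i (head) to EDU j (dependent)."""
    def __init__(self, enc_dim=400, arc_dim=200):
        super().__init__()
        self.mlp_head = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.mlp_dep = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.W = nn.Parameter(torch.zeros(arc_dim, arc_dim))
        self.b = nn.Parameter(torch.zeros(arc_dim))   # head-side bias vector (an assumption)
        nn.init.xavier_uniform_(self.W)

    def forward(self, r):                      # r: (n_edus, enc_dim) Bi-LSTM EDU states
        v_head = self.mlp_head(r)              # (n, arc_dim)
        v_dep = self.mlp_dep(r)                # (n, arc_dim)
        s = v_head @ self.W @ v_dep.T + (v_head @ self.b).unsqueeze(1)
        return s                               # s[i, j]: score of arc i -> j

def head_distribution(s):
    """P(h_i = j | x): softmax over candidate heads j of s[j, i]; row i sums to 1."""
    return torch.softmax(s, dim=0).T

scorer = BiaffineArcScorer()
r = torch.randn(6, 400)                        # 6 EDUs with 400-dim contextual states
probs = head_distribution(scorer(r))
print(probs.shape, probs.sum(dim=1))           # torch.Size([6, 6]) and rows summing to 1
```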

Variational Variant of DNDMV
V-DNDMV is a variational autoencoder model composed of an encoder and a decoder. The encoder is a Bi-LSTM that takes the input document and produces the parameters of a Gaussian distribution from which a continuous vector $s$ summarizing the document is sampled.
The decoder models the joint probability of the document and its discourse dependency tree conditioned on $s$ with a generative grammar. The grammar is defined on a finite set of discrete symbols, so in our adapted model, input documents are represented by EDU clusters instead of EDUs, which are infinite in number. There are three types of grammar rules, each associated with a set of probability distributions: ROOT, CHILD and DECISION. To generate a document, we first sample from the ROOT distribution $P_{ROOT}(chd|s)$ to determine the cluster label of the head EDU of the document, and then recursively decide whether to generate a new child EDU cluster and which child EDU cluster to generate by sampling from the DECISION distribution $P_{DECISION}(dec|h, dir, val, s)$ and the CHILD distribution $P_{CHILD}(chd|h, dir, val, s)$. Here $dir$ denotes the generation direction (i.e., left or right), $val$ is a binary variable denoting whether the current EDU already has a child in direction $dir$, $dec$ is a binary variable indicating whether to continue generating a child EDU, and $h$ and $chd$ denote the parent and child EDU clusters, respectively. We use neural networks to calculate these distributions. The input to the networks is the continuous vector or matrix representations of grammar rule components such as $h$, $chd$, $val$ and $dir$, as well as the document vector $s$ produced by the encoder.
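The joint probability of a document and a tree under this grammar factorizes over the ROOT, DECISION and CHILD rules used, as formalized in the next paragraph. The sketch below computes this factorized probability for a given tree, with toy dictionaries standing in for the neural distributions conditioned on s; the names and dictionary layout are illustrative assumptions, not our implementation.

```python
import math

def dmv_tree_log_prob(clusters, heads, p_root, p_child, p_decision):
    """Log P(x, y | s) under the rule factorization.
    clusters[i]: cluster label of the i-th EDU; heads[i]: head index (-1 for the root EDU).
    p_root[c], p_child[(h_cluster, direction, val)][c] and
    p_decision[(h_cluster, direction, val)][dec] are toy dictionaries standing in for
    the neural ROOT / CHILD / DECISION distributions conditioned on s."""
    logp = 0.0
    for i, h in enumerate(heads):
        if h == -1:                              # ROOT rule
            logp += math.log(p_root[clusters[i]])
    for h in range(len(clusters)):
        left = sorted((d for d, hh in enumerate(heads) if hh == h and d < h), reverse=True)
        right = sorted(d for d, hh in enumerate(heads) if hh == h and d > h)
        for direction, deps in (("left", left), ("right", right)):
            val = 0                              # has a child already been generated this way?
            for d in deps:                       # children are generated outwards from the head
                logp += math.log(p_decision[(clusters[h], direction, val)]["continue"])
                logp += math.log(p_child[(clusters[h], direction, val)][clusters[d]])
                val = 1
            logp += math.log(p_decision[(clusters[h], direction, val)]["stop"])
    return logp
```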
The training objective for learning the model is the probability of the training data. The intermediate continuous vector $s$ and the hidden variable representing the dependency tree are both marginalized. Since the marginalized probability cannot be calculated exactly, V-DNDMV maximizes the Evidence Lower Bound (ELBO), a lower bound of the marginalized probability. The ELBO consists of the conditional likelihood of the training data and a regularisation term given by the KL divergence between $P_\Theta(s|x)$ and $P(s)$ (which is a standard Gaussian). The conditional likelihood is as follows:
$$\sum_{i=1}^{N} \log \sum_{y \in \mathcal{Y}(x_i)} P_\Theta(x_i, y|s_i).$$
Here $N$ is the number of training samples, $y$ is the dependency tree, and $\mathcal{Y}(x)$ is the set of all possible dependency trees of $x$. $\Theta$ denotes the parameters of the neural networks. We can rewrite the conditional probability as follows:
$$P_\Theta(x, y|s) = \prod_{r \in (x, y)} P_\Theta(r|s),$$
where $r$ ranges over the grammar rules involved in generating $x$ along with $y$. We optimize the ELBO using the expectation-maximization (EM) algorithm, alternating the E-step and the M-step. In the E-step, we fix the rule parameters and use our Hierarchical Eisner algorithm to compute the expectation over possible dependency trees $y$, which gives the expected counts of the rules used in the training samples. In the M-step, the expected rule counts computed in the E-step are used to train the prediction neural networks with gradient descent methods. The regularisation term is also optimized with gradient descent in the M-step. After training, the parsing result $y^{*}$ of a new test case $x$ is obtained as:
$$y^{*} = \arg\max_{y \in \mathcal{Y}(x)} P_\Theta(x, y|s).$$

Hyper-parameters For our NCRFAE model, we adopt the hyper-parameters of Li and Tu (2020). For our V-DNDMV model, we adopt the hyper-parameters of Han et al. (2019). We use Adam (Kingma and Ba, 2015) to optimize our objective functions. Experimental details are provided in Appendix A.

Main Result
We compare our methods with the following baselines. Right Branching (RB) is a rule-based method. Given a sequence of elements (i.e., EDUs or subtrees), RB generates a left-to-right chain structure, i.e., $x_1 \rightarrow x_2, x_2 \rightarrow x_3, \cdots$. In order to develop a strong baseline, we include the hierarchical constraint introduced in Section 3.2 in this procedure. That is, we first build sentence-level discourse trees using the right branching method based on sentence segmentation. Then we build paragraph-level trees using the right branching method to form a left-to-right chain of sentence-level subtrees. Finally, we obtain document-level trees in the same way, as sketched in the example below. Since this method has three stages, we call it "RB RB RB". This simple procedure forms a strong baseline in terms of performance. As Nishida and Nakayama (2020) report, the unlabeled F1 score of the constituent structures produced by RB RB RB reaches 79.9 on RST-DT; correspondingly, the performance of the supervised method proposed by Joty et al. (2015) is 82.5.
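A small sketch of the three-stage construction, with an assumed nested-list document representation, is shown below; it is illustrative rather than the exact baseline script.

```python
def right_branching_chain(roots):
    """Link a list of element roots into a left-to-right chain x1 -> x2 -> x3 ..."""
    arcs = [(roots[i], roots[i + 1]) for i in range(len(roots) - 1)]
    return roots[0], arcs  # head of the chain, plus the chain arcs

def rb_rb_rb(doc):
    """doc: list of paragraphs, each a list of sentences, each a list of EDU ids.
    Returns the document root and the dependency arcs (head_edu, dep_edu) of the RB RB RB tree."""
    arcs, par_roots = [], []
    for paragraph in doc:
        sent_roots = []
        for sentence in paragraph:
            root, chain = right_branching_chain(sentence)     # EDU-level chain
            arcs += chain
            sent_roots.append(root)
        root, chain = right_branching_chain(sent_roots)       # sentence-level chain
        arcs += chain
        par_roots.append(root)
    doc_root, chain = right_branching_chain(par_roots)        # paragraph-level chain
    arcs += chain
    return doc_root, arcs

# toy document: 2 paragraphs, EDU ids 0..5
doc = [[[0, 1], [2, 3]], [[4, 5]]]
print(rb_rb_rb(doc))  # (0, [(0, 1), (2, 3), (0, 2), (4, 5), (0, 4)])
```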
NISHIDA20 is a neural model for unsupervised discourse constituency parsing proposed by Nishida and Nakayama (2020). This model runs a CKY parser that uses a Bi-LSTM to learn representations of text spans, complemented with lexical, syntactic and structural features. We convert its output to dependency structures using the same conversion method as Li et al. (2014). To make a fair comparison, we use RB RB RB to initialize their model instead of RB* RB RB as in their paper, where RB* means using predicted syntactic structures for initialization at the sentence level.
Compared with the baselines, our two adapted models NCRFAE and V-DNDMV both achieve better performance on the two datasets. The results also show that the generative model V-DNDMV is better than the discriminative model NCRFAE in the unsupervised setting.
We also investigate the semi-supervised setting on the SciDTB dataset, training our adapted models with varied ratios of labeled to unlabeled data. Experimental results are shown in Figure 3, which indicate that NCRFAE outperforms V-DNDMV for all ratios. Even when trained with only a little labeled data (a labeled ratio of 0.01 in SciDTB, only about 7 samples), the discriminative model already outperforms the generative model significantly. Besides that, we also find that our semi-supervised methods reach higher UAS scores than their supervised versions (trained with labeled data only) for all labeled/unlabeled data ratios. Inspired by the promising results in the semi-supervised setting, we also investigate the performance of our adapted NCRFAE and V-DNDMV in the fully supervised setting. The results are shown in Table 3. We evaluate our models on the RST-DT and SciDTB datasets and compare them with eight models. NIVRE04 (Nivre et al., 2004) and WANG17 (Wang et al., 2017) are two transition-based models for dependency parsing; Yang and Li (2018) adapt them to discourse dependency parsing. FENG14 (Feng and Hirst, 2014), JI14‡ (Ji and Eisenstein, 2014), JOTY15 (Joty et al., 2015) and BRAUD17 (Braud et al., 2017) are methods for discourse constituency parsing, adapted for discourse dependency parsing by Morey et al. (2018). LI14 (Li et al., 2014) and MOREY18 (Morey et al., 2018) are graph-based and transition-based methods specially designed for discourse dependency parsing, respectively. These models are statistical or simple neural models, and they do not use pretrained language models (such as BERT or ELMo (Peters et al., 2018)) to extract features. (‡: We correct their evaluation metrics, so the result is different from the original paper (Li et al., 2014).)
As Table 3 shows, the performance of our NCRFAE is significantly better than that of the baseline models. In particular, the UAS and LAS of NCRFAE are 8.9 points and 11.5 points higher than those of the best baseline models on the SciDTB dataset, respectively. Besides that, we find that V-DNDMV also beats the baselines on the SciDTB dataset and reaches comparable results on RST-DT. We also test our approaches without using BERT and find that they still outperform the baselines; for example, the performance of NCRFAE with GloVe (Pennington et al., 2014) on SciDTB averaged over 5 runs is 73.9 UAS and 55.5 LAS. These results again give evidence for our success in adapting unsupervised syntactic dependency parsing methods to discourse dependency parsing, as the adapted methods not only work in the unsupervised setting but also reach state-of-the-art performance in the supervised setting.
As for the performance gap between V-DNDMV and NCRFAE, we believe that the main reason is their different abilities to extract contextual features from the input text for the parsing task. As a generative model, the decoder of V-DNDMV follows a strong assumption that each token in the input text is generated independently, which prevents the contextual features from being used directly. Instead, contextual features are mixed with other information in the document representation, which acts as the condition of the generation process in the model. NCRFAE, on the other hand, employs a discriminative parser that leverages contextual features for dependency structure prediction directly. Thus, as long as there is sufficient labeled data, NCRFAE can achieve much better results than V-DNDMV. We have observed a similar phenomenon in syntactic parsing.
Significance test We investigate the significance of the performance improvements in every setting. For unsupervised parsing, we perform a t-test between the strongest baseline, RB RB RB, and V-DNDMV. The t-value and p-value calculated over 10 runs are 2.86 and 0.00104, which shows that the improvement is significant. For the semi-supervised results, we also perform significance tests between the semi-supervised and supervised-only results. The results show that our semi-supervised method significantly outperforms the supervised-only method; for example, in the 0.5:0.5 setting, the t-value is 2.13 and the p-value is 0.04767. For the fully supervised setting, due to a lack of code from previous work, it is currently difficult for us to carry out a significance analysis. Instead, we show that our models are very stable and consistently outperform the baselines by running them 10 times; for example, our NCRFAE UAS score is 78.95±0.29 on the SciDTB dataset.
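Such a test can be run with scipy; the sketch below uses an independent two-sample t-test over placeholder per-run scores (whether the test is paired is not specified above, and the score arrays are placeholders, not our actual per-run results).

```python
import numpy as np
from scipy import stats

# Hypothetical per-run UAS scores for two systems over 10 runs (placeholders only).
scores_a = np.array([56.1, 55.8, 56.4, 55.9, 56.2, 56.0, 55.7, 56.3, 56.1, 55.9])
scores_b = np.array([55.2, 55.0, 55.4, 55.1, 55.3, 54.9, 55.2, 55.5, 55.0, 55.1])
t_value, p_value = stats.ttest_ind(scores_a, scores_b)   # independent two-sample t-test
print(f"t = {t_value:.2f}, p = {p_value:.5f}")
```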

Eisner vs. Hierarchical Eisner
In the left part of Figure 4 we show the curves of the time cost of the hierarchical and traditional Eisner algorithms against the RST-DT document length.
The experiments are run on servers equipped with NVIDIA Titan V GPUs. We can clearly observe that the curve of the Hierarchical Eisner algorithm always stays far below that of the Eisner algorithm, which verifies our theoretical analysis of the time complexity of the Hierarchical Eisner algorithm in Section 3.2. The right part of Figure 4 demonstrates a similar phenomenon, where we illustrate the memory usage of the hierarchical and traditional Eisner algorithms against the training document length in the same computing environment. From the curves of these two figures, we can conclude that our Hierarchical Eisner algorithm has an advantage over the traditional one in both time and space efficiency.
Besides the superiority in computational efficiency, our experiments also indicate that our Hierarchical Eisner algorithm can achieve better performance than the traditional one. With other conditions fixed, the UAS produced by Hierarchical Eisner is 79.1 in the task of supervised discourse parsing on the SciDTB dataset while the corresponding result of the Eisner algorithm is 78.6.

Number of clusters
To explore a suitable number of EDU clusters, we evaluate our NCRFAE model with different cluster numbers from 10 to 100.

Clusters   10     30     50     100
UAS        52.7   53.9   54.6   53.5

Table 4: UAS of our NCRFAE model with different numbers of EDU clusters.

As Table 4 shows, there is an upward trend as the number of clusters increases from 10 to 50. After reaching the peak, the UAS decreases as the number of clusters continues to increase. We thus choose 50 clusters for our experiments.

Label analysis
In order to inspect whether there are coherent relations between the EDU clusters obtained for our adaptation and the labels of dependency arcs, similar to the relation between POS tags and syntactic dependency labels, we compute the co-appearance distribution of cluster labels and dependency arc labels. In Figure 5, we show the probabilities of the clusters being used as heads, $p_{head}(c_k|r_m)$, and as children, $p_{child}(c_k|r_m)$, given different dependency types. Here $c_k$ and $r_m$ represent different types of clusters and relations. We cluster the EDUs into 10 clusters and only show a subset of them. Detailed heat-maps can be found in Appendix B.
By observing the two heat-maps, we notice clear trends: for each dependency arc label, the co-appearance probabilities are concentrated on certain cluster labels. For example, when clusters are used as dependency heads, more than 60% of the co-appearance probability for the arc labels COMPARISON and SAME-UNIT is concentrated on cluster types 9 and 6, respectively; when clusters are used as dependency children, cluster type 1 receives more than 40% of the co-appearance probability for certain arc labels. The property displayed by the adaptation clusters is very similar to that of POS tags, which justifies the clustering strategy we adopt for discourse parsing.
To further quantify the coherence between the adaptation clusters and dependency arcs, we evaluate the mutual information between two discrete random variables on the training set of SciDTB: one is the tuple consisting of the two cluster labels of a pair of EDUs in a training sample, representing the dependency head and child respectively; the other is the binary random variable indicating whether there exists a dependency arc between the EDU pair in the training data. Besides our adaptation clusters, we also evaluate this metric for two other clustering strategies, random clustering and NICE (He et al., 2018), for comparison, and show the results in Table 5. We see that, measured by mutual information, the clusters produced by our clustering strategy are much more coherent with the dependencies than those of the other strategies.
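Concretely, this mutual information can be estimated as below; the example uses sklearn's mutual_info_score over a toy list of (cluster-pair, arc-indicator) observations, with hypothetical input conventions rather than our actual preprocessing.

```python
from sklearn.metrics import mutual_info_score

def cluster_arc_mutual_information(examples):
    """examples: list of ((head_cluster, child_cluster), has_arc) pairs collected over
    ordered EDU pairs in the training documents; has_arc is 1 if a dependency connects them.
    Returns the mutual information (in nats) between the cluster-pair variable and the arc indicator."""
    pair_labels = [f"{h}-{c}" for (h, c), _ in examples]
    arc_labels = [a for _, a in examples]
    return mutual_info_score(pair_labels, arc_labels)

# toy example with two clusters
toy = [((0, 1), 1), ((0, 1), 1), ((1, 0), 0), ((1, 1), 0), ((0, 0), 0), ((0, 1), 1)]
print(cluster_arc_mutual_information(toy))
```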

Conclusion
In this paper, we propose a method to adapt unsupervised syntactic parsing methods for discourse dependency parsing. First, we cluster the elementary discourse units (EDUs) to produce clusters resembling POS tags. Second, we modify the Eisner algorithm used for finding the optimal parse tree with a hierarchical constraint. We apply the adaptations to two unsupervised syntactic dependency parsing methods. Experimental results show that our method successfully adapts the two models for discourse dependency parsing, and the adapted models demonstrate advantages in both parsing accuracy and running efficiency.

A Experimental Details for Our NCRFAE and V-DNDMV
We implement our NCRFAE and V-DNDMV models with PyTorch 1.6 and Python 3.8.3. We run our experiments on a server with an Intel(R) Xeon(R) Gold 5115 CPU and an NVIDIA Titan V GPU. In this software and hardware environment, training our NCRFAE and V-DNDMV models on the SciDTB dataset takes about 30 and 45 minutes, respectively, and training them on the RST-DT dataset takes about 4 and 18 hours, respectively. The number of parameters in NCRFAE is about 8.26 million, and the number of parameters in V-DNDMV is 0.47 million. The hyperparameter configurations for the results reported in our paper are shown in Table 6. We choose the hyperparameter configurations by manual tuning, using the UAS score on the development set to select among them. Since RST-DT lacks a development set, we prepare one with 20 instances randomly sampled from the training set. The size of each dataset is shown in Table 7.