A Language Model-based Generative Classifier for Sentence-level Discourse Parsing

Discourse segmentation and sentence-level discourse parsing play important roles in various NLP tasks that consider textual coherence. Despite recent achievements in both tasks, there is still room for improvement due to the scarcity of labeled data. To address this problem, we propose a language model-based generative classifier (LMGC) that exploits more information from labels by treating them as part of the input, while enhancing label representations by embedding a description of each label. Moreover, since this lets LMGC prepare representations for labels that were unseen in the pre-training step, we can effectively use a pre-trained language model in LMGC. Experimental results on the RST-DT dataset show that our LMGC achieved a state-of-the-art F1 score of 96.72 in discourse segmentation. It further achieved state-of-the-art relation F1 scores of 84.69 with gold EDU boundaries and 81.18 with automatically segmented boundaries in sentence-level discourse parsing.


Introduction
Textual coherence is essential for writing a natural language text that is comprehensible to readers. To recognize the coherent structure of a natural language text, Rhetorical Structure Theory (RST) is applied to describe the internal discourse structure of the text as a constituent tree (Mann and Thompson, 1988). A discourse tree in RST consists of elementary discourse units (EDUs), spans that describe recursive connections between EDUs, and nuclearity and relation labels that describe the relationship for each connection. Figure 1 (a) shows an example RST discourse tree. A span including one or more EDUs is a node of the tree. Given two adjacent non-overlapping spans, their nuclearity can be either nucleus or satellite, denoted by N and S, where the nucleus represents a more salient or essential piece of information than the satellite. Furthermore, a relation label, such as Attribution and Elaboration, describes the relation between the given spans (Mann and Thompson, 1988; Carlson and Marcu, 2001). To build such trees, RST parsing consists of discourse segmentation, a task to detect EDU boundaries in a given text, and discourse parsing, a task to link the spans over the detected EDUs.
In discourse segmentation, Carlson et al. (2001) proposed a method using lexical information and syntactic parsing results. Many researchers (Fisher and Roark, 2007; Xuan Bach et al., 2012; Feng and Hirst, 2014b) utilized these clues as features in a classifier, although automatic parsing errors degraded segmentation performance. To avoid this problem, Wang et al. (2018b) used BiLSTM-CRF (Huang et al., 2015) to handle an input without these clues in an end-to-end manner. Lin et al. (2019) jointly performed discourse segmentation and sentence-level discourse parsing in their pointer-network-based model. They also introduced multi-task learning for both tasks and reported the state-of-the-art results for discourse segmentation and sentence-level discourse parsing in terms of F1 scores. Despite these achievements, there is still room for improvement in both tasks due to the scarcity of labeled data. It is therefore important to extract more of the latent information in the current dataset for further performance improvement.
With this motivation, in this research we propose a language model-based generative classifier (LMGC) as a reranker for both discourse segmentation and sentence-level discourse parsing. LMGC can jointly predict text and label probabilities by treating a text and its labels as a single sequence, as in Figure 1 (b). Therefore, unlike conventional methods, LMGC can exploit more information from labels by treating them as part of the input. Furthermore, LMGC can enhance label representations by embedding the description of each label defined in the annotation manual (Carlson and Marcu, 2001), which allows us to use a pre-trained language model such as MPNet (Song et al., 2020) effectively, since we already have representations for labels that were unseen in the pre-training step.
Experimental results on the RST-DT dataset (Carlson et al., 2002) show that LMGC can achieve state-of-the-art scores in both discourse segmentation and sentence-level discourse parsing. LMGC with our enhanced label embeddings achieves the best F1 score of 96.72 in discourse segmentation. Furthermore, in sentence-level discourse parsing, LMGC with our enhanced relation label embeddings achieves the best relation F1 scores of 84.69 with gold EDU boundaries and 81.18 with automatically segmented boundaries.

Related Work
Discourse segmentation is a fundamental task for building an RST discourse tree from a text. Carlson et al. (2001) proposed a method using lexical information and syntactic parsing results to detect EDU boundaries in a sentence. Fisher and Roark (2007), Xuan Bach et al. (2012), and Feng and Hirst (2014b) utilized these clues as features in a classifier, while Wang et al. (2018b) utilized BiLSTM-CRF (Huang et al., 2015) in an end-to-end manner to avoid the performance degradation caused by syntactic parsing errors. Sentence-level discourse parsing is also an important task for parsing an RST discourse tree, as used in many RST parsers (Joty et al., 2013; Feng and Hirst, 2014a; Joty et al., 2015; Wang et al., 2017; Kobayashi et al., 2020). Recently, Lin et al. (2019) jointly performed discourse segmentation and sentence-level discourse parsing with pointer-networks and achieved the state-of-the-art F1 scores in both tasks.
In spite of the performance improvements of these models, the restricted number of labeled RST discourse trees is still a problem. In the discourse segmentation and parsing tasks, most prior work is based on discriminative models, which learn a mapping from input texts to predicted labels. Thus, there remains room for improving model performance by also considering the mapping from predictable labels to input texts, so as to exploit more label information. To consider such information in a model, Mabona et al. (2019) introduced a generative model-based parser, RNNG (Dyer et al., 2016), for document-level RST discourse parsing. Different from our LMGC, this model unidirectionally predicts action sequences.
In this research, we build LMGC for the discourse segmentation and sentence-level discourse parsing tasks. LMGC utilizes a BERT-style bidirectional Transformer encoder (Devlin et al., 2019) to avoid the prediction bias caused by using a single decoding direction. Since LMGC is based on a generative model, it can jointly consider an input text and its predictable labels, and map the embeddings of both input tokens and labels onto the same space. Due to this characteristic, LMGC can effectively use label information by constructing label embeddings from the descriptions of the label definitions (Carlson and Marcu, 2001). Furthermore, recent strong pre-trained models such as MPNet (Song et al., 2020) are available for any input tokens in LMGC.

Base Models
Our LMGC reranks the results from a conventional discourse segmenter and parser, which can be constructed as discriminative models. In this section, we explain these base models and introduce our mathematical notations.

Discourse Segmenter
In discourse segmentation, given an input text x = {x_1, · · · , x_n}, where x_i is a word, a segmenter detects EDUs e = {e_1, · · · , e_m} from x. Since there is no overlap or gap between EDUs, discourse segmentation can be considered a kind of sequence labeling task, which assigns labels l = {l_1, · · · , l_n}, where each l_i ∈ {0, 1} indicates whether the word is the start of an EDU or not. By using a discriminative model, such as BiLSTM-CRF (Wang et al., 2018b) or pointer-networks (Lin et al., 2019), the probability of predicting EDUs from x can be modeled as P(l|x) or P(e|x). Because of its simple structure and extensibility, we choose BiLSTM-CRF as our base model for discourse segmentation. In BiLSTM-CRF, P(l|x) is formulated as follows:

$$P(l \mid x) = \frac{\exp\left(\sum_{t=1}^{n} \psi(l_{t-1}, l_t, h_t)\right)}{\sum_{l' \in Y} \exp\left(\sum_{t=1}^{n} \psi(l'_{t-1}, l'_t, h_t)\right)}, \quad (1)$$

where ψ is the potential function, which combines the emission score Wh_t + b for label l_t with a learned transition score between l_{t-1} and l_t; h_t is the hidden state at time step t, W is a weight matrix, b is a bias term, and Y is the set of possible label sequences. We pass the top-k Viterbi results of BiLSTM-CRF, scored by Eq. (1), to our LMGC, as described in Section 4.
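To make the boundary-labeling convention concrete, here is a minimal sketch (our own illustration; the helper name is hypothetical, not from the paper) that converts a 0/1 boundary label sequence into EDU word spans:

```python
def labels_to_edus(labels):
    """Convert boundary labels l (1 = word starts an EDU) into
    (start, end) word-index spans with exclusive end."""
    starts = [i for i, l in enumerate(labels) if l == 1]
    starts = [0] + [s for s in starts if s != 0]  # the first word always starts an EDU
    ends = starts[1:] + [len(labels)]
    return list(zip(starts, ends))

# "We've got a lot to do, he acknowledged."
# one EDU boundary before "he" (word index 6, 0-based)
print(labels_to_edus([1, 0, 0, 0, 0, 0, 1, 0]))  # [(0, 6), (6, 8)]
```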

Discourse Parser
In discourse parsing, given an input text x and its EDUs e, we build a binary tree p = {p_1, · · · , p_{2m−1}}, where m is the number of EDUs and each node p_i ∈ p has three kinds of labels: span s_i, nuclearity u_i, and relation r_i. The sequences of spans s and nuclearities u can be predicted simultaneously, as in 2-stage Parser (Wang et al., 2017), or the spans s can be predicted in advance of labeling nuclearity u and relation r, as in pointer-networks (Lin et al., 2019) and span-based Parser (Kobayashi et al., 2020). Because of its better performance, we choose 2-stage Parser as our base model for sentence-level discourse parsing. 2-stage Parser extracts several features and performs classification with SVMs in two stages. In the first stage, it identifies the span and nuclearity simultaneously to construct a tree with a transition-based system that has four types of actions: Shift, Reduce-NN, Reduce-NS, and Reduce-SN. In the second stage, for a given node p_i, r_i is predicted as the relation between the left and right child nodes of p_i by using features extracted from p_i and its children. In spite of its limited features, it achieves the best results compared with pointer-networks and span-based Parser. Since 2-stage Parser utilizes SVMs, we normalize the action scores and pass the top-k beam search results of 2-stage Parser to LMGC to perform discourse parsing.
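As an illustration of this transition system, the following minimal sketch (our own, not the authors' implementation) builds a nuclearity-labeled binary tree from a Shift/Reduce action sequence:

```python
# Shift moves the next EDU onto the stack; Reduce-NN/NS/SN merges the
# top two subtrees with the given nuclearity assignment.
def parse(edus, actions):
    """edus: list of EDU strings; actions: e.g. ['Shift', 'Shift', 'Reduce-NS'].
    Returns a nested tuple tree (nuclearity, left, right) with EDUs as leaves."""
    stack, queue = [], list(edus)
    for action in actions:
        if action == 'Shift':
            stack.append(queue.pop(0))
        else:                           # 'Reduce-NN' | 'Reduce-NS' | 'Reduce-SN'
            right, left = stack.pop(), stack.pop()
            nuclearity = action.split('-')[1]
            stack.append((nuclearity, left, right))
    assert len(stack) == 1 and not queue
    return stack[0]

tree = parse(["We've got a lot to do,", "he acknowledged."],
             ['Shift', 'Shift', 'Reduce-NS'])
print(tree)  # ('NS', "We've got a lot to do,", 'he acknowledged.')
```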

Language Model-based Generative Classifier (LMGC)
In this section, we introduce our generative classifier, LMGC, which utilizes a masked and permuted language model to compute sequence probabilities in both the discourse segmentation and sentence-level discourse parsing tasks. More specifically, as mentioned in Section 5, we can utilize LMGC in three tasks: (a) discourse segmentation, (b) sentence-level discourse parsing with gold segmentation, and (c) sentence-level discourse parsing with automatic segmentation. Figure 2 shows an overview of LMGC for the whole task (c). As shown in the figure, the prediction process in LMGC is as follows. We assume that, in task (c), discourse segmentation and sentence-level discourse parsing are performed in a pipeline manner with models trained for tasks (a) and (b).
1. Predict the top-k_s EDU segmentations {e_1, · · · , e_{k_s}} from a given sentence x with the base discourse segmenter, described in Section 3.1.

2. Compute the joint probability P(x, e_i) and select the best segmentation e from {e_1, · · · , e_{k_s}} with a language model, as we describe below.

3. Parse and rank the top-k_p trees {p_1, · · · , p_{k_p}} from x and the best segmentation e with the base discourse parser, described in Section 3.2.
4. Compute the joint probability P(x, e, p_i) and select the best tree p from {p_1, · · · , p_{k_p}} with the language model, as in Step 2.

In task (a), we apply Step 2 to predict the best segmentation after Step 1. In task (b), we skip Steps 1 and 2 and apply just Steps 3 and 4 to the gold segmentation to yield the best parse tree. A minimal sketch of this pipeline follows.
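The sketch below summarizes the four-step pipeline; the `segmenter`, `parser`, and `pll_score` interfaces are hypothetical stand-ins for the base models and the language model scorer described later, not the released code:

```python
# Base models propose candidates; the language model reranks them by
# length-normalized PLL (see Section 4.2).
def rerank(candidates, pll_score):
    """Pick the candidate sequence with the highest PLL(z)/len(z)."""
    return max(candidates, key=lambda z: pll_score(z) / len(z))

def parse_sentence(x, segmenter, parser, pll_score, k_s=5, k_p=5):
    segmentations = segmenter.top_k(x, k_s)          # Step 1
    e = rerank(segmentations, pll_score)             # Step 2
    trees = parser.top_k(x, e, k_p)                  # Step 3
    return rerank(trees, pll_score)                  # Step 4
```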

Tree Representations
To calculate joint probabilities for a discourse tree with a language model, we need to represent the tree in a linear form, as in Figure 1 (b). Since there are several predictable label sets in the discourse segmentation and parsing tasks, as shown in Figure 3, we prepare linearized forms for each label set. In discourse segmentation, we can consider the joint probability P(x, e) for a sequence by inserting a symbol, [EDU], at each EDU boundary (Figure 3 (a)). In discourse parsing, a discourse tree is represented as a sequence with several kinds of label sets: span labels s, nuclearity labels u including span labels, and relation labels r including span and nuclearity labels (Figures 3 (b)-(d)). To investigate the effectiveness of each label set in the reranking step, we consider P(x, e, s), P(x, e, u), and P(x, e, r) for each label set to represent P(x, e, p) in this paper. To build a sequence, we combine each label in a tree with brackets to mark the boundaries of the label. For example, "(N" and ")N" stand for the start and end of a nucleus EDU. For a node p_i of the tree, r_i describes the relation between its children, so r_i of a leaf node is "Null". When the children of p_i are a nucleus and a satellite, we assign the label "Span" to the nucleus child of p_i and the label r_i to the satellite child of p_i, respectively. When the children of p_i are both nuclei, we assign the label r_i to both children of p_i.
For simpler illustration, in Figure 1 (b), we show the linearized discourse tree only with nuclearity and relation labels, since the nuclearity labels can also show span and EDU boundary labels. "Null" labels for leaf nodes are also omitted in the figure.
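As an illustration of this linearization, here is a minimal sketch (our own; the function name is hypothetical) that renders a nuclearity-labeled tree in the style of Figure 1 (b):

```python
# "(N ... )N" marks a nucleus span, "(S ... )S" a satellite span,
# with EDU text at the leaves.
def linearize(node):
    if isinstance(node, str):          # leaf: an EDU's text
        return node
    nuclearity, left, right = node     # e.g. 'NS' from Reduce-NS
    l, r = nuclearity[0], nuclearity[1]
    return (f"({l} " + linearize(left) + f" ){l} "
            f"({r} " + linearize(right) + f" ){r}")

tree = ('NS', "We've got a lot to do,", 'he acknowledged.')
print(linearize(tree))
# (N We've got a lot to do, )N (S he acknowledged. )S
```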

Joint Probabilities
To calculate the joint probabilities of the last subsection with a language model, we consider the probability P(z) for a sequence z = (z_1, · · · , z_a), which corresponds to the probabilities of the sequential representations P(x, e), P(x, e, s), P(x, e, u), and P(x, e, r).
According to Song et al. (2020), masked and permuted language modeling (MPNet) combines the advantages of masked language modeling and permuted language modeling while overcoming their issues. Compared with BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019), MPNet considers more information about tokens and positions and achieves better results on several downstream tasks (GLUE, SQuAD, etc.). Taking its better performance into account, we choose pre-trained MPNet (Song et al., 2020) as our language model. Because considering all possible inter-dependencies between the tokens z_t is intractable, we follow the decomposition of pseudo-log-likelihood scores (PLL) (Salazar et al., 2020) in the model. Thus, we decompose and calculate the logarithm of P(z) as follows:

$$\mathrm{PLL}(z; \theta) = \sum_{t=1}^{a} \log P(z_t \mid z_{<t}, z_{>t}; \theta), \quad (2)$$

where z_{<t} is the preceding sub-sequence (z_1, · · · , z_{t−1}) of z, z_{>t} is the following sub-sequence (z_{t+1}, · · · , z_a), and each P(z_t | z_{<t}, z_{>t}; θ) is computed by two-stream self-attention (Yang et al., 2019). In inference, we select z based on (1/a) PLL(z; θ). The model converts z into continuous vectors w = {w_1, · · · , w_a} through the embedding layer. Multi-head attention layers further transform the vectors to predict each z_t in the softmax layer.
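A minimal PLL-scoring sketch in the style of Salazar et al. (2020) is shown below, using a Hugging Face masked LM checkpoint as a stand-in for the model used here; this loop is our own illustration, not the authors' released code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/mpnet-base").eval()

def pll(text):
    """Sum of log P(z_t | z_{<t}, z_{>t}): mask one position at a time."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for t in range(1, len(ids) - 1):             # skip special tokens
        masked = ids.clone()
        masked[t] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, t]
        total += torch.log_softmax(logits, dim=-1)[ids[t]].item()
    return total

score = pll("(N We've got a lot to do, )N (S he acknowledged. )S")
```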
Since pre-trained MPNet does not consider EDU, span, nuclearity, and relation labels in the pre-training step, we need to construct the vectors w for these labels from the pre-trained parameters to enhance prediction performance. We describe the details of this method in the next subsection.

Label Embeddings
In LMGC, we embed input text tokens and labels in the same vector space (Wang et al., 2018a) of the embedding layer. Under this setting, to deal with labels unseen by the pre-trained model, we compute the label embeddings by utilizing token embeddings from the pre-trained model.
We combine the input text with four kinds of labels, EDU, span, nuclearity, and relation labels, which were defined and clearly described in the annotation manual (Carlson and Marcu, 2001) (see Appendix B for the descriptions). Taking the descriptions of the labels into account as additional information, we adopt two different methods, Average and Concatenate, for representing the label embeddings.

Average: We average the embeddings of tokens that appear in the definition of a label and assign the averaged embedding to the label.

Concatenate: We concatenate a label name with its definition and insert the concatenated text at the end of sequence z (the concatenated text of the label name and its definition is not masked during training), so that the label embedding can be captured by the self-attention mechanism (Vaswani et al., 2017). Note that we do not try this in the parsing task, because the length of the sequence grows in proportion to the number of labels, which causes a shortage of memory space.

Figure 3: Example joint representations of an input text and labels for the sentence "We've got a lot to do," he acknowledged; panel (a) shows the sentence with EDU boundary labels. e_i represents the corresponding EDU, and "_" is whitespace.
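A minimal sketch of the Average method (our own illustration; the function and the tensor shapes are assumptions, not the released code):

```python
# A label's embedding is the mean of the pre-trained token embeddings
# of the words in its manual definition.
import torch

def average_label_embedding(definition, tokenizer, embedding_matrix):
    """definition: label description text from the annotation manual;
    embedding_matrix: the pre-trained input embedding table [vocab, dim]."""
    token_ids = tokenizer(definition, add_special_tokens=False)["input_ids"]
    return embedding_matrix[torch.tensor(token_ids)].mean(dim=0)

# e.g. for ")N": "the end of a more salient or essential piece of information"
```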

Objective Function
Because the search space of sequences for a text and its labels is exponentially large, instead of considering all possible sequences Z(x) for x, we use Z′(x), a subset of sequences based on the top-k results from the base model. We denote z_g ∈ Z(x) as the correct label sequence of x. To keep the pre-trained information in MPNet, we continue masking and permutation when training the model parameters θ. Assuming that O_a lists all permutations of the set {1, 2, · · · , a}, the number of elements in O_a satisfies |O_a| = a!. For z ∈ Z′(x) ∪ {z_g}, we train the model parameters θ of LMGC by maximizing the following expectation over all permutations:

$$\mathbb{E}_{o \in O_a}\left[\sum_{t=c+1}^{a} I_z \log P\left(z_{o_t} \mid z_{o_{<t}}; \theta\right)\right], \quad (3)$$

where I_z is an indicator function over the candidates that takes a different value for the gold sequence z_g than for the other candidates, and c, denoting the number of non-predicted tokens z_{o≤c}, is set manually.
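To illustrate the mechanics behind Eq. (3), the sketch below (our own; it only shows how a sampled permutation splits a sequence into context and predicted positions, not the full loss) follows the masking-and-permutation scheme:

```python
import random

def permuted_training_positions(seq_len, c):
    """Return (context_positions, predicted_positions) for one sampled o ∈ O_a."""
    order = list(range(seq_len))
    random.shuffle(order)              # one sampled permutation o
    return order[:c], order[c:]        # first c tokens stay visible

ctx, pred = permuted_training_positions(seq_len=12, c=9)
# tokens at `pred` are masked and predicted given the tokens at `ctx`
```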

Experiments
In this section, we present our experiments on three tasks: (a) discourse segmentation, (b) sentence-level discourse parsing with gold segmentation, and (c) sentence-level discourse parsing with automatic segmentation.

Datasets
Following previous studies (Wang et al., 2017, 2018b; Lin et al., 2019), we used the RST Discourse Treebank (RST-DT) corpus (Carlson et al., 2002) as our dataset. This corpus contains 347 and 38 documents for the training and test datasets, respectively. We divided the training dataset into two parts, following the module RSTFinder (Heilman and Sagae, 2015), where 307 documents were used to train models and the remaining 40 documents were used as the validation dataset. We split the documents into sentences while ignoring footnote sentences, as in Joty et al. (2012). Two problematic cases can occur for the split sentences: (1) the sentence consists of exactly one EDU and so has no tree structure; (2) the tree structure of the sentence crosses into other sentences. Following the setting of Lin et al. (2019), we did not filter any sentences in task (a). In task (b), we filtered sentences of both cases. In task (c), we filtered sentences of case (2). Table 1 shows the number of available sentences for the three tasks.

Evaluation Metrics
In task (a), we evaluated the segmentation with micro-averaged precision, recall, and F1 score with respect to the start position of each EDU. The position at the beginning of a sentence was ignored.

In task (b), we evaluated the parsing with micro-averaged F1 scores with respect to span, nuclearity, and relation. In task (c), parsing with automatic segmentation, we evaluated both the segmentation and the parsing with micro-averaged F1 scores. We used paired bootstrap resampling (Koehn, 2004) for the significance tests in all tasks when comparing two systems.
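As a concrete reference for the segmentation metric, here is a minimal sketch (our own illustration) of micro-averaged precision, recall, and F1 over EDU start positions, ignoring the sentence-initial position:

```python
def boundary_prf(gold_starts, pred_starts):
    """gold_starts/pred_starts: sets of EDU start word indices."""
    gold = {s for s in gold_starts if s != 0}   # sentence-initial position ignored
    pred = {s for s in pred_starts if s != 0}
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(boundary_prf(gold_starts={0, 6}, pred_starts={0, 6}))  # (1.0, 1.0, 1.0)
```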

Compared Methods
As our proposed methods, we used LMGC_e, LMGC_s, LMGC_u, and LMGC_r, which respectively model the probabilities P(x, e), P(x, e, s), P(x, e, u), and P(x, e, r) with initialized label embeddings. We denote LMGC with Average and Concatenate label embeddings as Enhance and Extend, respectively.
We used the base discourse segmenter and parser described in Section 3 as our baselines. We reproduced the base discourse segmenter BiLSTM-CRF (Wang et al., 2018b). Because BiLSTM-CRF adopted the hidden states of ELMo (Peters et al., 2018) as word embeddings, for fairness we also tried the last hidden state of MPNet as the word embeddings for BiLSTM-CRF. We retrained the segmenter in five runs, and the experimental results are shown in Appendix C. The publicly shared BiLSTM-CRF by Wang et al. (2018b) is our base segmenter in the following experiments.
Furthermore, to compare LMGC with a unidirectional generative model (Mabona et al., 2019), we constructed another baseline that utilizes a GPT-2-based (Radford et al., 2019) reranker. This method follows a unidirectional language model-based generative parser (Choe and Charniak, 2016) and considers the top-k results from the base model with an add-1 version of the infinilog loss (Ding et al., 2020) during training. We denote this baseline as GPT2LM hereafter. GPT2LM models P(x, e) for task (a) and P(x, e, r) for tasks (b) and (c), respectively. Both LMGC and GPT2LM are ensembles of five models with different random seeds. See Appendix D for a complete list of hyperparameter settings.

Number of Candidates
As described in Section 4, LMGC requires the parameters k_s and k_p for the number of candidates in the steps of the different tasks. We tuned k_s and k_p based on the performance on the validation dataset. In task (a), k_s was set to 20 and 5 for training and prediction, respectively. In task (b), k_p was set to 20 and 5 for training and prediction, respectively. In task (c), k_s and k_p were both set to 5 for prediction. The parameters were tuned similarly for GPT2LM on the validation dataset. We list all of them in Appendix E.

Table 3: Results for the sentence-level discourse parsing task with gold segmentation. * indicates the reported score by Lin et al. (2019). The best score in each metric among the models is indicated in bold. † and ‡ indicate that the score is significantly superior to the base parser with p-values < 0.01 and < 0.05, respectively.

Discourse Segmentation
Table 2 shows the results for the discourse segmentation task. Using Average label embeddings is more helpful than using Concatenate label embeddings for LMGC_e. Enhance_e achieved the state-of-the-art F1 score of 96.72, which outperformed both the base segmenter and the pointer-networks.

Sentence-level Discourse Parsing
Gold Segmentation: Table 3 and Figures 4 and 5 show the experimental results for the sentence-level discourse parsing task with gold segmentation. In Table 3, LMGC_u achieved the highest span and nuclearity F1 scores of 98.31 and 94.00, respectively. Enhance_r achieved the state-of-the-art relation F1 score of 84.69, which is significantly superior to the base parser. Although using Average label embeddings improved LMGC_r, it provided no or only limited improvement for LMGC_u and LMGC_s.

Figure 5: Confusion matrix for Enhance_r in the sentence-level discourse parsing task with gold segmentation. We show the ratio of the number of instances with the predicted label (for a column) to the number of instances with the gold label (for a row) in the corresponding cell.

We guess that this difference is caused by the number of distinct labels among span, nuclearity, and relation. The performance of GPT2LM_r is even worse than that of the base parser. We think this is because we added the relation labels to the vocabulary of GPT-2 and resized the pre-trained word embeddings. Figure 4 shows a comparison between the base parser and Enhance_r with respect to each relation label. For most relation labels, Enhance_r outperformed 2-stage Parser, except for the labels Explanation, Evaluation, and Topic-Comment. 2-stage Parser achieved an F1 score of 17.14 for the label Temporal, while Enhance_r achieved an F1 score of 44.44 by reranking the parsing results from 2-stage Parser. Similarly large improvements with Enhance_r can also be found for labels such as Contrast, Background, and Cause. Clearly, Enhance_r tends to improve the performance for labels whose training data is limited. Figure 5 shows a confusion matrix of Enhance_r for each relation label. It shows that the relation labels Comparison, Cause, and Temporal were often wrongly predicted as Contrast, Joint, and Joint or Background, respectively, by Enhance_r, even though these labels each have at least 100 training instances. We guess this might be due to similarities between those labels.
Using t-SNE plots (Van der Maaten and Hinton, 2008), we visualize the trained relation label embeddings of LMGC_r and Enhance_r. Figures 6a and 6b show the results. Figure 6a shows a clearer diagonal that divides the labels with parenthesis "(" from the ones with ")", while Figure 6b shows more distinct divisions between labels.
Table 4: Results for the sentence-level discourse parsing task with automatic segmentation. * indicates the reported score by Lin et al. (2019). The best score in each metric among the models in each block is indicated in bold. For a fair comparison of sentence-level discourse parsing, we used the discourse segmentation results of Enhance_e as the input of the discourse parsing stage for all models. † and ‡ indicate that the score is significantly superior to the base parser with p-values < 0.01 and < 0.05, respectively.

Automatic Segmentation: Table 4 shows the experimental results for the sentence-level discourse parsing task with automatic segmentation. The second and third blocks in the table show the results for the first and second stages, discourse segmentation and sentence-level discourse parsing, respectively. Enhance_r achieved the highest relation F1 score of 81.18, a significant improvement of 2.43 points over the base parser. Enhance_s and LMGC_u achieved the highest span and nuclearity F1 scores of 94.00 and 89.90, respectively. Since LMGC_* and Enhance_* were the models trained in task (b), and Enhance_e achieved an F1 score of 96.79 in discourse segmentation, it is not surprising that the tendency of these results is similar to that of sentence-level discourse parsing with gold segmentation.

Conclusion
In this research, we proposed a language model-based generative classifier, LMGC. Given the top-k discourse segmentations or parse trees from a base model, LMGC, as a reranker, achieved state-of-the-art performance in both discourse segmentation and sentence-level discourse parsing. The experimental results also showed the potential of constructing label embeddings from token embeddings by using the label descriptions in the annotation manual. In the future, we plan to apply LMGC to other diverse classification tasks.

A Experimental Results of LMGC with Tree
Since the raw s-expression-style tree is longer than our joint representations with span, nuclearity, and relation labels, we transformed the raw tree into a sequence as Figure 7 shows, where the nuclearity and relation labels are connected by colons.
To construct the label embeddings for P(x, e, p), we combined the descriptions of the nuclearity and relation labels (see the descriptions in Appendix B) and assigned the combination to the corresponding node. For example, the description of "(Attribution:S" is the start of a supporting or background piece of information attribution, attribution represents both direct and indirect instances of reported speech.

Figure 7: Example joint representation of an input text with all tree labels for the sentence "We've got a lot to do," he acknowledged. e_i represents the corresponding EDU, and "_" is whitespace.

LMGC_p models the joint probability P(x, e, p) with initialized label embeddings. The experimental results of LMGC_p and Enhance_p for the sentence-level discourse parsing task with gold segmentation are shown in Table 5. LMGC_p and Enhance_p are ensembles of five models with different random seeds, although the training loss of Enhance_p in two of the five models did not decrease.

Table 5: Results of LMGC_p and Enhance_p in the sentence-level discourse parsing task with gold segmentation (columns: Model, Span, Nuclearity, Relation).

B Label Descriptions
We list our extracted label descriptions from Carlson and Marcu (2001) in Table 6. For parsing symbols with brackets "(" and ")", such as "(N" and ")N", we inserted the position phrases the start of and the end of at the beginning of their label definitions. Thus, the description of ")N" is the end of a more salient or essential piece of information.
C Experimental Results of the Reproduced Base Model

Table 7 shows the experimental results of BiLSTM-CRF in discourse segmentation, where the results of our reproduced BiLSTM-CRF are averaged over five runs. Table 8 shows the experimental results of different parsers in the sentence-level discourse parsing task with gold segmentation.

D Hyperparameters
For LMGC, we used the source code shared in the public GitHub repository of Song et al. (2020). We used the uploaded pre-trained MPNet and the same setup as illustrated in Table 9. 15% of the tokens were selected as predicted tokens and masked with the 8:1:1 replacement strategy. The relative positional embedding mechanism (Shaw et al., 2018) was utilized. Since the vocabulary we used is the same as that of BERT (Devlin et al., 2019), we used the symbol [SEP] to represent [EDU] and the symbols [unused#], starting from 0, to represent parsing labels such as "(N" and "(Attribution".

Table 6 (excerpt; remaining label descriptions, displaced here from Appendix B):
Elaboration: elaboration; provides specific information or details to help define a very general concept.
Enablement: enablement; presents an action to increase the chances of the unrealized situation being realized.
Evaluation: evaluation; interpretation, conclusion or comment.
Explanation: evidence; explanation or reason.
Joint: list; contains some sort of parallel structure or similar fashion between the units.
Manner-Means: explaining or specifying a method, mechanism, instrument, channel or conduit for accomplishing some goal.
Topic-Comment: problem solution, question answer, statement response, topic comment or rhetorical question.
Summary: summary or restatement.
Temporal: situations with temporal order; before, after or at the same time.
Topic-Change: topic change.
Textual-Organization: links that are marked by schemata labels.
Same-Unit: links between two non-adjacent parts when separated by an intervening relative clause or parenthetical.

Table 7: Performances of BiLSTM-CRF (Wang et al., 2018b) in the discourse segmentation task. The best score in each metric among the models is indicated in bold. * indicates the reported score by Lin et al. (2019). Shared is the publicly shared model by Wang et al. (2018b). Reproduced (ELMo) and Reproduced (MPNet) are our reproduced models with different word embeddings.

For GPT2LM, we used the source code shared in the public GitHub repository of Ott et al. (2019). Following the steps in Choe and Charniak (2016), we utilized the chain-rule decomposition of Eq. (5) (Jurafsky, 2000) to compute the joint distribution:

$$P(z) = \prod_{t=1}^{a} P(z_t \mid z_1, \ldots, z_{t-1}), \quad (5)$$

where P(z_t | z_1, . . . , z_{t−1}) was computed by GPT-2 (Radford et al., 2019). In inference, we selected z based on (1/a) log P(z). An add-1 version of the infinilog loss (Ding et al., 2020) was utilized for training GPT2LM, where

$$f(z) = \frac{\exp\left(\frac{1}{a}\log P(z)\right)}{\sum_{z' \in Z'(x)} \exp\left(\frac{1}{a}\log P(z')\right)}.$$
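A minimal sketch of the length-normalized candidate softmax f(z) above (our own illustration; `log_p` is a hypothetical scorer returning log P(z) under GPT-2):

```python
import math

def candidate_softmax(candidates, log_p):
    """candidates: tuples of tokens; returns {z: f(z)} with
    f(z) ∝ exp(log P(z) / len(z))."""
    scores = {z: log_p(z) / len(z) for z in candidates}
    m = max(scores.values())                      # for numerical stability
    exps = {z: math.exp(s - m) for z, s in scores.items()}
    total = sum(exps.values())
    return {z: v / total for z, v in exps.items()}
```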

We used the uploaded pre-trained "gpt2" model (Wolf et al., 2020) and the same setup as illustrated in Table 10. We used the symbol "=====" in the vocabulary to represent the symbol [EDU]. Because the vocabulary of GPT-2 has no available symbol for representing an unseen symbol, we added <pad> and our relation symbols to the vocabulary of GPT-2 and resized the pre-trained word embeddings.

E Setting of Candidates

Table 11 shows the setting of candidates for the different tasks. As described in Section 4.4, we perform data augmentation by using the additional top-k results generated by a base method, so a larger k during training is expected to bring more improvement for LMGC. However, a larger k during the prediction step introduces more candidates and may make the prediction more difficult. Taking this into consideration, we tuned k_s and k_p for training and prediction separately based on the performance on the validation dataset.

In task (a), we used the Viterbi top-k algorithm for the base segmenter to select the top-k_s segmentations. We tuned k_s ∈ {0, 10, 20} for training, while k_s for prediction was fixed at 5. Note that we used only gold segmentations for training when k_s was set to 0. Table 12 shows the experimental results, where both LMGC_e and GPT2LM_e are ensembles of five models. We then tuned k_s ∈ {5, 10, 20} for prediction by using the LMGC_e and GPT2LM_e trained with top-20 candidates; Table 13 shows the results.

In task (b), we utilized beam search in each stage of the base parser, and after the two stages we computed the perplexity to keep the top-k_p parsings. We tuned k_p ∈ {0, 10, 20} for training, while k_p for prediction was fixed at 5. Note that we used only gold parsings for training when k_p was set to 0. Table 14 shows the experimental results, where both LMGC_r and GPT2LM_r are ensembles of five models. We then tuned k_p ∈ {5, 10, 20} for prediction by using the LMGC_r and GPT2LM_r trained with top-20 candidates; Table 15 shows the results. In task (c), as in task (a), we tuned k_s ∈ {5, 10, 20} for predicting discourse segmentation by using the LMGC_e and GPT2LM_e trained with top-20 candidates for task (a); Table 16 shows the results. We utilized LMGC_e to select the best segmentation from the top-5 segmentations for the following discourse parsing. Then, as in task (b), we tuned k_p ∈ {5, 10, 20} for predicting discourse parsing by using the LMGC_r and GPT2LM_r trained with top-20 candidates for task (b); Table 17 shows the results.
In tasks (b) and (c), LMGC_s and Enhance_s cannot distinguish candidates with the same span labels but different nuclearity or relation labels, and LMGC_u and Enhance_u cannot distinguish candidates with the same nuclearity labels but different relation labels. Under this condition, the indistinguishable parsings are ranked by the base parser. In task (b), for training data with span or nuclearity labels, we used beam sizes of 20 and 1 in the first and second stages of the base parser, respectively.

Table 17: Results of tuning k_p for prediction in task (c). The best score in each metric among different k_p for prediction is indicated in bold.