Discourse Structure Extraction from Pre-Trained and Fine-Tuned Language Models in Dialogues

Discourse processing suffers from data sparsity, especially for dialogues. As a result, we explore approaches to infer latent discourse structures for dialogues, based on attention matrices from Pre-trained Language Models (PLMs). We investigate multiple auxiliary tasks for fine-tuning and show that the dialogue-tailored Sentence Ordering task performs best. To locate and exploit discourse information in PLMs, we propose an unsupervised and a semi-supervised method. Our proposals thereby achieve encouraging results on the STAC corpus, with F1 scores of 57.2 and 59.3 for the unsupervised and semi-supervised methods, respectively. When restricted to projective trees, our scores improved to 63.3 and 68.1.


Introduction
In recent years, the availability of accurate transcription methods and the increase in online communication have led to a vast rise in dialogue data, necessitating the development of automatic analysis systems. For example, summarization of meetings or exchanges with customer service agents could be used to enhance collaborations or analyze customers' issues (Feng et al., 2021); machine reading comprehension in the form of question-answering could improve dialogue agents' performance and help knowledge graph construction (He et al., 2021; Li et al., 2021). However, simple surface-level features are oftentimes not sufficient to extract valuable information from conversations (Qin et al., 2017). Rather, we need to understand the semantic and pragmatic relationships organizing the dialogue, for example through the use of discourse information.
Along this line, several discourse frameworks have been proposed, underlying a variety of annotation projects. For dialogues, data has been primarily annotated within the Segmented Discourse Representation Theory (SDRT) (Asher et al., 2003). Discourse structures are thereby represented as dependency graphs with arcs linking spans of text and labeled with semantico-pragmatic relations (e.g. Acknowledgment (Ack) or Question-Answer Pair (QAP)). Figure 1 shows an example from the Strategic Conversations corpus (STAC). Discourse processing refers to the retrieval of the inherent structure of coherent text, and is often separated into three tasks: EDU segmentation, structure building (or attachment), and relation prediction. In this work, we focus on the automatic extraction of (naked) structures without discourse relations. This serves as a first critical step in creating a full discourse parser. It is important to note that naked structures have already been shown to be valuable features for specific tasks. Louis et al. (2010) found them to be the most reliable indicator of importance in content selection. Xiao et al. (2020) on summarization and Jia et al. (2020) on thread extraction also demonstrated the advantages of naked structures. Data sparsity has always been an issue for discourse parsing, both in monologues and dialogues: the largest and most commonly used corpus annotated under the Rhetorical Structure Theory, the RST-DT (Carlson et al., 2001), contains 21,789 discourse units. In comparison, the largest dialogue discourse dataset (STAC) only contains 10,678 units. Restricted in domain and size, the performance of supervised discourse parsers is still low, especially for dialogues, with at best 73.8% F1 for the naked structure on STAC (Wang et al., 2021). As a result, several transfer learning approaches have been proposed, mainly focused on monologues.
Previous work demonstrates that discourse information can be extracted from auxiliary tasks like sentiment analysis (Huber and Carenini, 2020) and summarization (Xiao et al., 2021), or represented in language models (Koto et al., 2021) and further enhanced by fine-tuning tasks (Huber and Carenini, 2022). Inspired by the latter approaches, we pioneer addressing this issue for dialogues, introducing effective semi-supervised and unsupervised strategies to uncover discourse information in large pre-trained language models (PLMs). We find, however, that the monologue-inspired fine-tuning tasks do not perform well when applied to dialogues. Dialogues are generally less structured, interspersed with more informal linguistic usage (Sacks et al., 1978), and have structural particularities. Thus, we propose a new Sentence Ordering (SO) fine-tuning task tailored to dialogues. Building on the proposal in Barzilay and Lapata (2008), we add crucial, dialogue-specific extensions with several novel shuffling strategies to enhance the pair-wise, inter-speech-block, and inter-speaker discourse information in PLMs, and demonstrate its effectiveness over other fine-tuning tasks.
In addition, a key issue in using PLMs to extract document-level discourse information is how to choose the best attention head. We hypothesize that the location of discourse information in the network may vary, possibly influenced by the length and complexity of the dialogues. Therefore, we investigate methods that enable us to evaluate each attention head individually, in both unsupervised and semi-supervised settings. We introduce a new metric called "Dependency Attention Support" (DAS), which measures the level of support for the dependency trees generated by a specific self-attention head, allowing us to select the optimal head without any need for supervision. We also propose a semi-supervised approach where a small validation set is used to choose the best head.
Experimental results on the STAC dataset reveal that our unsupervised and semi-supervised methods outperform the strong LAST baseline (F1 56.8%, Sec. 4), delivering substantial gains on the complete STAC dataset (F1 59.3%, Sec. 5.2) and showing further improvements on the tree-structured subset (F1 68.1%, Sec. 6.3).
To summarize, our contributions in this work are: (1) Discourse information detection in pre-trained and sentence ordering fine-tuned LMs; (2) Unsupervised and semi-supervised methods for discourse structure extraction from the attention matrices in PLMs; (3) Detailed quantitative and qualitative analysis of the extracted discourse structures.

Related Work
Discourse structures for complete documents have been mainly annotated within the Segmented Discourse Representation Theory (SDRT) (Asher et al., 2003) or the Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), with the latter leading to the largest corpora and many discourse parsers for monologues, while SDRT is the main theory for dialogue corpora, i.e., STAC and Molweni (Li et al., 2020). In SDRT, discourse structures are dependency graphs with possibly non-projective links (see Figure 1), compared to constituent tree structures in RST. Early approaches to discourse parsing on STAC used varied decoding strategies, such as the Maximum Spanning Tree algorithm (Muller et al., 2012; Li et al., 2014) or Integer Linear Programming (Perret et al., 2016). Shi and Huang (2019) first proposed a neural architecture based on hierarchical Gated Recurrent Units (GRUs) and reported 73.2% F1 on STAC for naked structures. Recently, Wang et al. (2021) adopted Graph Neural Networks (GNNs) and reported marginal improvements on the same test set (73.8% F1).
Data sparsity being the issue, a new trend towards semi-supervised and unsupervised discourse parsing has emerged, almost exclusively for monologues. Huber and Carenini (2019, 2020) leveraged sentiment information and showed promising results in cross-domain settings with the annotation of a silver-standard labeled corpus. Xiao et al. (2021) extracted discourse trees from neural summarizers and confirmed the existence of discourse information in self-attention matrices. Another line of work proposed to enlarge training data with a combination of several parsing models, as done in Jiang et al. (2016); Kobayashi et al. (2021); Nishida and Matsumoto (2022). In a fully unsupervised setting, Kobayashi et al. (2019) used similarity and dissimilarity scores for discourse tree creation, a method that cannot be directly used for discourse graphs though. As for dialogues, transfer learning approaches are rare. Badene et al. (2019a,b) investigated a weak supervision paradigm where expert-composed heuristics, combined with a generative model, are applied to unseen data. Their method, however, requires domain-dependent annotation and a relatively large validation set for rule verification. Another study by Liu and Chen (2021) focused on cross-domain transfer using STAC (conversations during an online game) and Molweni (Ubuntu forum chat logs). They applied simple adaptation strategies (mainly lexical information) on a SOTA discourse parser and showed improvement compared to bare transfer: trained on Molweni and tested on STAC, F1 increased from 42.5% to 50.5%. Yet, their model failed to surpass simple baselines. Very recently, Nishida and Matsumoto (2022) investigated bootstrapping methods to adapt BERT-based parsers to out-of-domain data with some success. In comparison to all this previous work, to the best of our knowledge, we are the first to propose a fully unsupervised method and its extension to a semi-supervised setting.
As pre-trained language models such as BERT (Devlin et al., 2019), BART (Lewis et al., 2020) or GPT-2 (Radford et al., 2019) are becoming dominant in the field, BERTology research has gained much attention as an attempt to understand what kind of information these models capture. Probing tasks, for instance, can provide fine-grained analysis, but most of them only focus on sentence-level syntactic tasks (Jawahar et al., 2019; Hewitt and Manning, 2019; Mareček and Rosa, 2019; Kim et al., 2019; Jiang et al., 2020). As for discourse, Zhu et al. (2020) and Koto et al. (2021) applied probing tasks and showed that BERT and BART encoders capture more discourse information than other models, like GPT-2. Very recently, Huber and Carenini (2022) introduced a novel way to encode long documents and explored the effect of different fine-tuning tasks on PLMs, confirming that both pre-trained and fine-tuned PLMs can capture discourse information. Inspired by these studies on monologues, we explore the use of PLMs to extract discourse structures in dialogues.
Method: from Attention to Discourse

Problem Formulation and Simplifications
Given a dialogue D with n Elementary Discourse Units (EDUs) {e 1 , e 2 , e 3 , ..., e n }, which are the minimal spans of text (mostly clauses, at most a sentence) to be linked by discourse relations, the goal is to extract a Directed Acyclic Graph (DAG) connecting the n EDUs that best represents its SDRT discourse structure from attention matrices in PLMs (see Figure 2 for an overview of the process). In our proposal, we make a few simplifications, partially adopted from previous work. We do not deal with SDRT Complex Discourse Units (CDUs), following Muller et al. (2012) and Afantenos et al. (2015), and do not tackle relation type assignment. Furthermore, similar to Shi and Huang (2019), our solution can only generate discourse trees. Extending our algorithm to non-projective trees (≈ 6% of edges are non-projective in tree-like examples) and graphs (≈ 5% of nodes have multiple incoming arcs) is left as future work.

Which kinds of PLMs to use?
We explore both vanilla and fine-tuned PLMs, as they were both shown to contain discourse information for monologues (Huber and Carenini, 2022).
Pre-Trained Models: We select BART (Lewis et al., 2020), not only because its encoder has been shown to effectively capture discourse information, but also because it dominated other alternatives in preliminary experiments, including DialoGPT  and DialogLM (Zhong et al., 2022) -language models pre-trained with conversational data 2 . Fine-Tuning Tasks: We fine-tune BART on three discourse-related tasks: (1) Summarization: we use BART fine-tuned on the popular CNN-DailyMail (CNN-DM) news corpus (Nallapati et al., 2016), as well as on the SAMSum dialogue corpus (Gliwa et al., 2019).
(3) Sentence Ordering: we fine-tune BART on the Sentence Ordering task, i.e., reordering a set of shuffled sentences into their original order. We use an in-domain and an out-of-domain dialogue dataset (Sec. 4) for this task. Since fully random shuffling showed very limited improvements, we considered additional strategies to support a more gradual training tailored to dialogues. Specifically, as shown in Figure 3, we explore: (a) partial-shuf: randomly picking 3 utterances in a dialogue (or 2 utterances if the dialogue is shorter than 4) and shuffling them while maintaining the surrounding context. (b) minimal-pair-shuf: shuffling minimal pairs, each comprising a pair of speech turns from 2 different speakers with at least 2 utterances. A speech turn marks the start of a new speaker's turn in the dialogue. (c) block-shuf: shuffling blocks containing multiple speech turns. We divide one dialogue into [2, 5] blocks based on the number of utterances and shuffle between blocks. (d) speaker-turn-shuf: grouping all speech productions of one speaker together. The sorting task then consists of ordering the speech turns of the different speakers. We evenly combine all permutations mentioned above to create our mixed-shuf dataset and conduct the SO task.
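The partial-shuf strategy above can be sketched as follows; this is a minimal illustrative re-implementation, not the authors' code, and the function name is our own:

```python
import random

def partial_shuf(utterances, seed=None):
    """Shuffle a small number of utterances while keeping the
    surrounding dialogue context fixed (sketch of 'partial-shuf')."""
    rng = random.Random(seed)
    n = len(utterances)
    # Pick 3 utterances, or 2 if the dialogue is shorter than 4.
    k = 3 if n >= 4 else 2
    idx = sorted(rng.sample(range(n), k))
    picked = [utterances[i] for i in idx]
    rng.shuffle(picked)
    out = list(utterances)
    for i, u in zip(idx, picked):
        out[i] = u
    return out

dialogue = ["hi", "hello", "any trades?", "sure", "wood for sheep?"]
shuffled = partial_shuf(dialogue, seed=0)
```

The other strategies differ only in the unit being permuted (speech-turn pairs, blocks, or per-speaker groups).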
Choice of Attention Matrix: The BART model contains three kinds of attention matrices: encoder, decoder and cross attention. We use the encoder attention in this work, since it has been shown to capture most discourse information (Koto et al., 2021) and outperformed the other alternatives in preliminary experiments on a validation set.

How to derive trees from attention heads?
Given an attention matrix $A_t \in \mathbb{R}^{k \times k}$, where k is the number of tokens in the input dialogue, we derive the matrix $A_{edu} \in \mathbb{R}^{n \times n}$, with n the number of EDUs, by computing $A_{edu}(i, j)$ as the average of the submatrix of $A_t$ spanning all the tokens of EDUs $e_i$ and $e_j$, respectively. As a result, $A_{edu}$ captures how much EDU $e_i$ depends on EDU $e_j$ and can be used to generate a tree connecting all EDUs by maximizing their dependency strength. Concretely, we find a Maximum Spanning Tree (MST) in the fully-connected dependency graph $A_{edu}$ using the Eisner algorithm (Eisner, 1996). Conveniently, since an utterance cannot be anaphorically or rhetorically dependent on following utterances in a dialogue, as they are previously unknown, we can further simplify the inference by applying the following hard constraint to remove all backward links from the attention matrix: $A_{edu}(i, j) = 0$ if $i > j$.
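The aggregation and constraint steps can be sketched in a few lines of numpy (the Eisner decoder itself is assumed to come from an external implementation; the function name is illustrative):

```python
import numpy as np

def token_to_edu_attention(A_t, edu_spans):
    """Average the token-level attention matrix A_t (k x k) over the
    token spans of each EDU pair to obtain the EDU-level matrix
    A_edu (n x n), then apply the hard constraint a_ij = 0 if i > j
    to remove backward links."""
    n = len(edu_spans)
    A_edu = np.zeros((n, n))
    for i, (si, ei) in enumerate(edu_spans):
        for j, (sj, ej) in enumerate(edu_spans):
            A_edu[i, j] = A_t[si:ei, sj:ej].mean()
    return np.triu(A_edu)  # zero out all entries with i > j

# Toy example: 4 tokens grouped into 2 EDUs (spans are half-open).
A_t = np.arange(16, dtype=float).reshape(4, 4) / 16
A_edu = token_to_edu_attention(A_t, [(0, 2), (2, 4)])
```

The resulting upper-triangular matrix is then fed to the tree decoder.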
How to find the best heads?

Xiao et al. (2021) and Huber and Carenini (2022) showed that discourse information is not evenly distributed between heads and layers. However, they do not provide a strategy to select the head(s) containing the most discourse information. Here, we propose two effective selection methods: one fully unsupervised and one semi-supervised.

Unsupervised Best Head(s) Selection
Dependency Attention Support Measure (DAS): Loosely inspired by the confidence measure in Nishida and Matsumoto (2022), where the authors define the confidence of a teacher model based on the predictive probabilities of the decisions made, we propose the DAS metric, measuring the degree of support for the maximum spanning (dependency) tree (MST) from the attention matrix. Formally, given a dialogue g with n EDUs, we first derive the EDU matrix $A_{edu}$ from its attention matrix $A_g$ (see Sec. 3.3). We then build the MST $T_g$ by selecting $n-1$ attention links $l_{ij}$ from $A_{edu}$ based on the tree generation algorithm. DAS measures the strength of all those connections by computing the average score of the selected links: $DAS(T_g) = \frac{1}{n-1} \sum_{l_{ij} \in T_g} A_{edu}(i, j)$. Note that DAS can be easily adapted for a general graph by removing the restriction to $n-1$ arcs.
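Since DAS is just the mean attention mass over the selected tree links, it is a one-liner; a minimal sketch (illustrative names, not the authors' code):

```python
import numpy as np

def das(A_edu, tree_edges):
    """Dependency Attention Support: average attention strength over
    the n-1 links l_ij selected for the maximum spanning tree."""
    return float(np.mean([A_edu[i, j] for (i, j) in tree_edges]))

A_edu = np.array([[0.0, 0.9, 0.1],
                  [0.0, 0.0, 0.6],
                  [0.0, 0.0, 0.0]])
support = das(A_edu, [(0, 1), (1, 2)])  # (0.9 + 0.6) / 2
```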
Selection Strategy: With DAS, we can now compute the degree of support $DAS(T_g^h)$ from each attention head h on each single example g for the generated tree. We therefore propose two strategies to select attention heads based on the DAS measure, leveraging either global or local support. The global support strategy selects the head with the highest DAS score averaged over all data examples: $H_g = \arg\max_h \frac{1}{M} \sum_{g=1}^{M} DAS(T_g^h)$, where M is the number of examples. In this way, we select the head that has generally good performance on the target dataset. The second strategy is more adaptive to each document, focusing only on local support. It does not select one specific head for the whole dataset, but instead selects the head/tree with the highest support for each single example g, i.e., $H_l(g) = \arg\max_h DAS(T_g^h)$.
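The two strategies can be sketched as follows, assuming the per-head, per-example DAS scores have already been computed (the dictionary layout and head labels are our own illustrative choices):

```python
def select_global_head(das_scores):
    """das_scores[h] is a list of DAS values, one per dialogue, for
    head h. Global support: the head with the highest DAS averaged
    over all M examples."""
    return max(das_scores,
               key=lambda h: sum(das_scores[h]) / len(das_scores[h]))

def select_local_head(das_scores, g):
    """Local support: for dialogue g, the head whose tree has the
    highest DAS on that single example."""
    return max(das_scores, key=lambda h: das_scores[h][g])

# Two candidate heads, two dialogues.
scores = {"L10-H3": [0.6, 0.2], "L11-H7": [0.4, 0.5]}
```

Note that the local strategy may pick a different head for every dialogue, while the global one commits to a single head for the whole dataset.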

Semi-Supervised Best Head(s) Selection
We also propose selecting the best heads using a few annotated examples. In conformity with real-world situations where labeled data is scarce, we sample three small subsets with {10, 30, 50} data points (i.e., dialogues) from the validation set. We examine every attention matrix individually, resulting in 12 layers × 16 heads candidate matrices for each dialogue. Then, the head with the highest micro-F1 score on the validation set is selected to derive trees on the test set. We also consider layer-wise aggregation, with details in Appendix A.
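The semi-supervised selection reduces to scoring each head's predicted trees against the small labeled sample; a minimal sketch under our own data layout (sets of unlabeled (head, dependent) arcs per dialogue):

```python
def micro_f1(pred, gold):
    """Micro-averaged F1 over unlabeled arcs; pred/gold are lists of
    sets of (head, dependent) pairs, one set per dialogue."""
    tp = sum(len(p & g) for p, g in zip(pred, gold))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    prec, rec = tp / n_pred, tp / n_gold
    return 2 * prec * rec / (prec + rec)

def select_head_semisup(head_preds, gold):
    """Pick the attention head whose predicted trees score highest on
    the small labeled validation sample (illustrative helper)."""
    return max(head_preds, key=lambda h: micro_f1(head_preds[h], gold))

gold = [{(0, 1), (1, 2)}]
preds = {"A": [{(0, 1), (0, 2)}], "B": [{(0, 1), (1, 2)}]}
```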

Experimental Setup
Datasets: We evaluate our approach on predicting discourse dependency structures using the STAC corpus, a multi-party dialogue dataset annotated in the SDRT framework.
For the summarization and question-answering fine-tuning tasks, we use publicly available HuggingFace models (Wolf et al., 2020) (see Appendix F). For the novel sentence ordering task, we train BART on the STAC corpus and the DailyDialog corpus (Li et al., 2017). The key statistics for STAC and DailyDialog can be found in Table 1. These datasets are split into train, validation, and test sets at 82%/9%/9% and 85%/8%/8%, respectively. The Molweni corpus (Li et al., 2020) is not included in our experiments due to quality issues, as detailed in Appendix B.
Baselines: We compare against the simple yet strong unsupervised LAST baseline (Schegloff, 2007), which attaches every EDU to the previous one. Furthermore, to assess the gap between our approach and supervised dialogue discourse parsers, we compare with the Deep Sequential model by Shi and Huang (2019) and the Structure Self-Aware (SSA) model by Wang et al. (2021).
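The LAST baseline is trivial to express; a sketch using (head, dependent) arc pairs:

```python
def last_baseline(n_edus):
    """LAST baseline: attach every EDU to the immediately preceding
    one; EDU 0 is the root and receives no incoming arc."""
    return [(i - 1, i) for i in range(1, n_edus)]
```

Its simplicity is also its weakness: by construction it can never predict a non-adjacent arc.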

Metrics: We report the micro-F1 and the Unlabeled Attachment Score (UAS) for the generated naked dependency structures.
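As a reference for how we use UAS here, a minimal sketch (representation is our own: one head index per EDU, root marked as -1):

```python
def uas(pred_heads, gold_heads):
    """Unlabeled Attachment Score: fraction of EDUs whose predicted
    head matches the gold head."""
    assert len(pred_heads) == len(gold_heads)
    correct = sum(p == g for p, g in zip(pred_heads, gold_heads))
    return correct / len(gold_heads)
```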
Implementation Details: We base our work on the transformer implementations from the HuggingFace library (Wolf et al., 2020) and follow the text-to-marker framework proposed in Chowdhury et al. (2021) for the SO fine-tuning procedure. We use the original separation of train, validation, and test sets; set the learning rate to 5e-6; use a batch size of 2 for DailyDialog and 4 for STAC; and train for 7 epochs. All other hyper-parameters are set following Chowdhury et al. (2021). We do not perform any hyper-parameter tuning. We omit 5 documents in DailyDialog during training since their lengths exceed the token limit. We replace speaker names with markers (e.g. Sam → "spk1"), following the preprocessing pipeline for dialogue utterances in PLMs.
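The speaker-marker preprocessing can be sketched as follows; the "Name: text" utterance format is an illustrative simplification, not necessarily the corpus format:

```python
def anonymize_speakers(utterances):
    """Replace speaker names with generic markers (e.g. Sam -> "spk1"),
    assigning markers in order of first appearance."""
    mapping, out = {}, []
    for u in utterances:
        name, _, text = u.partition(": ")
        marker = mapping.setdefault(name, f"spk{len(mapping) + 1}")
        out.append(f"{marker}: {text}")
    return out
```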

Results with Unsupervised Head Selection
Results using our novel unsupervised DAS method on STAC are shown in Table 2 for both the global (H g ) and local (H l ) head selection strategies. These are compared to: (1) the unsupervised LAST baseline (at the top), which only predicts local attachments between adjacent EDUs. LAST is considered a strong baseline in discourse parsing (Muller et al., 2012), but has the obvious disadvantage of completely missing long-distance dependencies, which may be critical in downstream tasks. (2) The supervised Deep Sequential parser by Shi and Huang (2019) and the Structure Self-Aware model by Wang et al. (2021) (center of the table), both trained on STAC, reaching 71.4% and 73.8% F1, respectively. In the last sub-table, we show unsupervised scores from pre-trained and fine-tuned LMs on three auxiliary tasks: summarization, question-answering and sentence ordering (SO) with the mixed shuffling strategy. We present the global head (H g ) and local head (H l ) performances selected by the DAS score (see Sec. 3.4.1). The best possible scores using an oracle head selector (H ora ) are presented for reference.
Comparing the values in the bottom sub-table, we find that the pre-trained BART model underperforms LAST (56.8), with the global head and local heads achieving similar performance (56.6 and 56.4, resp.). Noticeably, models fine-tuned on the summarization task ("+CNN", "+SAMSum") and question-answering ("+SQuAD2") only add marginal improvements compared to BART. In the last two lines of the sub-table, we explore our novel sentence ordering fine-tuned BART models. We find that the BART+SO approach surpasses LAST when using local heads (57.1 and 57.2 for DailyDialog and STAC, resp.). As is commonly the case, in-domain training performs best, which is further strengthened here by the special vocabulary in STAC. Importantly, our PLM-based unsupervised parser can capture some long-distance dependencies, in contrast to LAST (Section 6.2). Additional analysis regarding the chosen heads is in Section 6.1.

Results with Semi-Sup. Head Selection
While the unsupervised strategy only delivered minimal improvements over the strong LAST baseline, Table 3 shows that if a few annotated examples are provided, it is possible to achieve substantial gains. In particular, we report results on the vanilla BART model, as well as BART fine-tuned on DailyDialog ("+SO-DD") and STAC itself ("+SO-STAC"). We execute 10 runs for each semi-supervised setting ({10, 30, 50}) and report average scores and the standard deviation. The oracle heads (i.e., H ora ) achieve superior performance compared to LAST. Furthermore, using a small-scale validation set (50 examples) to select the best attention head remarkably improves the F1 score from 56.8% (LAST) to 59.3% (+SO-STAC). F1 improvements across increasingly large validation-set sizes are consistent, accompanied by smaller standard deviations, as would be expected.

We now take a closer look at the performance degradation of our unsupervised approach based on DAS in comparison to the upper bound defined by the performance of the oracle-picked head. To this end, Figure 4 shows the DAS score matrices (left) for three models, with the oracle heads and DAS-selected heads highlighted in green and yellow, respectively. These scores correspond to the global support strategy (i.e., H g ). It becomes clear that the oracle heads do not align with the DAS-selected heads. Comparing the models, we find that discourse information is consistently located in deeper layers, with the oracle heads (light green) consistently situated in the same head for all three models. It is important to note that this information cannot be determined beforehand and can only be uncovered through a thorough examination of all attention heads.
While not aligned with the oracle, the top-performing DAS heads (in yellow) are among the top 10% of heads in all three models, as shown in the box plot on the right. Hence, we confirm that the DAS method is a reasonable approximation for finding discourse-rich self-attention heads among the 12 × 16 attention matrices.

Document and Arc Lengths
The inherent drawback of the simple, yet effective LAST baseline is its inability to predict indirect arcs. To test if our approach can reasonably predict distant arcs of different lengths in the dependency trees, we analyze our results with respect to arc length. Additionally, since longer documents tend to contain more distant arcs, we also examine the performance across different document lengths.
Arc Distance: To examine the extracted discourse structures for data subsets with specific arc lengths, we present the UAS score plotted against arc length on the left side of Figure 5. Our analysis shows that direct arcs achieve a high UAS score (> 80%), independent of the model used. We further observe that the performance drops considerably for arcs of distance two and onwards, with almost all models failing to predict arcs longer than 6. Notably, however, the BART+SO-STAC model correctly captures an arc of distance 13. Note that the presence of long-distance arcs (≥ 6) is limited, accounting for less than 5% of all arcs.
We further analyze the precision and recall scores when separating dependency links into direct (adjacent forward arcs) and indirect (all other non-adjacent arcs), following Xiao et al. (2021). For direct arcs, all models perform reasonably well (see Figure 6 at the bottom). The precision is higher (≈ +6% among all three BART models) and the recall is lower than the baseline (100%), indicating that our models predict fewer direct arcs, but more precisely. For indirect arcs (top in Figure 6), the best model is BART+SO-STAC (20% recall, 44% precision), closely followed by the original BART (20% recall, 41% precision). In contrast, the LAST baseline completely fails in this scenario (0 precision and recall). Document Length: Longer documents tend to be more difficult to process because of the growing number of possible discourse parse trees. Hence, we analyze the UAS performance of documents with respect to their length, here defined as the number of EDUs. Results are presented on the right side of Figure 5, comparing the UAS scores of the three selected models and LAST for different document lengths. We split the document length range into 5 even buckets between the shortest (2 EDUs) and longest (37 EDUs) document, resulting in 60, 25, 16, 4 and 4 examples per bucket. For documents with less than 23 EDUs, all fine-tuned models perform better than LAST, with BART fine-tuned on STAC reaching the best result. We note that PLMs exhibit an increased capability to predict distant arcs in longer documents. However, in the range of [23, 30] EDUs, the PLMs are inclined to predict a greater number of false-positive distant arcs, leading to under-performance compared to the LAST baseline. As a result, we see that longer documents (≥ 23 EDUs) are indeed more difficult to predict, with even the performance of our best model (BART+SO-STAC) strongly decreasing.
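The direct/indirect split used in this analysis can be sketched as follows (illustrative helpers over sets of (head, dependent) arcs, not the authors' evaluation code):

```python
def split_direct_indirect(arcs):
    """Partition (head, dependent) arcs into direct (adjacent forward,
    j = i + 1) and indirect (all other) links."""
    direct = {(i, j) for (i, j) in arcs if j == i + 1}
    return direct, set(arcs) - direct

def precision_recall(pred, gold):
    """Precision and recall of a predicted arc set against a gold set."""
    tp = len(pred & gold)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return prec, rec
```

Under this split, LAST predicts only direct arcs, so its indirect-arc precision and recall are zero by construction.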

Projective Trees Examination
Given the fact that our method only extracts projective tree structures, we now conduct an additional analysis, exclusively examining the subset of STAC containing projective trees, on which our method could in theory achieve perfect accuracy. Table 4 gives statistics for this subset ("proj. tree"). For the 48 projective tree examples, the average document length decreases from 11 to 7 EDUs; however, the subset still contains ≈ 40% indirect arcs, keeping the task difficulty comparable. The scores for the extracted structures are presented in Table 5. As shown, all three unsupervised models outperform LAST. The best model is still BART fine-tuned on STAC, followed by the cross-domain fine-tuned +SO-DD and vanilla BART models. Using the semi-supervised approach, we see further improvement, with the F1 score reaching 68% (+6% over LAST). For the degradation of precision and recall scores on direct and indirect edges, see Appendix C. Following Ferracane et al. (2019), we analyze key properties of the 48 gold trees compared to our structures extracted with the semi-supervised method. To test the stability of the derived trees, we use three different seeds to generate the shuffled datasets for fine-tuning BART. Table 6 presents the averaged scores and the standard deviation of the trees. In essence, while the extracted trees are generally "thinner" and "taller" than the gold trees and contain slightly fewer branches, they are well aligned with the gold discourse structures and do not contain "vacuous" trees, where all nodes are linked to one of the first two EDUs. Further qualitative analysis of inferred structures is presented in Appendix D. Tellingly, on two STAC examples our model succeeds in predicting > 82% of projective arcs, some of which span across 4 EDUs. This is encouraging, providing anecdotal evidence that our method is suitable for extracting reasonable discourse structures.

Performance with Predicted EDUs
Following previous work, all our experiments so far have started from gold-standard EDU annotations. However, this would not be possible in a deployed discourse parser for dialogues. To assess the performance of such a system, we conduct additional experiments in which we first perform EDU segmentation and then feed the predicted EDUs to our methods.
To perform EDU segmentation, we employ the DisCoDisCo model (Gessler et al., 2021), pre-trained on a random sample of 50 dialogues from the STAC validation set. We repeat this process three times to account for instability. Our results, as shown in Table 7, align with those previously reported in Gessler et al. (2021) (94.9), with an F-score of 94.8. In the pre-training phase, we utilize all 12 hand-crafted features (such as POS tag, UD deprel, and sentence length), and opt for treebanked data for enhanced performance (94.9, compared to 91.9 for plain-text data). The treebanked data is obtained using the Stanza Toolkit (Qi et al., 2020).
For evaluation, we adapt the discourse analysis pipeline proposed by Joty et al. (2015). The results are shown in Table 8, comparing predicted and gold EDUs. The best-head (i.e., H ora ) performance decreases by ≈ 7 points, from 59.5 to 52.6, as do the unsupervised and semi-supervised results. Despite the drop, our unsupervised and semi-supervised models still outperform the LAST baseline. A similar loss of ≈ 6 points is also observed for RST-style parsing in monologues, as reported in Nguyen et al. (2021).

Conclusion
In this study, we explore approaches to build naked discourse structures from PLM attention matrices to tackle the extreme data sparsity issue in dialogues. We show sentence ordering to be the best fine-tuning task, and our unsupervised and semi-supervised methods for selecting the best attention head outperform a strong baseline, delivering substantial gains especially on tree structures. Interestingly, discourse is consistently captured in deeper PLM layers, and more accurately for shorter links.
In the near future, we intend to explore graph-like structures from attention matrices, for instance by extending tree-like structures with additional arcs of high DAS score and applying linguistically motivated constraints, as in Perret et al. (2016). We would also like to expand the shuffling strategies for sentence ordering and explore other auxiliary tasks. In the long term, our goal is to infer full discourse structures by incorporating the prediction of rhetorical relation types, all while remaining within unsupervised or semi-supervised settings.

Limitations
Similarly to previous work, we have focused on generating only projective tree structures. This not only covers the large majority of the links (≈ 94%), but it can also provide the backbone for accurately inferring the remaining non-projective links in future work. We focus on the naked structure, as it is a significant first step and a requirement to further predict relations for discourse parsing.
We decided to run our experiments on the only existing high quality corpus, i.e., STAC. In essence, we traded-off generalizability for soundness of the results. A second corpus we considered, Molweni, had to be excluded due to serious quality issues.
Lastly, since we work with large language models and investigate every single attention head, computational efficiency is a concern. We used a machine with 4 GPUs, each with at most 11 GiB of VRAM. The calculation of one discourse tree on one head took approximately 0.75 seconds (in STAC, the average dialogue length is 11 EDUs), which quickly summed up to 4.5 hours for only 100 data points and 192 candidate trees in one LM. When dealing with much longer documents, for example AMI and the conversational section of GUM (on average > 200 utterances/dialogue), our estimation shows that one dialogue takes up to ≈ 2 minutes, which means 6.5 hours for 192 candidate trees. Even though we use parallel computation, the exhaustive per-head computation results in a tremendous increase in time and running storage. One possibility for future work is to investigate only the "discourse-rich" heads, mainly in the deeper layers.

Ethical Considerations
We carefully selected the dialogue corpora used in this paper to control for potential biases, hate speech, and inappropriate language by using human-annotated corpora and professionally curated resources. Further, we consider the privacy of dialogue partners in the selected datasets by replacing names with generic user tokens.
Since we are investigating the nature of the discourse structures captured in large PLMs, our work can be seen as making these models more transparent. This will hopefully help avoid unintended negative effects as the growing number of NLP applications relying on PLMs are deployed in practical settings.
In terms of environmental cost, the experiments described in the paper make use of RTX 2080 Ti GPUs for tree extraction and A100 GPUs for BART fine-tuning. We used up to 4 GPUs for parallel computation. The experiments on the STAC corpus took up to 1.2 hours for one language model, and we tested a dozen models. We note that while our work is based on exhaustive investigation of all the attention heads in PLMs to obtain valuable insights, future work will be able to focus on discourse-rich heads, which can help avoid the quadratic growth of computation time for longer documents.

A Semi-sup. Layer-Wise Results
We consider both layer-wise attention matrices (averaging the 16 attention heads of every layer, which gives 12 candidate layers) and head-wise attention matrices (taking each attention matrix individually, which results in 192 candidate matrices). Here we show the results obtained with layer-wise matrices for the whole test set and for the tree-like examples in Table 9 and Table 10.
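The two candidate sets can be enumerated from a stacked attention tensor roughly as follows (a minimal sketch assuming a 12-layer, 16-head model as in the paper; the function name and tensor layout are ours, not from any released code):

```python
import numpy as np

def candidate_matrices(attn, mode="head"):
    """Enumerate candidate attention matrices from a PLM.

    attn: array of shape (n_layers, n_heads, seq, seq), e.g. 12 x 16.
    mode="layer": average the heads of each layer -> n_layers candidates.
    mode="head":  each individual matrix -> n_layers * n_heads candidates.
    """
    if mode == "layer":
        return attn.mean(axis=1)                       # (n_layers, seq, seq)
    n_layers, n_heads, s, _ = attn.shape
    return attn.reshape(n_layers * n_heads, s, s)      # (n_layers*n_heads, seq, seq)

# Toy tensor: 11 EDUs, the average STAC dialogue length.
attn = np.random.rand(12, 16, 11, 11)
layerwise = candidate_matrices(attn, "layer")   # 12 candidate layers
headwise = candidate_matrices(attn, "head")     # 192 candidate heads
```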
B Molweni Corpus Analysis

Considering the complexity of Ubuntu chat logs (multiple speakers, entangled discussions on various topics), we first conducted an examination of the corpus. Disappointingly, we found heavy repetition across sequential documents and inconsistent discourse annotation of identical utterances. We thus decided not to include it in this work.

Table 11: Quantitative summary of link and relation inconsistency in the Molweni test set. "Theor =arc": number of arcs between the same utterances, which a priori should be linked in the same way; "Theor =rel": number of relations between the linked utterances.
Clusters: Among the 500 dialogues in the discourse-augmented test set, we found 105 "clusters", where one cluster groups all the documents that differ in only one or two utterances. For instance, documents 10 and 11 are in the same cluster since only the second utterance differs (Figure 10).

Figure 10: Apart from EDU_2, we expect the same links and relations among the other EDUs. However, we observe one link inconsistency (in red) and two relation inconsistencies (in blue).

In total, we find 6% link errors (#Err arc) between identical EDUs and 14% relation errors (#Err rel) in the test set (the validation and train sets show similar error rates). The scores are shown in Table 11. The Ubuntu Chat Corpus contains long dialogues with entangled discussions; pre-processing was applied to generate shorter dialogues. While these slightly different short dialogues could be interesting for other dialogue studies, our focus on discourse structure requires more varied data points and, most importantly, coherent discourse annotation.
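The cluster detection described above can be sketched as a simple greedy grouping (a hypothetical illustration with toy data; Molweni uses its own document IDs and the actual procedure may differ):

```python
def differing_utterances(doc_a, doc_b):
    """Number of positions at which two dialogues differ."""
    if len(doc_a) != len(doc_b):
        return max(len(doc_a), len(doc_b))
    return sum(a != b for a, b in zip(doc_a, doc_b))

def group_into_clusters(docs, max_diff=2):
    """Greedily group dialogues differing in at most `max_diff` utterances."""
    clusters = []
    for doc_id, doc in docs.items():
        for cluster in clusters:
            representative = docs[cluster[0]]
            if differing_utterances(doc, representative) <= max_diff:
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters

# Toy data mimicking documents 10 and 11: only the second utterance differs.
docs = {
    10: ["hi", "what do you use?", "emacs", "ok thanks"],
    11: ["hi", "which editor?",    "emacs", "ok thanks"],
    42: ["how do I mount a usb drive?", "use sudo mount"],
}
clusters = group_into_clusters(docs)
```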

C Precision and Recall Scores for Direct and Indirect Arcs in STAC Tree Set
To compare performance on the whole test set and the tree-structured subset, we present the recall and precision scores of BART (Fig. 7), BART+SO-DD (Fig. 8), and BART+SO-STAC (Fig. 9) separately.

Figure 7: Recall and precision metrics on the whole test set (darker color) vs. the projective tree subset (brighter color), with the BART model.
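The direct/indirect split behind these scores can be computed as follows (a hypothetical sketch; we assume arcs are represented as (head, dependent) EDU-index pairs, with direct arcs linking adjacent EDUs):

```python
def prf(gold, pred):
    """Precision and recall of a predicted arc set against the gold set."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

def split_direct_indirect(arcs):
    """Direct arcs link adjacent EDUs; all others are indirect."""
    direct = {(h, d) for h, d in arcs if abs(h - d) == 1}
    return direct, set(arcs) - direct

# Toy gold/predicted structures over 5 EDUs.
gold = [(0, 1), (1, 2), (1, 3), (0, 4)]
pred = [(0, 1), (2, 3), (1, 3), (0, 4)]

gold_dir, gold_ind = split_direct_indirect(gold)
pred_dir, pred_ind = split_direct_indirect(pred)
direct_scores = prf(gold_dir, pred_dir)      # (0.5, 0.5)
indirect_scores = prf(gold_ind, pred_ind)    # (1.0, 1.0)
```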

D Qualitative Analysis in STAC
We show a few concrete tree examples: 3 well predicted (Figures 11, 12, 13), 3 poorly predicted (Figures 14, 15, 16), and 2 random examples (Figures 17, 18). We observe some patterns in the poorly predicted structures: (1) chain-style prediction, as shown in Figures 15 and 18, where only adjacent EDUs are linked together; (2) inaccurate indirect-arc prediction, especially for long documents such as the one in Figure 16.

E Results with other PLMs
We test RoBERTa (Liu et al., 2019), DialoGPT, and DialogLED (DialogLM with Longformer) (Zhong et al., 2022) to see how different language models encode discourse information. As shown in Table 12, the most discourse-rich head in RoBERTa slightly underperforms BART (−0.2%), as do those of DialogLED (−0.4%) and DialoGPT (−1.4%). The sentence-ordering fine-tuned DialogLED model outperforms the original one, showing that our proposed SO task helps encode discourse information. Table 13 lists the models and the sources we obtained from the Huggingface library (Wolf et al., 2020).