Multitask Semi-Supervised Learning for Class-Imbalanced Discourse Classification

As labeling schemas evolve over time, small differences can render datasets following older schemas unusable. This prevents researchers from building on previous annotation work and results, in discourse learning in particular, in many small, class-imbalanced datasets. In this work, we show that a multitask learning approach can combine discourse datasets from similar and diverse domains to improve discourse classification. We show an improvement of 4.9% Micro F1-score over current state-of-the-art benchmarks on the NewsDiscourse dataset, one of the largest discourse datasets recently published, due in part to label correlations across tasks, which improve performance for underrepresented classes. We also offer an extensive review of additional techniques proposed to address resource-poor problems in NLP, and show that none of these approaches improves classification accuracy in our setting.

However, even as recent advances in NLP allow us to achieve impressive results across a variety of tasks, discourse learning, a supervised learning task, faces the following challenges: (1) discourse datasets tend to be very class-imbalanced; (2) discourse learning is a complex task: human annotators require training and conferencing to achieve moderate agreement (Das et al., 2017); (3) discourse learning tends to be resource-poor, as annotation complexities make large-scale data collection challenging (Table 1). Compounding the problem, a schema often evolves across different annotation efforts, preventing the compilation of smaller datasets into larger ones. We observe, however, that certain discourse schemata appear to offer complementary information. For example, the Penn Discourse and Rhetorical Structure Theory Treebanks offer intrasentential, low-level discourse information (Prasad et al., 2008; Carlson et al., 2003), while news discourse schemas offer intersentential, high-level, domain-specific discourse information (Choubey et al., 2020; Yarlott et al., 2018). Inspired by Collobert and Weston (2008)'s finding that lower-level NLP tasks (e.g., part-of-speech tagging) could aid higher-level tasks (e.g., semantic role labeling), we hypothesize that a multitask approach incorporating multiple discourse datasets can address the challenges listed above. Specifically, by introducing complementary information from auxiliary discourse tasks, we can increase performance for a primary discourse task's underrepresented classes.
We propose a multitask neural architecture (Section 2) to address this hypothesis. We construct tasks from 6 discourse datasets, an events dataset, and an unlabeled news dataset (Section 3), including a novel discourse dataset we introduce in this work. Although different datasets are developed under divergent schemas and have different goals, our framework learns correlations between schemas, and does not "waste" labeling work done by generations of NLP researchers.
Our experiments show that a multitask approach can improve discourse classification on a primary task, NewsDiscourse (Choubey et al., 2020), from a baseline performance of 62.8% Micro F1 to 67.7%, an increase of 4.9 points (Section 4), with the biggest improvements seen in underrepresented classes. In contrast, two data augmentation approaches, Training Data Augmentation (TDA) and Unsupervised Data Augmentation (UDA), fail to improve performance.
We give insight into why this occurs (Section 5). In the multitask approach, the primary task's underrepresented labels are correlated with labels in other datasets. However, if we only provide more data without any correlated labels (TDA and UDA), we overpredict the overrepresented labels. We test many other approaches proposed to address class imbalance and observe similar negative results (Appendix F). Taken together, this analysis indicates that the signal from labeled datasets is essential for boosting performance in class-imbalanced settings.
In summary, our core contributions are:
• We show a 4.9 F1-score improvement above state-of-the-art on the NewsDiscourse dataset and introduce a novel dataset with 67 labeled articles based on an expanded Van Dijk news discourse schema (Van Dijk, 2013).
• What worked and why: we show that different discourse datasets in a multitask framework complement each other; correlations between labels in divergent schemas provide support for underrepresented classes in a primary task.
• What did not work and why: training data augmentation and semi-supervised data augmentation failed to improve above baseline because they overpredict overrepresented classes, thus hurting overall performance.

Methodology
We formulate a multitask approach to discourse learning with the NewsDiscourse dataset as our primary task (Section 3). Our multitask architecture uses shared encoder layers and task-specific classification heads; the framework can be seen as a multitask feature-learning architecture (Zhang and Yang, 2017). Our objective is to minimize the weighted sum of losses across tasks:

L(θ) = Σ_{t=1}^{T} α_t L_t(D_t; θ)     (1)

where D = {D_t}_{t=1}^{T} is our joined dataset; D_t = {(x_i[, y_i])}_{i=1}^{N_t} are task-specific datasets for tasks t = 1, ..., T, each of size N_t (labeled and unlabeled; [, y_i] indicates that for some tasks, labels y_i are not present); L_t is the task-specific loss; and the hyperparameter α = {α_t}_{t=1}^{T} is a coefficient vector that weights the loss from each task. In each training step, we randomly sample one task t and one datum (x_i[, y_i])_t from that task's dataset, D_t.
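The objective and per-step sampling above can be sketched as follows (a minimal stdlib illustration; the function names and data layout are ours, not from the paper's released code):

```python
import random

def multitask_loss(task_losses, alpha):
    """Weighted sum of per-task losses: L = sum_t alpha_t * L_t (Eq. 1)."""
    assert len(task_losses) == len(alpha)
    return sum(a_t * l_t for a_t, l_t in zip(alpha, task_losses))

def sample_step(datasets, rng=random):
    """One training step: pick a task t at random, then one datum
    (x_i[, y_i]) from D_t. Unsupervised tasks store y_i as None."""
    t = rng.randrange(len(datasets))
    datum = rng.choice(datasets[t])
    return t, datum
```

In practice the sampled datum's loss would be backpropagated with its task's α weight, so the expected gradient matches the weighted objective.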

Neural Architecture
Our neural architecture (Figure 1) consists of a sentence-embedding layer and, in some experimental variations, embedding augmentations; a classification layer for the primary task; and separate classification layers for auxiliary supervised tasks. The architecture we use to model our supervised tasks is inspired by previous work in sentence-level tagging and discourse learning (Choubey et al., 2020; Li et al., 2021). We use RoBERTa-base (Liu et al., 2019) to generate sentence embeddings (Figure 1). Sentences in each document are read sequentially by the same model, and the </s> token from each sentence is used as the sentence-level embedding. The sequence of sentence embeddings is passed through a Bi-LSTM layer to provide context. These layers are shared between tasks. Additionally, we experiment with concatenating different embeddings to the sentence embeddings to provide document-level and sentence-positional information: we concatenate headline embeddings and document embeddings, generated as described in Choubey et al. (2020), and sentence-positional embeddings, as described in Vaswani et al. (2017). Each output embedding is classified using a task-specific feed-forward layer. Some of our tasks (including our primary task) are multiclass and others are multilabel. We discuss our datasets (and tasks) in the next section.
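As a concrete illustration of the embedding augmentations, the following sketch builds the sinusoidal sentence-position encoding of Vaswani et al. (2017) and concatenates it, along with headline and document embeddings, onto a sentence embedding (dimensions and names are illustrative; the paper does not release this exact code):

```python
import math

def positional_embedding(pos, dim):
    """Sinusoidal position encoding (Vaswani et al., 2017):
    pe[2i] = sin(pos / 10000^(2i/dim)), pe[2i+1] = cos(same angle)."""
    pe = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:dim]

def augment_sentence_embedding(sent_emb, doc_emb, headline_emb, pos, pos_dim=16):
    """Concatenate headline, document, and positional information onto a
    sentence embedding before the shared Bi-LSTM layer."""
    return sent_emb + headline_emb + doc_emb + positional_embedding(pos, pos_dim)
```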

Datasets
We use 8 datasets in our multitask setup, shown in Table 1 (n_j denotes the size of class j, with n_1 > ... > n_k). Four datasets contain sentence-level labels and no relational labels; two contain annotations of clausal relations; one is an event-nugget dataset, where labels denote the presence of events in sentences; and one is an unlabeled news dataset. See Tables 4 and 5 for the label sets. (Variations on our method for generating sentence embeddings are reported in Appendix F.1. For more detail on the embedding augmentations, see Appendix E.1. Variations of the classification tasks and the loss function, aimed at addressing the class imbalance inherent in the VD2 dataset, are reported in Appendix F.2.)

Figure 1: Sentence-level classification model used for each prediction task. The </s> token in the RoBERTa model is used to generate sentence-level embeddings, </s>_i. A Bi-LSTM is used to contextualize these embeddings, c_i. Finally, FF is used to make class predictions, p_i. RoBERTa and the Bi-LSTM are shared between tasks; FF is the only task-specific layer.

Van Dijk datasets (VD1, VD2, VD3): Choubey et al. (2020), building on earlier news discourse schemas (Craig, 2006), released a dataset, NewsDiscourse (VD2), consisting of 802 articles from 3 outlets. We take VD2 as our primary task due to its size; as shown in Table 1, it is heavily class-imbalanced. We introduce a novel dataset (VD3) following the Van Dijk schema, which we expand to capture discourse elements related to "Explanatory Journalism" (Forde, 2007). VD3 contains 67 news articles with sentence-level labels, sampled from the ACE corpus without redundancy to VD1. We additionally label 10 articles from VD1 and find an interannotator agreement of κ = .69 (for more information on the dataset we introduce in this paper, see Appendix B.1).
A substantial volume of news discourse is not factual assertion, but analysis, explanation, and prediction (Steele and Barnhurst, 1996). We thus include the Argumentation dataset (ARG) (Al Khatib et al., 2016), which applies 5 labels to 300 news editorials from aljazeera.com, foxnews.com and theguardian.com. Each of these four datasets assigns a single label to each sentence. We treat them as multiclass datasets, as shown in Table 1.

Penn Discourse Treebank (PDTB) and Rhetorical Structure Theory Treebank (RST): These discourse datasets each consist of spans of text in articles; labels indicate how different spans relate to each other. We process each so that sentences are annotated with the set of all relations occurring at least once in the sentence (for more details, see Appendix B.2). Then, we downsample documents in each of these datasets so that the distribution of document lengths matches VD2: if p_m(n) and p_a(n) are the likelihoods of a document with n sentences in the main and auxiliary datasets, respectively, we sample each auxiliary document d with weight w_d = p_m(n)/p_a(n) (Austin and Stuart, 2015), where p_m(n) and p_a(n) are determined empirically as N_n/N_total (N_n: number of documents of sentence-length n; N_total: total number of documents). We match document lengths to control for biases introduced by shorter documents, as the full PDTB and RST contain a large number of short documents that are not representative of documents in VD2.

Some of Van Dijk's discourse elements differ based on temporal relation: for example, some elements describe events occurring before a main event (e.g., Previous Event (C2)) while others describe events occurring after (e.g., Consequence (M2)). To introduce more information about temporality, we use PDTB's tags pertaining to Temporal relations (we call this filtered dataset PDTB-t).
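The length-matched downsampling weights can be computed directly from the empirical document-length counts (a stdlib sketch; the helper name is ours):

```python
from collections import Counter

def length_matched_weights(main_lens, aux_lens):
    """Sampling weight w_d = p_m(n) / p_a(n) for each auxiliary document d
    with n sentences, where p_m and p_a are the empirical document-length
    distributions of the main and auxiliary datasets (Austin and Stuart, 2015)."""
    pm, pa = Counter(main_lens), Counter(aux_lens)
    n_main, n_aux = len(main_lens), len(aux_lens)
    weights = []
    for n in aux_lens:
        p_main = pm[n] / n_main   # zero if this length never occurs in main
        p_aux = pa[n] / n_aux     # always positive, since n came from aux
        weights.append(p_main / p_aux)
    return weights
```

Auxiliary documents with lengths unseen in the main dataset (e.g., very short ones) receive weight 0 and are never sampled.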
When processed as described above, each of these datasets assigns multiple labels to each sentence. We treat them as multilabel datasets.

Knowledge Base Population (KBP) 2014/2015: Some of Van Dijk's discourse elements differ based on the presence or absence of an event. For example, the elements Previous Event (C2) and Current Context (C1) both describe the context before a main event, but the former describes events while the latter describes general circumstances. We hypothesize that a dataset identifying event occurrence can help our model differentiate these elements. We collect an additional non-discourse dataset, the KBP 2014/2015 Event Nugget dataset, which annotates the trigger words for events by event type. We treat this as a multilabel dataset.

All-The-News (U): For the semi-supervised data-ablation experiments described in Section 4.3, we sample 6,000 documents from an unlabeled news dataset. We downsample in the manner described above for PDTB and RST.

Experiments and Results
In this section, we briefly discuss experiments using VD2 as a single classification task. Then, we discuss the experiments using VD2 in a multitask setting. Finally, we discuss our experiments with data augmentation as ablations. We leave a more detailed analysis of single-task experiments for Appendix F, focusing here on multi-task experiments.

Single Task Experiments
We observe, perhaps unsurprisingly, a 2-point F1-score improvement by using RoBERTa as a contextualized embedding layer rather than Choubey et al. (2020)'s baseline, ELMo (Peters et al., 2018) (RoBERTa in Table 2). We observe an additional 1.5-point F1-score improvement by freezing layers in RoBERTa (+Frozen in Table 2); we find that freezing layers closer to the input results in greater improvement, replicating Lee et al. (2019). Finally, we concatenate the embedding augmentations described in Section 2 (RoBERTa+EmbAug in Table 2).

Figure 2: Loss-coefficient weightings (α vector) across tasks and Macro vs. Micro F1-score, shown for: (a) a mix of trials (first two blue bars; MT-Micro and MT-Macro trials), (b) pairwise multitask runs (other blue bars), (c) the baseline (red bar), and (d) data ablations (yellow bars; UDA and TDA). Tasks are shaded green in strength proportional to their α value. When U is used, it is used with the UDA head. The hashed VD2 bar, for TDA, is data-augmented as described in Section 4.3. Pairwise tasks are shown in some rows to emphasize that a soft weighting α achieves maximal F1 scores.

Multi-Task Experiments
As shown in Table 2, the multitask setup achieves better results than any single-task experiment. We conduct our multitask experiment by performing a grid search over the loss weighting α (defined in Equation 1). We select the top-performing α for Micro F1-score as well as for Macro F1-score based on a validation split, and report results on a test split. As can be seen in Figure 2, the weighting achieving the top Micro F1-score includes datasets VD2, ARG, RST and PDTB-t, while the weighting achieving the top Macro F1-score includes datasets VD2, ARG, VD3, and RST.
To understand the effect of each dataset individually, we run a linear regression on the α values and F1-scores found in our grid search. The regression coefficients, β, displayed in Table 3, approximate the effect each dataset has. We conduct over 600 trials in our grid search and thus have confidence in these results.
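This regression can be reproduced with a small ordinary-least-squares routine over the grid-search trials, where each row holds an intercept plus the α value for each dataset (a stdlib sketch; the paper does not specify its regression implementation):

```python
def ols_fit(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y,
    solved with Gaussian elimination. Each row of X is [1, a_1, ..., a_T]
    (intercept plus per-dataset alpha); y holds the trial F1-scores."""
    k = len(X[0])
    # Build the normal equations.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Forward elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta
```

Each fitted coefficient approximates the marginal effect of raising one dataset's α on the resulting F1-score.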

Data Ablation Experiments
To test our hypothesis that labeled information in the multitask setup helps us achieve higher accuracy, we perform the following ablation: we test using additional data that does not contain new label information. We test two methods of data augmentation: Training Data Augmentation (TDA) and Unsupervised Data Augmentation (UDA).

TDA enhances supervised learning (DeVries and Taylor, 2017) by increasing the size of the training dataset through data augmentations on the training data; it exploits the smoothness assumption in semi-supervised learning to help our model be more robust to local data perturbations (Van Engelen and Hoos, 2020). For each datapoint (x_i, y_i) in our primary dataset, we generate k = 10 noisy samples (x̂_i1, y_i), ..., (x̂_ik, y_i). We use a sampling-based backtranslation function to generate augmentations for both TDA and UDA (Edunov et al., 2018): we use Fairseq's English-to-German and English-to-Russian models (Ott et al., 2019) and, inspired by Chen et al. (2020), generate backtranslations using random sampling with a tunable temperature parameter instead of beam search, to ensure diversity in the augmented sentences.

UDA is a form of semi-supervised learning that propagates signal from labeled to unlabeled datapoints, making use of the manifold assumption in semi-supervised learning (Xie et al., 2020; Van Engelen and Hoos, 2020). UDA seeks to promote consistency between model predictions on unlabeled datapoints, p_θ(x_i), and on their k augmentations, {p_θ(x̂_ij)}_{j=1}^{k}, by minimizing the KL-divergence between them via a consistency loss. Both techniques were chosen as they have been shown to boost the performance of low-resource NLP classifiers above other semi-supervised methods (DeVries and Taylor, 2017; Berthelot et al., 2019; Chen et al., 2020; Xie et al., 2020; Hyun et al., 2020).
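The UDA consistency term can be sketched as the average KL-divergence between the prediction on an unlabeled datapoint and the predictions on its k augmentations (a simplified illustration; full UDA as in Xie et al. (2020) additionally stops gradients through the unlabeled prediction and sharpens it, which we omit here):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def uda_consistency_loss(p_unlabeled, p_augmented):
    """Average KL between the model's prediction on an unlabeled datapoint,
    p_theta(x_i), and its predictions on the k augmentations, p_theta(x_hat_ij)."""
    k = len(p_augmented)
    return sum(kl_divergence(p_unlabeled, p_aug) for p_aug in p_augmented) / k
```

The loss is zero when the model treats each augmentation identically to the original, and grows as predictions diverge.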
Because both techniques introduce more data without introducing more labels, they address the question: did multitask learning improve accuracy only by introducing more data?
As shown in Table 2 and Figure 2, TDA and UDA fail to improve performance above the single-task experiments (RoBERTa+EmbAug). To interrogate further, we explored approaches introduced by Xie et al. (2020) and Hyun et al. (2020) to improve convergence of UDA. Specifically, we use a confidence threshold, r, to mask out uncertain unlabeled data; Training Signal Annealing (TSA), to mask out uncertain labeled data; a suppression coefficient, β, to decrease unsupervised loss contributions for low-support classes; and other methods. We test a range of values for each of these hyperparameters. In particular, we find that TSA with a Linear schedule has a dramatic effect on accuracy, nearly rescuing the performance of UDA. We show UDA with and without TSA (Figure 3, Table 2) to demonstrate this, yet we are unable to find a setting in which UDA or TDA beats multitask. Additionally, we add UDA as an unsupervised head in our multitask setup, similar to Collobert and Weston (2008) introducing language modeling as an unsupervised head. We find only one setting where it contributes to our multitask accuracy (MT-Macro in Figure 2 and Table 3).
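The TSA schedule and confidence threshold can be sketched as follows (based on the definitions in Xie et al. (2020); the constants in the log/exp schedules are their defaults, and the function names are ours):

```python
import math

def tsa_threshold(step, total_steps, num_classes, schedule="linear"):
    """Training Signal Annealing (Xie et al., 2020): a labeled example is
    masked out while the model's probability on its true class already
    exceeds eta(t), which anneals from 1/K up to 1 over training."""
    t = step / total_steps
    if schedule == "linear":
        a = t
    elif schedule == "log":
        a = 1 - math.exp(-t * 5)
    else:  # "exp"
        a = math.exp((t - 1) * 5)
    return a * (1 - 1 / num_classes) + 1 / num_classes

def mask_uncertain_unlabeled(probs, r=0.8):
    """Confidence threshold: keep an unlabeled datum only if the model's
    maximum class probability exceeds r."""
    return [p for p in probs if max(p) > r]
```

Early in training the TSA threshold is near chance (1/K), so almost all labeled examples are masked; by the end it reaches 1 and every example contributes.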

Discussion
As shown in Figure 3, a multitask approach significantly increases performance for underrepresented classes while not hurting performance for others. This is in contrast to pure data augmentation approaches, like UDA or TDA. Improving performance in low-support classes improves overall Macro F1, as expected, and Micro F1 (Table 2).
Multitask learning can help learn the part of the data manifold where an underrepresented class exists by learning signal from a class with which it is correlated. (See Appendix G for a detailed discussion of the UDA convergence approaches described in Section 4.3 and our explorations. The top-performing hyperparameters we found were: r = .8, TSA = Linear, β = 0, k = 5, p = 8, α_UDA = .8, τ = .8. Xie et al. (2020) do not share their explorations; we find that the choice of p, the number of unlabeled data, and k, the number of augmentations per datum, has a significant impact on performance.) Tables 4 and 5 show the correlation between class labels predicted by our multitask model on the same dataset using different heads. One insight from Table 4 is a simple sanity check: the Van Dijk datasets largely agree on the labels that share similar definitions. For example, there is a strong correlation between sentences labeled Main Event (M1) by the VD2 head and those labeled Main Event by the VD3 head.
However, a more interesting insight is the strong correlation existing between underrepresented classes in the VD2 dataset and classes in other datasets. Classes Consequence (M2) and Anecdotal Event (D2) are two of the lowest-support classes, yet they each have strong correlations with labels in every other dataset.
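Correlations like those in Tables 4 and 5 can be computed by taking, for each label, the per-sentence prediction vector from each head and applying Spearman correlation, i.e., the Pearson correlation of ranks (a stdlib sketch; the paper does not describe its exact implementation):

```python
def ranks(xs):
    """Average ranks; tied values receive the mean of their rank positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```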
Table 4: Spearman correlations between labels predicted with the VD2 head and the Argumentation, VD3 and VD1 heads. Note that the two Van Dijk datasets have high correlations between most labels that they have in common. Correlations above |r| > .2 are shown.

We pause to comment on the differences in task weightings observed in Figure 2 between MT-Micro and MT-Macro. In class-imbalanced settings, Micro F1-score is weighted more towards high-support classes, while Macro F1-score weights each class equally. Because different auxiliary tasks boost performance for different classes, it is reasonable to expect the same α to lead to different Macro F1 and Micro F1 scores (for more information, see Appendix D).

One future direction is to identify criteria for including promising discourse tasks in a multitask framework. Bingel and Søgaard (2017) performed such an analysis for multitask setups including POS-tagging and keyphrase detection, and the present work demonstrates the impact such criteria could have in aiding discourse tagging. One criterion for inclusion might be based on the label correlations between the main discourse task and a candidate task. However, obtaining correlations would require training a multitask model; at that point, directly calculating the accuracy boost would be trivial. Identifying discourse-relevant features in the input data, x, as Bingel and Søgaard (2017) did in their work, might be more fruitful.
A competing explanation to our hypothesis that multitask improves performance through label correlations is that additional datasets simply expose the model to more of the data-input space, x. Both UDA and TDA serve as ablation studies for this. Hyun et al. (2020) show that, for class-imbalanced problems, regions of the data manifold that contain the underrepresented classes generalize poorly when data augmentation is used. Indeed, we show in Figure 4 that TDA and UDA overpredict overrepresented classes, perhaps showing that these algorithms misjudge the extent of underrepresented classes on the data manifold.
One approach to improving semi-supervision would be to consider a more sophisticated annealing algorithm: as discussed in Section 4.3, TSA nearly rescued UDA's performance for all labels. Another would be to generate more augmentations for underrepresented classes (Shorten and Khoshgoftaar, 2019), either on the training data for TDA (Chawla et al., 2002) or by using a model to identify promising unlabeled points for UDA. Upsampling underrepresented labels in sequences, as our data are structured, presents a challenge because we can only sample the entire sequence (i.e., the document). Thus, if we try to upsample individual underrepresented classes (i.e., sentences), we will also be upsampling overrepresented classes in the same sequence.
As a final piece of analysis on our multitask setup, we show the reduction of confusion between MT-Macro and the baseline in Figure 5 (for a more extended analysis, see Appendix C). We identify reductions in two main classes of confusion: temporal confusion, or confusion between the temporal ordering of discourse elements (e.g., Previous Event and Consequence); and event-based confusion, or confusion between tags that are semantically similar except for the presence of an event (e.g., Current Context and Previous Event). While we hypothesize that the reduction is due to the addition of temporal information from PDTB-t and event information from RST, more experimentation is needed to confirm this.
We close our discussion with an analysis of VD2's task difficulty. We ask expert annotators to relabel VD2 data. Our annotators read Choubey et al. (2020)'s annotation guidelines and labeled a few trial examples. Then they sampled and annotated 30 documents from VD2 without observing VD2's labels. Annotations in this Blind pass were significantly worse than predictions made by our best model (Table 2). Then, our annotators observed VD2's labels on the 30 articles, discussed, and changed where necessary. Surprisingly, even in this Post-Reconciliation pass, our annotators rarely scored more than 80% F1-score.
Thus, the Van Dijk labeling task might face an inherent level of legitimate disagreement, which MT-Macro seems to be approaching. However, there are two classes, M1 and M2, where MT-Macro underperformed even the Blind annotation. For these classes, at least, we expect that there is further room for modeling improvement through: (1) annotating more data, (2) incorporating more auxiliary tasks in the multitask setup, or (3)

Related Work
Ruder (2017) gives a good overview of multitask learning in NLP more broadly. A major early work by Collobert and Weston (2008) uses a single CNN architecture to jointly learn 5 different supervised NLP tasks (e.g., part-of-speech tagging) and one unsupervised task (language modeling), improving performance on their main task. Our work differs in several key aspects: (1) we are concerned with sentence-level tasks; (2) we consider a softer approach to task inclusion, via α; (3) we perform a deeper analysis of why multitask helps, including examining inter-task prediction correlations and class imbalance.
Aside from using different datasets that share the same language, researchers have also used datasets from one language to perform tasks in another. In tasks such as Information Extraction, English datasets have been translated into a target language, a target language has been translated into English, or a joint multilingual space has been learned. Our task may also have benefited from multilingual discourse datasets.
Most state-of-the-art research in discourse analysis specifically has focused on classifying the discourse relations between pairs of clauses, as is the practice in the Penn Discourse Treebank (PDTB) (Prasad et al., 2008) and the Rhetorical Structure Theory (RST) dataset (Carlson et al., 2003). Corpora and methods have been developed to predict explicit discourse connectives (Miltsakaki et al., 2004; Lin et al., 2009; Das et al., 2018; Malmi et al., 2018; Wang et al., 2018) as well as implicit discourse relations (Rutherford and Xue, 2016; Liu et al., 2016; Lan et al., 2017; Lei et al., 2017). Choubey et al. (2020) built a news article corpus in which each sentence is labeled with a discourse label defined in the Van Dijk schema (Van Dijk, 2013).
Since discourse analysis has limited resources, some work has explored a multitask framework to learn from more than one discourse corpus. Liu et al. (2016) propose a CNN-based multitask model and Lan et al. (2017) propose an attention-based multitask model to learn implicit relations in PDTB and RST. The main difference in our work is the coverage and flexibility of our framework: it is able to learn explicit and implicit discourse relations, multilabel and multiclass tasks, and labeled and unlabeled data in one framework, which makes it possible to take full advantage of corpora like PDTB and RST as well as corpora developed using the Van Dijk schema.

Conclusion
We have shown a state-of-the-art improvement of 4.9 points Micro F1-score above baseline, from 62.8% to 67.7%, for discourse tagging on the NewsDiscourse dataset, the largest dataset currently available for Van Dijk discourse tagging. This dataset poses a number of challenges: distinctions between discourse labels are complex and multifaceted, and the dataset is class-imbalanced, with overrepresented classes being 3 times more likely than underrepresented classes.
We showed that a multitask approach is especially helpful in this circumstance, improving performance for underrepresented labels. One reason might be the high correlations observed between label predictions across tasks, indicating that auxiliary tasks give signal to our primary task's underrepresented labels. Our auxiliary tasks include a novel dataset that we introduce, based on the same schema as our primary task with some minor alterations. An additional benefit of our approach is that it can reconcile datasets with slightly different schemas, allowing NLP researchers not to "waste" valuable annotations.
Finally, we perform a comparative analysis of other strategies proposed in the literature for dealing with small datasets or class-imbalanced problems. We show in exhaustive experiments, in Appendix F, that these approaches do not help us improve above baseline. These negative experiments include extensive analyses and provide a justification for the necessity of our multitask approach.

Acknowledgements
This work was performed while Alexander Spangher interned at Bloomberg. We would like to thank Nanyun Peng and Temma Choji for generous and helpful discussions throughout the ideation and execution of this research. We would like to thank all of our anonymous reviewers for very thoughtful and helpful advice throughout the review process (including previous review cycles, during which earlier versions of this paper were rejected). We would like to thank Ruihong Huang for generous discussions before this project started, and for sharing datasets with us. We would like to thank all the Bloomberg interns of the class of 2020, as well as students in Nanyun Peng and Jon May's labs, for comments and feedback during public discussions of this work. Finally, the first author would like to thank Bloomberg for a generous 3-year fellowship, which made this research possible.

Ethics Statement
The source material for pre-existing annotated corpora and VD3, the annotation we provide, is either (1) archived by LDC, which, through an arrangement with the original publisher, has licensed the data for use by members, (2) granted a CC-by-attribution license and released by Google Datasets, or (3) collected by archive.org and licensed freely for academic purposes. We only release annotations on the data, not the data itself. Annotation was done by the authors of this paper, who have been compensated for their work as part of their research roles. All the datasets are in the domain of news and news discourse, in formal news English; we would expect degradation from the performance presented in this work were the models evaluated on other domains (e.g., non-news English or any domain of non-English), though the degree of degradation has not been measured, as this work is chiefly concerned with English-language news discourse.

A Appendices Overview
The appendices convey two broad areas of analysis: (1) Additional explanatory information for our multitask setup and (2) Negative Experiments and Results. Appendix B contains more information on the datasets used, including labelsets for previously published work, the schema and annotation guidelines for the novel dataset we introduce, and processing information for RST and PDTB. Appendix C and D contain explanatory analysis. Appendix C shows that our multitask setup is reducing confusion between several important pairs of tags, giving further information and discussion beyond Figure 5 in the main body. Appendix D shows, for each tag, which α-weighting across tasks yields the highest score.
Appendix F provides more information about the negative results we obtained throughout our research and the explorations we performed, including details and mathematical definitions characterizing the additional experiments we ran. We believe that it is important to publish negative results, to help fight against publication bias (Easterbrook et al., 1991) and to help other researchers considering similar techniques. Where possible, we conducted explorations to understand why such results were negative, and what hyperparameters might be tuned to produce a positive result.

B Additional Information on Multi-task Datasets
We summarize the tag-set of each of the datasets we used in Table 6. For all previously published datasets, the tag schema can be found in the referenced publications.

B.1 Schema Definition Introduced in VD3
We provide additional information about VD3, the novel dataset we introduce in this work. Tagging was done by the first author, who has worked at The New York Times, a major newspaper, for 4 years. We consider him an expert annotator and, as mentioned in Section 3, he checked his process by relabeling 10 articles from VD1, finding an interannotator agreement of κ = .69. The schema used for VD3 was based on the schema introduced by Van Dijk (2013). The classification guidelines were as follows:

Lede: A hook to engage the reader in the main event; can be an anecdote, question or observation.

Main Event:
The major subject of the news report. It can be the most recent event that gave rise to the news report, or, in the case of an analytical news report, it can be a general phenomenon, a projected event, or a subject.
Consequence: An event or phenomenon that is caused by the main event or that directly succeeds the main event.
Previous Event: A specific event that occurred shortly before the main event. It either directly caused the main event, or provides context and understanding for the main event.
Circumstances: The general context or worldstate immediately preceding the main event. Similar to Previous Event, but not necessarily tied to a specific event.
Secondary Event: An event occurring in parallel to the main event, also succeeding and/or being caused by previous events or circumstances, usually used discursively to illustrate a trend. For example, "lax oversight" (circumstance) might be the cause of "major oil spill #1" (main event), and also "minor oil spills #2, #3 and #4" (secondary events).
Historical Event: An event occurring more than 2 weeks prior to the main event. Might still impact or cause the main event, but is more distal.
Expectation: An analytical insight into future consequences or projections made by the journalist.
Evaluation: A summary, opinion or comment made by the journalist on any of the other discourse components.
Explanation: A comment or opinion made by the journalist or source seeking to either establish a causal relation or justify in some other manner why events are occurring.
Verbal Reaction: A comment made by a source in a news article that does not necessarily serve another discursive purpose. Note: VD2 discards this category and includes another dimension (y_{i,2} = "Speech" or "Not Speech") on each sentence to capture this tag.

B.2 Additional Information on PDTB and RST Processing
Both PDTB and RST are relational discourse datasets, which provide span-level annotations and relational links between spans. We process each as shown in Figure 6 to better fit these datasets into our multitask framework: relational labels are mapped onto a sentence if a span that is part of that relation exists within that sentence. As shown in the figure, this holds for intersentential and intrasentential relations, and it results in a multilabel schema. In Table 7, we show the heuristic mapping scheme that we developed to reduce the dimensionality of the RST dataset.
Figure 6: We processed the Penn Discourse Treebank and Rhetorical Structure Theory datasets, which are both hierarchical and relation-focused, to be sentence-level annotation tags.
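As a concrete illustration, the span-to-sentence relabeling can be sketched as follows. This is a simplified sketch in Python; the function and variable names are ours, and the overlap rule is our reading of the Figure 6 processing:

```python
def spans_to_sentence_labels(sent_bounds, relations):
    # sent_bounds: list of (start, end) character offsets, one per sentence.
    # relations: list of (label, arg1_span, arg2_span), each span a (start, end).
    # A sentence receives a relation's label if either argument span overlaps
    # that sentence, yielding a multilabel tag set per sentence.
    labels = [set() for _ in sent_bounds]
    for label, *spans in relations:
        for (s, e) in spans:
            for i, (ss, se) in enumerate(sent_bounds):
                if s < se and e > ss:  # half-open interval overlap
                    labels[i].add(label)
    return labels
```

Note that an intersentential relation (arguments in two different sentences) tags both sentences, while an intrasentential relation tags only the sentence containing it.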

C Confusion Matrices
We identify two main classes of error, Temporal and Event-based error, from the confusion matrix shown in Figure 7a.
In the first case, temporal error, we observe confusion based on the temporal relation of events in discourse elements. For example, Previous Events, Historical Events and Current Contexts happen before the Main Event, while Consequences and Expectations happen after. The confusion between Previous Event and Consequence is one example of a temporal confusion, as is the confusion between Expectation and Previous Event. To address this confusion, we introduced PDTB-t, a version of PDTB filtered down to include only temporal relations. As can be seen in Table 4, PDTB-t is positively correlated with Consequence, and as shown in Table 5, PDTB-t contributes to temporal tags, like Previous Event and Expectation.
In the second case, event-based error, we observe confusion between discourse elements with similar meaning except for the presence or absence of an event. For example, Current Context and Previous Event both contextualize a Main Event, but Previous Event contains the literal description of an event, while Current Context does not. A similar confusion can be seen between Anecdotal Event and Evaluation. We hypothesized that adding KBP, a dataset specifically focused on identifying events, would reduce this error type; however, that was not observed. Further work tuning the size of the event dataset, or further tuning α, might yield more favorable results. Overall, the addition of the multitask datasets decreased confusion in these two main error classes, as shown in Figure 7b.

D Interrogating Multitask Dataset Contributions
In the main body of the paper, we interpreted the effects of the multitask setup by examining the overall increase in performance (Figure 2), the regressive effects of each dataset (Table 3) and the correlations between tag-predictions (Tables 4, 5). Another way to examine the contributions of each task is to analyze which combination of datasets results in the highest F1-score for each tag.
In Table 8, we show the α-weighting that results in the optimal F1-score for each tag. This gives us a sense of which datasets are important for that tag and how much of an improvement they give over the baseline MT-Micro.
Table 7: The mapping we developed to reduce dimensionality of the RST Treebank. The left column shows the tag-class which we ended up using for classification and the right column shows the RST tags that we mapped to that category. We determined this tag-mapping heuristically.
Elaboration: Elaboration-additional, Elaboration-general-specific, Elaboration-set-member, Example, Definition, Elaboration-object-attribute, Elaboration-part-whole, Elaboration-process-step
Temporal: Temporal-before, Temporal-after, Temporal-same-time, Sequence, Inverted-sequence
Enablement: Purpose, Enablement
Topic Change: Topic-shift, Topic-drift
For instance, a strong .3 weight for PDTB-t increases the performance for the Expectation tag, and a strong .27 weight for RST increases the performance of the Historical Event tag. This is possibly because both the Expectation tag and the Historical Event tag describe events either far in the future or far in the past relative to the Main Event, and both PDTB-t and RST contain information about temporal relations.
Interestingly, and perhaps conversely, a strong α-weighting for the ARG dataset (> .25) increases performance for Main Event, Previous Event, and Current Context. This set of tags might seem counterintuitive, since they all deal with factual statements and events, and by definition contain less commentary and opinion than tags like Expectation and Evaluation. However, if we cross-reference Table 8 with Table 4, we see strong positive correlations between these tags and ARG tags like Common Ground, Statistics and Anecdote.^24

E Additional Explanatory Results for Single Task Experiments

E.1 Embedding Augmentations
We experiment with concatenating different embeddings to our sentence-level embeddings. These help us incorporate information on document-topic and sentence-position: headline embeddings (H_i), generated via the same method as sentence-embeddings; sentence-level positional embeddings (vanilla (P_{i,j}) and sinusoidal (P^{(s)}_{i,j}) (Vaswani et al., 2017)); document embeddings (D_i); and document arithmetic (A_{i,j}).^25
Table 9 shows the results of these embedding augmentation experiments. As can be seen, these embeddings interact to increase accuracy: while no embedding alone increases the accuracy, combinations of different additional embeddings yield a higher increase in F1. Such augmentations, as we and others have demonstrated, are very important for document-level tasks such as discourse analysis, likely because they increase the amount of document-level information that is available (Choubey et al., 2020; Li et al., 2021).
Table 9: Sample of embedding augmentation combinations. Micro F1-score increase gained by adding the embedding augmentation above +Frozen. Headline embeddings are generated for documents with a headline via the same method as sentence-embeddings, and treated as sentence 0. Vanilla positional embeddings and sinusoidal embeddings are as described in Vaswani et al. (2017), but at the sentence level rather than the word level.
Figure 8: Here we show a sample of the different layer-wise freezing that we performed. The "Emb." block is the embedding lookup table for word-pieces. "Encoder" blocks closer to the input are visualized on the left, and blocks to the right are closer to the output. The red bar indicates unfrozen RoBERTa layers.
^24 We were surprised that ARG's Anecdote tag does not correlate with VD2's Anecdotal Event tag, but perhaps the definitions are different enough that, despite the semantic similarity between the labels, they are in fact capturing different phenomena.
^25 D_i and A_{i,j} are generated for sentence j of document i by using self-attention on the input sentence-embeddings to generate a document-level embedding, and performing the following arithmetic: D_i = Self-Att({S_{i,j}}_{j=1}^{N_i}), and A_{i,j} = (D_i * S_{i,j}) ⊕ (D_i − S_{i,j}), as described in Choubey et al. (2020). S_{i,j} is the sentence-embedding for sentence j of document i, and self-attention is defined by Cheng et al. (2016).
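A minimal sketch of the document-embedding and document-arithmetic computation described in footnote 25. This is pure Python for clarity (in practice this is done with tensors), the attention pooling is a simplified dot-product variant, and all names are illustrative rather than from our codebase:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_att_document(S):
    # S: list of sentence-embedding vectors for one document.
    # Pool them into a single document vector D_i with softmax-normalized
    # dot-product attention weights (a sketch of Self-Att({S_{i,j}})).
    n, d = len(S), len(S[0])
    scores = [sum(dot(S[j], S[k]) for k in range(n)) for j in range(n)]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    return [sum(w[j] * S[j][t] for j in range(n)) for t in range(d)]

def doc_arithmetic(D, S_ij):
    # A_{i,j} = (D_i * S_{i,j}) ⊕ (D_i − S_{i,j}): elementwise product
    # concatenated with elementwise difference, doubling the dimension.
    return [d * s for d, s in zip(D, S_ij)] + [d - s for d, s in zip(D, S_ij)]
```

A_{i,j} thus carries both similarity (product) and contrast (difference) between the sentence and its document context.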

E.2 Layer-Wise Freezing
As explored by Lee et al. (2019), layer-wise freezing for BERT-based architectures can have a dramatic effect on training accuracy. This is especially true when the datasets are small. We experimented with freezing different layers of our RoBERTa architecture. As shown in Figure 8, we observed a 1.5 F1-score boost from freezing all but the top two layers. We found that freezing combinations of higher-level layers yielded similar boosts, while freezing combinations of lower-level layers was detrimental. As suggested by Lee et al. (2019), this is likely due to the higher-level semantic information contained in the higher-level layers. This finding is especially relevant in a discourse task, where the labels convey abstract semantic information.
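The best-performing configuration (everything frozen except the top two encoder layers) can be sketched as a trainability plan. This assumes a 12-layer RoBERTa-base-style encoder; in PyTorch one would then set `param.requires_grad = plan[block]` for each block's parameters:

```python
def layer_freeze_plan(n_layers=12, n_unfrozen=2):
    # Map each block name to whether it is trainable: freeze the
    # word-piece embedding table and all but the top n_unfrozen
    # encoder layers. Block names mimic (but are not guaranteed to
    # match) Hugging Face-style parameter prefixes.
    plan = {"embeddings": False}
    for i in range(n_layers):
        plan[f"encoder.layer.{i}"] = i >= n_layers - n_unfrozen
    return plan
```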

F Additional Negative Results
In this section, we describe additional negative experiments. We hope that by sharing our exploration in this Appendix, we might inspire researchers working with similar tasks to consider these methods, or advancements of them. Table 11 shows the results of the experiments described in this section.

F.1 Sentence Embedding Variations
There are, as of this writing, three different techniques in the literature that use BERT-based word embeddings to produce sentence-embeddings (i.e. they go beyond simply using BERT's [CLS] token): Sentence-BERT, Sentence Weighted-BERT (Reimers and Gurevych, 2019), and SBERT-WK (Wang and Kuo, 2020). Sentence-BERT trains a Siamese network to directly update the [CLS] token. Sentence Weighted-BERT learns a weight function for the word embeddings in a sentence. SBERT-WK proposes heuristics for combining the word embeddings to generate a sentence-embedding.
None of the sentence-embedding variations yielded any improvement over the RoBERTa <s> token. It's possible that these models, which were designed and trained for NLI tasks, do not generalize well to discourse tasks. Additionally, we test two baselines: using the [CLS] token from BERT-base embeddings, and generating sentence-embeddings using self-attention on ELMo word-embeddings, as described in Choubey et al. (2020). These baselines show no improvement over RoBERTa. We see a need for a general pretrained sentence embedding model that can transfer well across tasks. We envision a sort of masked-sentence model, instead of a masked-word model. Such a model would extend next sentence prediction (Devlin et al., 2019); instead of simply predicting the next sentence based on the previous embedding, we would predict arbitrarily masked sentences from a sequence of sentences, thus giving greater contextualization. We leave this direction to future research.

F.2.1 Classification Task Variations
For variations on the classification task, we consider using a Conditional Random Field layer instead of a simple FF layer, which has been shown to improve results (Li et al., 2021). However, we do not see an improvement in this case, possibly because the Bi-LSTM layer prior to classification already induces sequential information to be shared.
We also experiment with a hierarchical classification approach. Inspired by Silva-Palacios et al. (2017), we construct K clusters, c_0, ..., c_k, of semantically-related labels in labelset Y, such that each class falls into one cluster of size N_{c_0}, ..., N_{c_k}.^26 We construct variables from each class-label, where L = N_{c_0} + ... + N_{c_k} is the original number of labels. We try modeling these variables two ways: (1) as a 2-level hierarchy, where the top level, ŷ_i^{(c)}, is one task and each sublayer, ŷ_i^{(c_0)}, ..., ŷ_i^{(c_k)}, is a separate task, or (2) as a multilabel classification task over ŷ. Our hierarchical classification shows no improvement over vanilla multiclass classification. It's possible that the transformer architecture already learns the label hierarchy implicitly, and the information we try to pass in by structuring the output space does not improve the prediction.
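For variant (1), each flat label can be re-expressed as a pair of targets: which cluster it belongs to, and which label it is within that cluster. A sketch, with illustrative clusters rather than our exact label set:

```python
def hierarchical_targets(label, clusters):
    # clusters: dict mapping cluster id -> ordered list of member labels.
    # Returns (top-level cluster target, within-cluster index) so the
    # top level and each cluster can be trained as separate tasks.
    for c, members in clusters.items():
        if label in members:
            return c, members.index(label)
    raise ValueError(f"label {label!r} not in any cluster")
```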

F.2.2 Loss Variations
Across a dataset, Binary Dice Loss can be expressed as DL(X) = 1 − (2 Σ_i p_{i,1} y_{i,1} + γ) / (Σ_i p_{i,1}^2 + Σ_i y_{i,1}^2 + γ), where p_{i,1} is the predicted probability of the positive class for example i, y_{i,1} is the gold label, and γ is a smoothing constant. A multiclass Dice Loss for k classes can be derived through macro-averaging, micro-averaging, or a squared sum, GDL(X) = Σ_{j=1}^{k} (1/N_j^2) * DL(p_j, y_j), introduced by Sudre et al. (2017). As shown in Table 10, Dice Loss (DL) and Self-Adjusting Dice Loss (SDL) fail to improve above Cross-Entropy Loss. The top-scoring loss was the vanilla DL formulation, with the squared-sum derivation of Sudre et al. (2017).
The addition in SDL of the term (1 − p_{i,1}) downweights tags that the model is more confident about. This idea has a similar aim as TSA (Xie et al., 2020), which excludes high-confidence predictions. The model becomes more confident as it is trained further; however, under SDL, it thus gets downweighted more. It's possible that with a TSA-like decay schedule, SDL would not underperform.
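One common formulation of this self-adjusting term multiplies p_{i,1} by (1 − p_{i,1}) inside the per-example dice ratio (following Li et al., 2020). The sketch below is our reading of that formulation; the function name and smoothing default are our own:

```python
def self_adjusting_dice(p, y, gamma=1.0):
    # p: predicted probability of the positive class per example.
    # y: gold labels in {0, 1}. gamma: smoothing constant.
    # The (1 - p_i) factor shrinks the contribution of examples the
    # model is already confident about, flattening their gradient.
    loss = 0.0
    for pi, yi in zip(p, y):
        w = (1.0 - pi) * pi
        loss += 1.0 - (2.0 * w * yi + gamma) / (w + yi + gamma)
    return loss / len(p)
```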

F.3 Multitask Head Freezing
Additionally, we experiment with freezing auxiliary heads (heads for tasks that are not VD2) in order to propagate more of the gradient into the shared layers. Note, according to Figure 1, that this is only the FF layer, which is not a major architectural change. We find that this yields no improvement.

G Unsupervised Data Augmentation: Analysis
Semi-supervised learning approaches can often achieve high accuracy with only a small labeled dataset (Van Engelen and Hoos, 2020).
Figure 10: The effect of increasing the unsupervised/supervised dataset ratio, p, on class predictions. As p increases, the number of underrepresented classes predicted approaches the true number, Y_true.

G.1 Dataset size exploration
In their original paper, Xie et al. (2020) do not give insight into how many unlabeled datapoints researchers should use in their semi-supervised setups. Here we explore that question by varying the size of our semi-supervised dataset in two dimensions: (a) the size of the unlabeled set relative to the labeled set, p, and (b) the number of augmentations per unlabeled datum, k. We show our results in Figure 9. As shown in Figure 9a, we reach a plateau between p = 6-10. We do not observe a plateau for the number of augmentations per datum (Figure 9b).
We hypothesize that the effect of increasing p is to help the model better predict underrepresented classes. As shown in Figure 10, as p increases, UDA is much better able to generalize the data manifold, and not to overpredict overrepresented classes. We did not explore, however, the effects of varying p for different classes; there are still many underrepresented classes where the optimal p is higher than 10.
We hypothesize that adding more augmentations per unlabeled datapoint is helpful in training because more augmentations might help the model more robustly explore the region of space around each unlabeled datapoint, thus mapping that region better. It's also possible that with more augmentations, say, k = 10, 20, 30, we would have enough signal to propagate to even more unlabeled data. We leave this question to future work.

G.2 Learning-techniques
Minimizing consistency loss, as a specific approach to semi-supervised learning, was explored prior to the proposal of UDA, most notably with the Mean Teacher method (Tarvainen and Valpola, 2017) and the Π method (Laine and Aila, 2017), and UDA mirrors such methods in its core optimization setup. However, simply minimizing consistency loss along with supervised loss fails to converge to a global optimum.
To address this problem, Xie et al. (2020) introduce curriculum learning techniques, including: (1) Training Signal Annealing (TSA), (2) softmax temperature sharpening, and (3) confidence-based thresholding. The authors do not show how the parameters of these techniques affect the training output, so we produce an analysis here. Based on our analysis, we find that the most important tool for our task is TSA.

G.2.1 TSA
TSA works as follows: a training example contributes to the supervised loss only while the model's confidence on its gold label, p_θ(y*|x), is below a threshold, η_t. η_t is increased throughout training; it is set to η_t = α_t * (1 − 1/K) + 1/K, where K is the number of classes (y ∈ {1, ..., K}) and α_t increases with either a linear (t/T), log (1 − exp(−(t/T) * 5)), or exponential (exp((t/T − 1) * 5)) schedule, where t is the current training step and T is the total number of training iterations.
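The threshold schedules above can be sketched directly from the definitions (function names are ours):

```python
import math

def tsa_threshold(t, T, K, schedule="linear"):
    # eta_t = alpha_t * (1 - 1/K) + 1/K, with alpha_t annealed from 0
    # to 1 over training step t of T total steps.
    r = t / T
    if schedule == "linear":
        alpha = r
    elif schedule == "log":
        alpha = 1 - math.exp(-r * 5)
    elif schedule == "exp":
        alpha = math.exp((r - 1) * 5)
    else:
        raise ValueError(schedule)
    return alpha * (1 - 1 / K) + 1 / K

def tsa_mask(probs_on_gold, t, T, K, schedule="linear"):
    # Keep a supervised example only while the model's probability on
    # the gold label is still below the annealed threshold.
    eta = tsa_threshold(t, T, K, schedule)
    return [p < eta for p in probs_on_gold]
```

At t = 0 the threshold equals chance (1/K), so almost all examples are masked out; by t = T it reaches 1, so every example contributes.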
We show the results of using a linear-decay schedule, an exponential-decay schedule and a logarithmic-decay schedule in Figure 11. As can be seen, both linear-decay and exponential decay achieve the same optimum, but the linear schedule arrives faster; the log schedule achieves the same optimum as UDA without TSA.

G.2.2 UDA Coefficient, ζ
We show the effects of other UDA hyperparameter tuning in Figure 12. Overall, these parameters either do not affect performance (like r) or hurt performance. For example, when ζ is 0, UDA reduces to a supervised task; when β is 1, SCL is just consistency loss (CL).
The UDA coefficient, ζ, shown in Figure 12a, simply weights the consistency-loss contribution to the overall loss: L_UDA = L_CE + ζ * L_con, where L_CE is cross-entropy loss and L_con is the consistency loss, averaged over the k data augmentations. A lower ζ, which results in a higher-performing model, corresponds to less contribution by consistency loss.
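The ζ-weighted objective can be sketched as below, with KL-divergence consistency averaged over the k augmented views of an unlabeled example. This is an illustrative simplification (UDA additionally applies sharpening and masking to the targets), and the names are ours:

```python
import math

def consistency_loss(p_orig, p_augs, eps=1e-12):
    # Mean KL(p_orig || p_aug) over the k augmented views of one
    # unlabeled example; eps guards against log(0).
    def kl(p, q):
        return sum(pi * math.log((pi + eps) / (qi + eps))
                   for pi, qi in zip(p, q))
    return sum(kl(p_orig, q) for q in p_augs) / len(p_augs)

def uda_loss(l_ce, p_orig, p_augs, zeta=1.0):
    # L_UDA = L_CE + zeta * L_con: zeta trades off the supervised
    # cross-entropy term against the consistency term.
    return l_ce + zeta * consistency_loss(p_orig, p_augs)
```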

G.2.3 Softmax Temperature, τ
The next two parameters, softmax temperature, τ, and confidence threshold, r, are designed to increase the weight of the unlabeled dataset. The softmax temperature sharpens the predictions on the original unlabeled datapoint (or, in one implementation, the augmented datapoint^27) through the following operation: p_θ^{(sharp)}(y|x) = exp(z_y/τ) / Σ_{y'} exp(z_{y'}/τ), where z_y is the logit output of the neural network. A lower temperature increases the values in each exponent, and sharpens the probability distribution over the classes, resulting in a higher consistency loss. According to Figure 12b, performance increases as τ increases, peaking at .8.
^27 https://github.com/SanghunYun/UDA_pytorch/blob/master/main.py#L113
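The sharpening operation is a temperature-scaled softmax, sketched here with the usual max-subtraction for numerical stability (which cancels out and leaves the formula above unchanged):

```python
import math

def sharpen(logits, tau):
    # p^(sharp)(y|x) = exp(z_y / tau) / sum_y' exp(z_y' / tau).
    # tau < 1 pushes mass toward the argmax; tau = 1 is a plain softmax.
    m = max(logits)
    e = [math.exp((z - m) / tau) for z in logits]
    s = sum(e)
    return [x / s for x in e]
```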

G.2.4 Confidence Threshold, r
The confidence threshold, r, masks out predictions on unlabeled data that the model is not confident about.
L_UDA = L_CE + I(max_{y'} p_θ(y'|x) > r) * L_con. (It is important to note that TSA does exactly the same thing but in reverse; however, TSA is applied to the supervised data while the confidence threshold is applied to the unlabeled data.) There is essentially no pattern observed between changing r and model performance, according to Figure 12c.
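The indicator-masked consistency term can be sketched as follows (names are ours; per-example consistency values are assumed precomputed):

```python
def thresholded_consistency(l_con_terms, p_dists, r):
    # Keep an unlabeled example's consistency term only if the model's
    # max predicted probability exceeds r, i.e. apply
    # I(max_y' p(y'|x) > r); low-confidence examples contribute nothing.
    kept = [l for l, p in zip(l_con_terms, p_dists) if max(p) > r]
    return sum(kept) / max(len(kept), 1)
```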

G.2.5 Suppressed Consistency Loss (SCL)
We try a simple alteration to semi-supervised learning with consistency loss (UDA) called suppressed consistency loss (SCL). SCL was suggested by Hyun et al. (2020) to reduce the impact of consistency training on lower-represented classes, where, the authors claim, the manifold of the latent space is underlearned and semi-supervised learning can be harmful. It is defined as L_SCL(X_i) = g(N_c) * L_con(X_i), where c = argmax(f_θ(X_i)) and g(z) is a function inversely proportional to z: g(z) = β^{1 − z/N_max} (with β ∈ (0, 1]). N_c is the number of training samples in the class predicted by the model, and N_max is the number of samples of the most frequent class.
The lower β is, the more class imbalance is used to downweight consistency loss. As can be seen, performance roughly increases as β approaches 1, indicating that suppressed consistency loss is not helpful.
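The suppression function g and the resulting per-example loss can be sketched directly from the definitions above (names are ours):

```python
def scl_weight(n_c, n_max, beta=0.5):
    # g(z) = beta ** (1 - z / N_max): equals 1 for the most frequent
    # class and shrinks toward beta for the rarest classes.
    return beta ** (1 - n_c / n_max)

def suppressed_consistency(l_con, class_counts, pred_class, beta=0.5):
    # L_SCL(X_i) = g(N_c) * L_con(X_i), where c is the class the model
    # predicts for X_i and class_counts maps class -> training count.
    n_max = max(class_counts.values())
    return scl_weight(class_counts[pred_class], n_max, beta) * l_con
```

With β = 1, every weight is 1 and SCL reduces to plain consistency loss, matching the observation above.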