EDTC: A Corpus for Discourse-Level Topic Chain Parsing

Discourse analysis has long been known to be fundamental in natural language processing. In this research, we present our insight on discourse-level topic chain (DTC) parsing, which aims at discovering new topics and investigating how these topics evolve over time within an article. To address the lack of data, we contribute a new discourse corpus with DTC-style dependency graphs annotated upon news articles. In particular, we ensure the high reliability of the corpus by utilizing a two-step annotation strategy to build the data and filtering out the annotations with low confidence scores. Based on the annotated corpus, we introduce a simple yet robust system for automatic discourse-level topic chain parsing.


Introduction
Topic information, as a crucial auxiliary for text understanding, has drawn great attention in recent decades (Wu et al., 2019; Wang et al., 2020; Sahlgren, 2020). In the literature, previous studies on topic modeling usually extract topics by introducing latent variables over tokens for topic assignment (Hofmann, 1999; Blei et al., 2003; Yishu et al., 2017). Similarly, research on text tiling obtains topic segments through lexical cohesion modeling (Hearst, 1997; Purver et al., 2006). Instead of measuring lexical cohesion, Rahimi et al. (2015) focus on evaluating the organization and cohesion of pieces of evidence and build topic chains over related text units. Besides, recent studies on argument mining explore building links or clusters for topic-dependent arguments (Wachsmuth et al., 2018; Shnarch et al., 2018; Reimers et al., 2019). Together, these studies suggest that there are certain structures among topic segments that deserve deeper exploration.
In this work, we aim to explore the cohesion of topic-related text segments. Different from Rahimi et al. (2015), we are interested in uncovering how fine-grained topics emerge, evolve, and disappear in an article, which is referred to as discourse-level topic chain (DTC) parsing. Since the DTC structure can provide relatively rich and low-noise information about certain topic aspects of articles, it is meaningful for various NLP tasks such as summarization (Perez-Beltrachini et al., 2019), document similarity measuring (Gong et al., 2018), and response generation (Dziri et al., 2019).
In the literature, topic detection and tracking (TDT) (Allan, 2002) is the research area most similar to DTC parsing; it aims at identifying new events and tracking how they change over time. However, the events in the TDT task refer to happenings at certain places and times, which compose only a small subset of general topics. Recently, Xi and Zhou (2017) manually annotated the first Chinese DTC corpus based on the theme-rheme theory (Halliday and Matthiessen, 2004). By contrast, due to the lack of a corpus, previous studies on English DTC parsing usually use unsupervised methods (Kim and Oh, 2011) to explore the structure and trends of important topics hidden within news articles. One intractable problem facing DTC parsing is therefore the lack of data.
This research is primarily motivated by Polanyi and Scha (1984) and Kim and Oh (2011) on the topic chain concept, Xi and Zhou (2017) on DTC corpus construction, and Reimers et al. (2019) on topic-dependent argument linking. Our contributions are twofold: (i) building an English corpus of discourse-level topic chains (EDTC) through a two-step annotation method and (ii) launching a simple but robust BERT-based baseline system for automatic DTC parsing. Moreover, as implied in recent research on discourse rhetorical structure (DRS) parsing (Zhang et al., 2020; Kobayashi et al., 2021; Zhang et al., 2021), discourse parsing remains challenging due to the lack of data. Under this circumstance, we annotate the

Corpus Annotation
Before detailing the annotation process, we formally introduce the notion of "topic" used in this paper. In topic modeling, a topic is usually viewed as a probability distribution over a fixed vocabulary (Liu et al., 2016). In addition, previous studies on argument mining usually define coarse-grained topic categories manually for either topic-dependent argument classification or clustering (Reimers et al., 2019). Different from previous work, topics in this study refer to fine-grained topic categories that fit the context. For example, given the sentence "House prices are expected to be fragile.", its coarse-grained topic label could be "economics" and its fine-grained label "house price". Comparing the two kinds of labels, the first is closer to the theme of an article, which is useful in text clustering or text tiling, while the second gives a more detailed description of the topic itself, which is more practical in discourse-level topic chaining. To make our annotation easier to follow, we present some preliminary definitions as follows:
Discourse Topic Unit (DTU) refers to the elementary topic unit in our annotated DTC structure. In the literature, Xi and Zhou (2017) hold the view that each sentence is composed of multiple DTUs with different sub-topics, which they refer to as elementary discourse topic units (EDTUs). Different from them, we study macro DTC structures in this work, where each sentence is taken as an independent DTU 1 . It is worth mentioning that not all DTUs are topic-bearing; there are also some sentences with no topic meaning, e.g., the sentence "Oops!".
1 Although we built the corpus based on RST-DT, it remains risky to directly take each elementary discourse unit (EDU) as a DTU since there are many competing hypotheses about what constitutes an EDU but without reference to "topic" (Carlson et al., 2001). Previous work on topic-dependent argument mining usually takes each independent sentence as an elementary unit, and this work is inspired by that research.
Topic Object (TO) could be a subject, an object, or another noun or noun phrase in the DTU that provides a certain basis for topic chain parsing. Usually, each TO is closely related to the topic of its DTU, and each DTU maintains an independent TO set. Notably, the "TO" mentioned here is not directly equivalent to the "entity" in co-reference resolution; judging a TO requires a comprehensive consideration of the document context. For example, given the DTU "Drexel Burnham Lambert Inc. is the adviser on the transaction.", if the surrounding context of the DTU is mainly about the company, then we choose "Drexel Burnham Lambert Inc." as a TO; if the context is mainly about the transaction, then we choose "transaction" as a TO, and we can also select both of them if necessary. It is worth mentioning that TOs can also be implicit, in which case they require human judgment.
Topic Event (TE) refers to the main phrase that most clearly expresses an event occurrence or a description of the TOs in the DTU. For the DTU u4 in Figure 1, we select "develop vaccines against the virus" as the topic event of the DTU.
With the above definitions in mind, we argue that each DTU is composed of a set of TOs and a core TE. Based on this concept, we give the following four annotation suggestions:
• Given two adjacent DTUs in a topic chain, their TO sets should have an intersection in the topic space. For the two DTUs u3 and u4 in Figure 1, although the two corresponding TO sets, {WHO, global expert networks} and {researchers in various countries, vaccines}, have no vocabulary intersection, they are highly related in the topic space on "international response". In a sense, the relationship between TO sets is similar to that between mentions in co-reference resolution or tokens in lexical chains. The difference is that DTC parsing requires not only the correlation between TO sets but also the topic transitivity between DTUs. Therefore, for any two adjacent DTUs on a topic chain, the TE in the second DTU should evolve from the TEs in the established chain where the first DTU is located.
• Sometimes a DTU may have topic relevance to multiple subsequent DTUs; in that case, we only opt for the closest and most relevant one for annotation. To achieve this, we follow two principles to build each arc in a topic chain: (1) for each DTU, we search for its topic-related DTU from near to far; (2) we label topic links for DTUs in order, and the annotated DTC structure is dynamically optimized during the human annotation process. For example, when comparing the current DTU (U-j) with previous ones, we directly replace the previously annotated arc (U-i, U-k) with (U-i, U-j) if the topic relevancy between U-i and U-j obviously surpasses that between U-i and U-k. In other words, we do not require all topic chains to be labeled, but we try to ensure the accuracy of the annotated chains as much as possible. This labeling strategy can enhance the value of this small-scale corpus to some extent.
• In news articles, many DTUs are organized in an overview-example format where similarities among the examples do exist but no topic evolution can be observed. In this study, we do not consider simple juxtapositions like this. Taking wsj_2349 as an example: "u1: The following issues were recently filed with the Securities and Exchange Commission: u2: American Cyanamid Co., offering of 1,250,000 common shares, via Merrill Lynch Capital Markets. u3: Limited Inc., offering of up to $300 million of debt securities. ... u8: Trans World Airlines Inc., offering of ...". There is a certain textual structure among the DTUs from u2 to u8 (e.g., they share the multinuclear relation List in the RST theory (Mann and Thompson, 1988)), but the topic transitivity is weak. Therefore, we do not mark any topic chains among these DTUs.
• Because writers tend to save words and avoid repetition, ellipsis and co-reference occur frequently. In such cases, we manually fill in the ellipsis and resolve the co-reference for better annotation.
Here we take the example in Figure 1 to illustrate the annotation process. Simply put, annotation is the process of comparing the TOs and TE of the current DTU with those of the previous ones. According to the annotation instructions, we do the comparison from near to far, aiming to obtain the closest path for two adjacent DTUs on the chain. For the DTU u1, its TO set contains two topic objects, i.e., "coronavirus" and "COVID-19", and its core topic event can be sketched as "coronavirus outbreak in Wuhan". Correspondingly, the TO set of u2 contains a pronoun object "it", referring to "coronavirus", and its core TE is manually detected as "there is still no idea how to beat it". The two TO sets have an intersection (i.e., "coronavirus") and the TE in u2 does evolve from that in u1. Consequently, we mark a topic link between the two DTUs. For u3, neither the TO set nor the TE meets our annotation requirements, so we link it to neither u1 nor u2. For u4, the TO set is relevant to that of u3 (both concern international institutions) and the two TEs are also interrelated, so we build a link between them. In this way, the overall vein of topic chains is built after several rounds of comparison. Notably, from the resulting graph we find that the topic chain with u1, u2, u5, and u-k on it does provide rich and low-noise information about the evolution of COVID-19, which reflects the practical value of our annotated DTCs.
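The near-to-far comparison and the arc replacement rule from the annotation suggestions can be sketched in code. This is only an illustrative sketch, not the annotation tool used for the corpus: the `relevancy` callback stands in for the human judgment of topic relevancy, and the `threshold` parameter, the one-predecessor-per-DTU restriction, and the function name `link_near_to_far` are all assumptions introduced here.

```python
from typing import Callable, Dict


def link_near_to_far(
    num_dtus: int,
    relevancy: Callable[[int, int], float],
    threshold: float = 0.5,
) -> Dict[int, int]:
    """Build topic arcs (i -> j) by visiting DTUs in document order.

    relevancy(i, j) is a hypothetical stand-in for the annotator's judgment
    of how strongly DTU j continues the topic of DTU i (higher is stronger).
    Returns a map from each linked DTU to its successor on the chain.
    """
    successor: Dict[int, int] = {}
    for j in range(1, num_dtus):            # label DTUs in order
        for i in range(j - 1, -1, -1):      # search earlier DTUs from near to far
            score = relevancy(i, j)
            if score < threshold:           # not topic-related enough (assumed cutoff)
                continue
            k = successor.get(i)
            # Add a new arc, or replace (i, k) with (i, j) when j is more relevant.
            if k is None or score > relevancy(i, k):
                successor[i] = j
                break                       # assumption: one predecessor per DTU
    return successor


if __name__ == "__main__":
    # Toy relevancy table for four DTUs; unlisted pairs count as unrelated.
    table = {(0, 1): 0.6, (0, 2): 0.9, (1, 2): 0.1}
    print(link_near_to_far(4, lambda i, j: table.get((i, j), 0.0)))
```

In the toy table, the arc from DTU 0 to DTU 1 is created first and then replaced by an arc to DTU 2 once the stronger relevancy is seen, which mirrors the replacement of (U-i, U-k) by (U-i, U-j) described above.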

Subjective Differences in Manual Annotation.
A Chinese saying about Shakespeare goes, "There are a thousand Hamlets in a thousand people's eyes". From the above annotation process, we find that one intractable problem of DTC annotation is the high degree of subjective difference between annotators. More precisely, judging whether the current TE evolves from the previous one is highly subjective, and it is hard to lay down strict rules for the annotators. We therefore tackle the issue from two aspects: (i) using a well pre-trained topic model to assist manual annotation in a two-step fashion and (ii) calculating the confidence scores of the annotations for data filtering.
Two-Step Annotation: The two-step method consists of two phases: first automatically building topic links between topic-related DTUs 2 and then manually refining the automatic annotations for DTC structures. As depicted in Figure 2, each DTU is preceded by an index pair (i, j) according to which u-i and u-j are connected through a topic link, and u-i is an ending unit when j equals -1. The solid arcs in the example refer to the topic links generated in the first stage. On this basis, we bring in an auxiliary marker to refine the chain structures, where "×" means that the initial topic arcs (either machine-labeled or manually labeled links) are unreasonable and should be deleted directly, and "=" means that the original arcs should be replaced with more proper topic links predicted by the human annotators, e.g., the dashed arcs in the example. In this way, we can dynamically optimize the DTC structures during the human annotation process, thus determining the most relevant DTUs for annotation. Our statistics show that around 37.4% of the automatic annotations are retained in the corpus, while the remaining 62.6% are invalid and were re-annotated by our annotators. Thus, although the automatic and manually annotated structures differ considerably, the topic links of the pre-trained model do provide a good reference for better annotation consistency.
2 Recently, Reimers et al. (2019) use superior contextualized language models for argument linking, which have proven to have great capabilities in aggregating arguments for unseen topics (https://github.com/UKPLab). To improve the reliability of the initial chains, we only keep the topic links with topic similarity higher than 0.9 in the first stage.
Annotation Confidence: As stated before, given the problem of subjective difference, it is challenging to build a topic link between two DTUs because we cannot be sure that they are the most relevant pair.
Although it is hard to strictly regulate the annotators' subjectivity, it is feasible to calculate the reliability of each annotation item. Therefore, we aim to ensure the quality of the corpus by filtering out the annotations with low confidence scores. Specifically, given the annotation result of the pre-trained topic model, (τ, ι), and those of three annotators, (τ, ν), (τ, ι), and (τ, ν), on the DTU τ, we set the confidence of the pre-trained topic model to 0.5 and that of each human annotator to 1; the confidence score of each annotation on τ is then calculated as (τ, ι) → (0.5 + 1)/3.5 ≈ 0.43 and (τ, ν) → 2/3.5 ≈ 0.57. Based on the results, the annotation (τ, ν) with the highest confidence score of 0.57 is taken as the result. In this way, we can greatly alleviate the "subjectivity" problem by retaining annotations with high confidence. According to our statistics, the averaged confidence score of each DTU annotation is around 0.73.
Data Details. The annotated corpus contains 385 news articles (7962 DTUs) from RST-DT (Carlson and Marcu, 2001). We annotate 4122 topic links corresponding to 1757 topic chains in the corpus, and the chain length distribution is presented in Table 1. The distribution of chain lengths is uneven, and most chains have fewer than 5 topic arcs. For supervised learning, we have divided the dataset into three parts (the test corpus is consistent with that of RST-DT), as shown in Table 2. On the test corpus, annotation consistency reaches an averaged Cohen's kappa value of 0.72. Concretely, we compare three groups of manual annotations on DTUs with each other for kappa value calculation and report the average score. The data and code are published at https://github.com/NLP-Discourse-SoochowU/DTCP.
Following previous work, we also fine-tuned the pre-trained language model parameters during the training process.
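The confidence scoring described under Annotation Confidence can be reproduced with a few lines of code. The weights 0.5 (topic model) and 1 (each human annotator) come from the text; the function names and the `min_confidence` filtering cutoff are assumptions made for this sketch, since the exact filtering threshold is not stated here.

```python
from collections import defaultdict
from typing import Dict, Hashable, Optional, Sequence, Tuple

MODEL_WEIGHT = 0.5   # confidence assigned to the pre-trained topic model (from the text)
HUMAN_WEIGHT = 1.0   # confidence assigned to each human annotator (from the text)


def confidence_scores(
    model_choice: Hashable, human_choices: Sequence[Hashable]
) -> Dict[Hashable, float]:
    """Normalized confidence of every distinct annotation proposed for one DTU."""
    weights: Dict[Hashable, float] = defaultdict(float)
    weights[model_choice] += MODEL_WEIGHT
    for choice in human_choices:
        weights[choice] += HUMAN_WEIGHT
    total = MODEL_WEIGHT + HUMAN_WEIGHT * len(human_choices)
    return {choice: weight / total for choice, weight in weights.items()}


def select_annotation(
    model_choice: Hashable,
    human_choices: Sequence[Hashable],
    min_confidence: float = 0.0,  # hypothetical cutoff, not taken from the paper
) -> Optional[Tuple[Hashable, float]]:
    """Return the most confident annotation, or None if it falls below the cutoff."""
    scores = confidence_scores(model_choice, human_choices)
    best = max(scores, key=scores.get)
    if scores[best] < min_confidence:
        return None
    return best, scores[best]


if __name__ == "__main__":
    # The worked example from the text: the model proposes iota,
    # the three annotators propose nu, iota, and nu.
    choice, score = select_annotation("iota", ["nu", "iota", "nu"])
    print(choice, round(score, 2))  # nu 0.57
```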

Baseline
For the convenience of calculation, a zero-initialized vector $u_z$ is added at the end of the DTU sequence for the tail DTUs of the topic chains or the isolated DTUs to point to, obtaining $U = (u_1, \ldots, u_{k-1}, u_z)$. For dependency parsing, we simply apply a bilinear function between $U$ and its duplicate, as follows:

$$s = U_{\alpha}^{\top} W U_{\beta}$$

where $U_{\alpha}$ and $U_{\beta}$ are $(D \times k)$ matrices representing $U$ and its duplicate, $W \in \mathbb{R}^{D \times D}$ denotes the parameters of the bilinear term, and $s \in \mathbb{R}^{k \times k}$ refers to the scores for each DTU upon its candidate successor DTUs. The detailed system configuration is presented in the Appendix.
3 The pre-trained models are borrowed from https://huggingface.co/transformers.
We measure the micro-averaged F1 scores of both topic links and chains for performance, and we do not take isolated DTUs into consideration to avoid overestimating performance. For human performance, we asked 5 other researchers majoring in human language analysis to manually annotate the test set and took the averaged F1 scores as human performance. Experimental results in Table 3 show that fine-tuning the contextualized BERT model achieves performance close to human level. By observing the model outputs (sampled in the Appendix), we find that the automatically parsed chain structures are highly consistent with the manual annotations, which indicates the high reliability of our corpus. Notably, the obtained system has good generalization and robustness, and can be easily migrated to other NLP tasks for DTC structure incorporation.
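The bilinear scoring in the Baseline section can be illustrated with a small NumPy sketch. It assumes the DTU vectors are already available as the columns of a (D × k) matrix whose last column is the zero sentinel u_z. The greedy argmax decoding and the restriction of candidate successors to later DTUs are assumptions made for illustration, not the documented decoding procedure, and the function names are hypothetical.

```python
import numpy as np


def bilinear_scores(U: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Score every (DTU, candidate successor) pair as U^T W U.

    U: (D, k) matrix whose columns are DTU vectors; the last column is the
       zero sentinel u_z that tail or isolated DTUs point to.
    W: (D, D) parameters of the bilinear term.
    Returns a (k, k) matrix where entry [i, j] scores j as successor of i.
    """
    return U.T @ W @ U


def decode_successors(scores: np.ndarray) -> list:
    """Greedy decoding (an assumption): each real DTU picks its best later candidate.

    Returns one entry per real DTU, using -1 when the sentinel wins, which
    matches the (i, j) index-pair convention where j = -1 marks an ending unit.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[0]
    sentinel = k - 1
    successors = []
    for i in range(k - 1):                 # the sentinel itself has no successor
        row = scores[i].copy()
        row[: i + 1] = -np.inf             # assumed: successors come later in the text
        j = int(np.argmax(row))
        successors.append(-1 if j == sentinel else j)
    return successors


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, num_dtus = 8, 5
    U = np.concatenate([rng.normal(size=(D, num_dtus)), np.zeros((D, 1))], axis=1)
    W = rng.normal(size=(D, D))
    print(decode_successors(bilinear_scores(U, W)))
```

With a sentinel that stays exactly zero, its column of scores is zero, so under this sketch a DTU points to the sentinel whenever all of its later candidates score below zero.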

Conclusion
In this research, we explored how fine-grained topics emerge, evolve, and disappear within an article. To address the lack of data, we built an English DTC corpus through a two-step annotation method, and filtered out the annotations with low confidence scores to ensure the high reliability of the corpus. During annotation, we found that each annotated topic chain does provide relatively low-noise information about a certain aspect of the article and that the complete DTC structure can well describe the overall vein of topics in an article. With this in mind, we introduced a simple and robust baseline system, and the parsing model we trained can be straightforwardly harnessed in downstream topic-sensitive NLP tasks to boost performance. It is worth mentioning that we annotated the WSJ articles in the RST-DT corpus also to allow discourse researchers to explore the potential correlation between RST- and DTC-style discourse analysis in future work.
Inc. said it downgraded its rating to B-2 from Ba-3 on less than $20 million of this thrift's senior subordinated notes.
[u7] The rating concern said Franklin's "troubled diversification record in the securities business" was one reason for the downgrade, citing the troubles at its L.F. Rothschild subsidiary and the possible sale of other subsidiaries. "They perhaps had concern that we were getting out of all these," said Franklin President Duane H. Hall. "I think it was a little premature on their part."
[Figure: topic chain diagram for wsj_2375, DTUs u1 to u7]
[u7] MedChem said the court's ruling was issued as part of a "first-phase trial" in the patent-infringement proceedings and concerns only one of its defenses in the case.
[u8] It said it is considering "all of its options in light of the decision, including a possible appeal." The medical-products company added that it plans to "assert its other defenses" against Pharmacia's lawsuit, including the claim that it hasn't infringed on Pharmacia's patent.
[u9] MedChem said that the court scheduled a conference for next Monday to set a date for proceedings on Pharmacia's motion for a preliminary injunction.
[Figure: topic chain diagram for wsj_2336, DTUs u1 to u9]
B.4.
[u1] ALBERTA ENERGY Co., Calgary, said it filed a preliminary prospectus for an offering of common shares.
[u2] The natural resources development concern said proceeds will be used to repay long-term debt, which stood at 598 million Canadian dollars (US$510.6 million) at the end of 1988.
[u3] The company plans to raise between C$75 million and C$100 million from the offering, according to a spokeswoman at Richardson Greenshields of Canada Ltd., lead underwriter.
[u8] Allied Capital is a closed-end management investment company that will operate as a business development concern.
[Figure: topic chain diagram for wsj_0607, DTUs u1 to u8]