End-to-End Argument Mining over Varying Rhetorical Structures

Rhetorical Structure Theory implies no single discourse interpretation of a text, and the limitations of RST parsers further exacerbate inconsistent parsing of similar structures. Therefore, it is important to take into account that the same argumentative structure can be found in semantically similar texts with varying rhetorical structures. In this work, the differences between paraphrases within the same argument scheme are evaluated from a rhetorical perspective. The study proposes a deep dependency parsing model to assess the connection between rhetorical and argument structures. The model utilizes rhetorical relations; RST structures of paraphrases serve as training data augmentations. The method allows for end-to-end argumentation analysis using a rhetorical tree instead of a word sequence. It is evaluated on the bilingual Microtexts corpus, and the first results on fully-fledged argument parsing for the Russian version of the corpus are reported. The results suggest that argument mining can benefit from multiple variants of discourse structure.


Introduction
The goal of argument mining is to automatically identify the premises, claims, and conclusions in an argument. Another field of NLP aiming to recognize structure in a complex text is discourse parsing. It involves identifying the author's point of view, the central idea, and the relations between discourse units. Rhetorical Structure Theory (Mann and Thompson, 1988) depicts text structure as a tree spanning the entire text, with rhetorical relations connecting adjacent text spans from elementary discourse units (EDUs) to paragraphs. Many efforts (Azar, 1999; Villalba and Saint-Dizier, 2012; Green, 2010; Peldszus and Stede, 2016; Stede et al., 2016; Accuosto and Saggion, 2019) have been devoted to finding correlations between the two structure descriptions. The studies examine a single rhetorical parsing result or a single manual annotation for each text. However, the same argumentative structure can be found in semantically similar texts with varying rhetorical structures, especially when retrieved by automatic parsing. This must be taken into account when probing discourse against argumentation.
According to Morey et al. (2017), the human baseline score on the news-domain RST-DT (Carlson et al., 2001) benchmark is 55.0% Parseval F1 for gold segmentation. An analyzer's inevitable mispredictions exacerbate inconsistent parsing of similar structures. Interpreting discourse accurately may require sophisticated skills, such as reasoning over general knowledge and assessing the subjective significance of particular statements. End-to-end discourse tree prediction recently achieved 50.1% F1 on the RST-DT corpus (Liu et al., 2021). Discourse parsing is also significantly affected by domain shift. For an isolated subtask of RST relation classification for pairs of adjacent EDUs in news, academic texts, TED talks, Reddit posts, and fiction (annotated in the GUM RST corpus (Zeldes, 2017)), Atwell et al. (2022) report an averaged transfer error of 60%. Liu and Zeldes (2023) demonstrate that unlabeled RST tree construction performance degrades significantly when training on the WSJ-only RST-DT corpus and testing on the multidomain GUM: it degrades by ∼11 points on average for spans only and by ∼16 with nuclearities attached. These gaps are halved when testing only on the Wikinews-sourced news part of GUM.
We argue that the analysis of the correlations between RST and argumentation is biased by the use of a single rhetorical annotation. These correlations can therefore be better assessed by using multiple rhetorical annotations of the same argumentative structures. In this work, we propose a simple neural model, Discourse-driven Biaffine Parser (DBAP), to estimate the utility of labeled rhetorical structure for argument mining on short argumentative texts. We use the Argumentative Microtexts corpus proposed by Peldszus and Stede (2015a). In this corpus, an argumentative text is seen as a hypothetical dialectical exchange between the author, who introduces and defends their claim, and their opponent. The argumentation can be represented by a graph with nodes corresponding to propositions expressed in textual segments, and edges indicating various supporting and attacking moves. We obtain two RST structures for each document by back translating over the parallel corpus of argument annotations. Then, we use the predicted RST structures in biaffine dependency parsing to estimate the general effect of rhetorical features.
To the best of our knowledge, this is the first end-to-end argument parser trained on the small corpus of Argumentative Microtexts and the first application using multiple versions of rhetorical structures to explore the relationship between discourse and argumentation. We also report the first results on fully-fledged argumentation mining for the Russian version of the corpus.

Background and Related Work
A number of studies have examined the relationship between discourse and argumentation in monological texts. Azar (1999) suggests treating five relations of the original RST as argumentative: MOTIVATION, ANTITHESIS and CONCESSION, EVIDENCE, and JUSTIFY. According to the hypothesis, one discourse unit is expected to influence the reader in relation to the other discourse unit. Another investigation of the argumentativeness of rhetorical relations was carried out by Villalba and Saint-Dizier (2012). The regularities in expressing persuasive arguments in the support function through certain cases of the rhetorical relations ELABORATION, JUSTIFICATION, RESTATEMENT, and COMPARISON and in the attack function through CONTRAST are demonstrated by a thorough analysis of online textual reviews. Green (2010) combines some RST relations with the argumentative relations of Toulmin (1958) and Walton (2011) for a hybrid (ArgRST) manual annotation in a biomedical corpus of patient letters. In a later paper on the annotation of full-text biomedical research papers, Green (2015) concludes that in a text of arbitrary genre, argumentation and discourse coherence should be represented separately. A hybrid representation of both schemes can also be achieved by annotating the rhetorical trees with communicative actions (Galitsky et al., 2018) or enriching existing RST dependency annotations with an argumentative structure layer (Accuosto and Saggion, 2019).
The extended Microtexts corpus presented by Stede et al. (2016) allows for the exploration of correlations between discourse and argumentation. It includes manual RST, PDTB, and Segmented Discourse Representation annotation for 112 texts from the first version of the Microtexts corpus. They found, in particular, that 60% of the argumentation arcs match those in RST; that the REASON, CAUSE, and EVIDENCE RST relations are all most likely to match the support argumentation function; and that almost any RST relation can be found within an argumentative discourse unit (ADU). Peldszus and Stede (2016) use the same manual RST annotations to train an argument parser. They construct a structure aligner and train the evidence graph model (Peldszus and Stede, 2015b), but using discourse rather than lexical features. Such features include the absolute and relative position of the segment in the text, whether the segment has incoming/outgoing RST edges, the number of edges, and the corresponding relations; for subgraphs of length > 2, all chains of relations including the segment are also used. The best performance is achieved when considering a subgraph of depth 3. RST parsing was first used to analyze arguments in Microtexts by Hewett et al. (2019). The texts were analyzed with multiple earlier parsers, and the one proposed by Feng and Hirst (2014) was chosen based on a manual evaluation of the results. The features used in the classifiers are the number of DUs of higher and lower levels; the same for the preceding and following DUs; the distance to the parent node; and whether the segment is in a multinuclear relation. The proposed features insignificantly improved the argument analysis performance on the gold segmentation.
Earlier work examined a single expert annotation or the output of an early RST parser for each document, while our work focuses on applying modern rhetorical parsers to explore the discourse variation in short argumentative texts in English and Russian.

Methods
To analyze Argumentative Microtexts, we follow the classical Evidence Graphs approach of Peldszus and Stede (2015b), where the argumentation graphs are directly converted into dependency trees.
However, unlike the Evidence Graphs method, which infers the labeled argumentative structure from the results of complex cooperation between the structure, function, role, and central claim classifiers, our method is based on the direct prediction of dependencies between text spans; the roles and the central claim are derived automatically from the obtained dependencies through simple rules.
Biaffine Argument Parser. Each task we presently address is a dependency tree construction task. The terminal nodes in the tree can be handcrafted ADUs or elementary discourse units predicted by a discourse parser. In the latter case, an additional structural function is introduced to combine several elementary DUs into one argumentative DU. Given a sequence of n discourse units u_1, u_2, ..., u_n, elementary or argumentative, we first encode each discourse unit with CLS-pooling of a pretrained transformer into a vector

v_i = LM(u_i) ∈ R^(d_LM), (1)

and over the obtained representations we run the biaffine dependency parsing model proposed by Dozat and Manning (2016). In our model, the arc labels are argumentative functions, such as "support" or "attack". The central claim is encoded as an extra function, "cc", and it is the only function that is allowed to be assigned to the parentless node (the root). The additional root node, which is a fictional parent of the real tree root, is randomly encoded into a vector v_0. The matrix V = [v_0, v_1, ..., v_n] ∈ R^((n+1)×d_LM) is then passed through four feedforward layers to get the parent-wise and dependent-wise hidden representations for arcs and functions:

H^(arc-head) = FF^(arc-head)(V), H^(arc-dep) = FF^(arc-dep)(V), H^(fn-head) = FF^(fn-head)(V), H^(fn-dep) = FF^(fn-dep)(V). (2)

These are used to score each possible parent for each dependent with bilinear attention:

S^(arc) = H^(arc-dep) U H^(arc-head)⊤ + H^(arc-dep) b^(arc), (3)

where U and b^(arc) are trainable.
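The arc-scoring step can be sketched in NumPy as follows (toy dimensions and random stand-ins for the trained weights; the actual model uses a pretrained transformer encoder and an arc dimension of 100):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_lm, d_arc = 4, 8, 5            # 4 discourse units; toy dimensions

V = rng.normal(size=(n + 1, d_lm))  # row 0 stands for the fictional root vector v_0

# Feedforward projections into head-wise and dependent-wise arc spaces
H_head = np.maximum(V @ rng.normal(size=(d_lm, d_arc)), 0)
H_dep = np.maximum(V @ rng.normal(size=(d_lm, d_arc)), 0)

# Bilinear attention: score every (dependent, head) pair
U = rng.normal(size=(d_arc, d_arc))
b = rng.normal(size=(d_arc,))
S_arc = H_dep @ U @ H_head.T + (H_dep @ b)[:, None]  # (n+1, n+1)

# For each real DU, the highest-scoring column is its predicted parent
parents = S_arc[1:].argmax(axis=1)
```

In the trained parser, the arc scores are fed to a softmax over candidate heads and decoded into a tree; the greedy argmax above is only for illustration.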
Due to the fact that a statement's role is directly related to its function towards its parent, roles are not predicted in a learnable way. Instead, roles are inferred directly from dependencies. Since the predicted central claim is the proponent's claim by definition, we traverse the predicted function-labeled dependency tree, assigning the role ("pro" or "opp") to the visited node i with parent j as follows: node i receives the role of j if its function is supporting, and the opposite role if its function is attacking.

We examine the performance of two main methods. Biaffine Argument Parser (BAP) uses a biaffine dependency parser as described above. Discourse-driven Biaffine Argument Parser (DBAP) additionally takes into account the discourse relations in a rhetorical tree.
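The role-assignment traversal described above can be sketched as a small recursive helper (a hypothetical minimal implementation; `parents` and `funcs` encode a toy function-labeled dependency tree):

```python
def infer_roles(parents, funcs):
    """Derive pro/opp roles from a function-labeled dependency tree.
    parents[i]: parent index of node i (-1 for the central claim);
    funcs[i]:   argumentative function of the arc from i to its parent."""
    roles = {}

    def role_of(i):
        if i not in roles:
            if parents[i] == -1:            # central claim: proponent by definition
                roles[i] = "pro"
            elif funcs[i] == "attack":      # attacking a statement flips the role
                roles[i] = "opp" if role_of(parents[i]) == "pro" else "pro"
            else:                           # supporting keeps the parent's role
                roles[i] = role_of(parents[i])
        return roles[i]

    return [role_of(i) for i in range(len(parents))]

# toy tree: node 1 supports node 0, node 2 attacks node 1, node 3 attacks node 0
roles = infer_roles([-1, 0, 1, 0], ["cc", "support", "attack", "attack"])
# → ["pro", "pro", "opp", "opp"]
```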
Discourse-driven BAP. In order to incorporate rhetorical structures into argument parsing, we enhance the arc scores (3) with the corresponding discourse coefficients C^(RST) obtained from the rhetorical tree: S^(arc) = S^(arc) ⊙ C^(RST), where ⊙ denotes element-wise multiplication. The RST constituency trees are converted into RST dependencies.
In the simplest case, discourse coefficients are predicted from the n × n binary adjacency matrix A^(RST-adj), where a^(RST-adj)_ij = 1 if there is a discourse relation going from discourse unit i to the nucleus DU j:

C^(RST) = σ(θ A^(RST-adj) + b^(RST)),

where θ and b^(RST) are trainable scalar parameters controlling the effect of any discourse relation on the arc scores.
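In NumPy terms, this simplest variant can be sketched as follows (toy values; `theta` and `b` stand in for the scalars that are learned jointly with the parser):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
S_arc = rng.normal(size=(n, n))   # biaffine arc scores for n discourse units

A = np.zeros((n, n))              # binary RST dependency adjacency matrix
A[1, 0] = A[2, 0] = A[3, 2] = 1   # arcs pointing from satellites to nuclei

theta, b = 1.5, 1.0               # stand-ins for the trainable scalars
C = np.maximum(theta * A + b, 0)  # ReLU keeps the coefficients non-negative
S_scaled = S_arc * C              # element-wise rescaling of the arc scores
```

With these toy values, arcs backed by a discourse relation are scaled by θ + b = 2.5, while all remaining candidate arcs are scaled by b = 1.0.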
The type of rhetorical relation between the two DUs should also be considered when learning discourse coefficients, since some relations may not reflect the argumentative structure. This is accomplished by encoding the rhetorical label of each possible arc into a scalar value with an additional trainable layer. For this, we represent the labeled rhetorical dependency tree as the n × n × k adjacency matrix A^(RST-full), with a^(RST-full)_ij being the one-hot encoded rhetorical relation going from discourse unit i to the nucleus j. The discourse coefficients are then computed as

C^(RST) = σ(A^(RST-full) Θ + b^(RST)),

where Θ contains the trainable weights of specific rhetorical relations. As the activation function σ we use ReLU to prevent negative coefficients. Finally, it is important to consider that for certain discourse relations the nuclearity-defined RST arc direction may contradict the direction of the argument. For this case, we also examine the inverted rhetorical relations:

C^(RST) = σ(A^(RST-full) Θ + Ã^(RST-full) Θ^(inv) + b^(RST)),

where Ã^(RST-full) denotes A^(RST-full) with its first two dimensions transposed and Θ^(inv) contains the trainable weights of the inverted relations. Apart from penalizing predictions that contradict argumentative rhetorical relations, this also rewards inverting discourse relations that naturally oppose the argument (e.g., PREPARATION).
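A sketch of the relation-labeled variant with inverted arcs (toy dimensions; `Theta`, `Theta_inv`, and `b` are illustrative stand-ins for the trainable parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 4, 3                        # 4 DUs, 3 rhetorical relation types
S_arc = rng.normal(size=(n, n))

A = np.zeros((n, n, k))            # one-hot relation label on each RST dependency arc
A[1, 0, 0] = 1                     # e.g. relation type 0 from DU 1 to nucleus DU 0
A[3, 2, 1] = 1                     # e.g. relation type 1 from DU 3 to nucleus DU 2

Theta = rng.normal(size=(k,))      # per-relation weights
Theta_inv = rng.normal(size=(k,))  # weights for the inverted (transposed) relations
b = 1.0

# ReLU(A·Theta + Ã·Theta_inv + b): each arc is weighted by its relation type,
# and its mirror arc by the corresponding inverted-relation weight
C = np.maximum(A @ Theta + A.transpose(1, 0, 2) @ Theta_inv + b, 0)
S_scaled = S_arc * C
```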

Collecting the Structure Variations
RST annotations are known to have low inter-annotator agreement due to the ambiguity of discourse. Differences in annotation, magnified by the intrinsic limitations of statistical models in language understanding, lead to the unstable behavior of rhetorical analyzers. To identify discourse variations, we use paraphrases of the annotated ADUs.

Semi-automated Back Translation
As pointed out by Da Cunha and Iruskieta (2010), the use of translation strategies has a noticeable impact on rhetorical structures.In order to paraphrase, we use back translation over a parallel corpus of argumentative annotation.
In the Russian-language version of the Argumentative Microtexts, both parts of the original corpus have been manually translated from English into Russian ADU by ADU (Fishcheva and Kotelnikov, 2019). It is a literary translation and often does not correspond to the original in the number of clauses and sentences in an ADU. Such paraphrases introduce pronounced differences in rhetorical structure from the original. They also cause an unstable parser to change its prediction most significantly. In order to get different retellings of the same argumentative structures, we additionally obtained literal machine translations, human English → Russian and human Russian → English, preserving the original ADU boundaries. We use the recent multilingual NLLB model nllb-200-distilled-1.3B (Costa-jussà et al., 2022) for both directions, achieving 31.6% BLEU in En → Ru and 29.2% BLEU in Ru → En translation measured against the handcrafted argumentative texts.
Table 1 shows an example of the resulting paraphrase for a simple argumentative structure. Figure 1 shows that ADUs #2-5 support the central claim (#1) independently. The semi-automated back translation helps to rephrase the individual statements within an argument slightly (ADUs #1, #2, #4) or significantly (ADUs #3, #5).

Analyzing Paraphrases from a Discourse Perspective
In this study, we employ the recent end-to-end RST parsers for English (Zhang et al., 2021) and Russian (Chistova et al., 2020).
First, we assess the diversity of rhetorical structures, guided by the gold ADU segmentation in the corpus. Figure 2 illustrates four variations of the rhetorical structure for the paraphrases obtained for the example in Table 1, assuming that the leaves of the discourse tree are the annotated ADUs. None of the obtained RST trees matches the expert argument annotation (Figure 1), although in each variant the most nuclear discourse unit in the RST tree (ADU #1) naturally corresponds to the central claim in the argumentative structure. A comparison using Iruskieta et al. (2015)'s method reveals that the two variants predicted by the same parser for English (Figures 2a and 2b) have a Fleiss' Kappa of 0.06 for nuclearity annotation and -0.04 for constituency annotation. Nuclearity agreement for the two variants predicted by the same parser for Russian (Figures 2c and 2d) is 0.6, while constituency annotation agreement is 0.32. The agreement values are obtained with the RST-Tace tool proposed by Wan et al. (2019). The original trees from Figure 2 with EDUs intact are additionally shown in Appendix A, Figure 4; these illustrate how RST structure varies within individual ADUs.
Table 2 shows the pairwise Kappas for each language averaged over the corpus. Rarely, when an ADU does not entirely belong to an isolated RST discourse unit, its label is assigned to several DUs. According to the results, Fleiss' Kappa values yield moderate agreement for unlabeled tree construction (Constituent) and fair agreement for nuclearity and relation assignments. Coherence of nuclearity, the feature directly related to identifying the central idea in the text, is the lowest on average. The Constituent Kappa equals 1.0 in 22% of English and 18% of Russian text pairs. A perfectly identical rhetorical structure is found in only 4% of text pairs in English and 8% in Russian. According to the results, the chosen paraphrasing strategy helps to collect rhetorical structures with high variability.
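As a sketch of the agreement measure, a generic Fleiss' Kappa over per-item category counts can be computed as below (illustrative code; the values reported in the paper are produced by RST-Tace, not by this snippet):

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' Kappa. ratings[i, c] = number of raters assigning item i to
    category c; every row must sum to the same number of raters."""
    ratings = np.asarray(ratings, dtype=float)
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()
    # per-item observed agreement
    p_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # chance agreement from the marginal category proportions
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# two annotations of three spans, agreeing on every span → Kappa = 1.0
kappa = fleiss_kappa([[2, 0], [0, 2], [2, 0]])
```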

Experiments
We collected additional versions of discourse structures and now describe our experiments on the original and augmented training data. All the experiments are conducted on the first two of the ten 5-fold cross-validation splits from the experiments of Peldszus and Stede (2015b). Since there was no validation data in the original splitting, we leave 15% of the training data in each fold for validation. The training data is supplemented with the second, crowd-sourced part of the corpus introduced by Skeppstedt et al. (2018). Following related work on the dataset (Peldszus and Stede, 2015b, 2016; Skeppstedt et al., 2018), we use the simplified function set, where the "support", "example", and "link" functions are encoded as "support", while "rebut" and "undercut" are encoded as "attack". We leverage spaCy for feature extraction.
All the experiments involving pretrained language models were conducted with Microsoft/mDeBERTa_v3 (He et al., 2021), a multilingual model sufficient for both languages.

Experimental Setup
Each model is trained on an NVIDIA Tesla V100 GPU. On average, training takes 25 seconds per epoch for parsing on gold segmentation and 39 seconds per epoch for end-to-end parsing, with 30 to 75 training epochs in total (on the original training sets; the augmentation doubles the training data).
The hyperparameters are tuned on the development subset of the corresponding split. The Adam optimizer is used with a weight decay of 0.1 and a dropout rate of 0.2; β = (0.9, 0.9). We use a learning rate of 2e-5 for the language model, while the randomly initialized layers have a learning rate of 2e-6. The discourse coefficients are trained with a learning rate of 2e-2. The dimension of the arc representation is 100 and the dimension of the tag representation is 50. The maximum sequence length is set to 150 tokens and the batch size is 4.

5 https://spacy.io/. The models en_core_web_lg and ru_core_news_lg.

Evaluation
To evaluate argument tree parsing, in addition to the attachment scores (UAS, LAS), we use the evaluation metrics introduced by Peldszus and Stede (2015b). That is, we additionally report the macro-averaged F1 for central claim detection (cc), role assignment (ro), and function tagging (fu), and the F1 for positive attachment (at). To determine the statistical significance of pairwise comparisons, we perform a paired t-test.
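The attachment scores can be sketched as standard UAS/LAS over per-node (head, function) pairs (illustrative code with hypothetical inputs, not the evaluation script used in the experiments):

```python
def attachment_scores(gold, pred):
    """gold, pred: lists of (head, function) pairs, one per discourse unit.
    UAS counts correct heads; LAS additionally requires the correct function."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

gold = [(-1, "cc"), (0, "support"), (0, "attack"), (2, "support")]
pred = [(-1, "cc"), (0, "attack"), (0, "attack"), (1, "support")]
uas, las = attachment_scores(gold, pred)  # heads: 3/4 correct; labeled: 2/4
```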

Baseline
We run the baseline MST model introduced by Peldszus and Stede (2015b). It predicts an argumentative dependency tree over the given discourse units from bags of words and bigrams, bags of discourse connectors and their associated relations, POS tags, punctuation, Brown clusters (Brown et al., 1992) for words and bigrams, and the occurrence of ADUs in the same sentence. Table 3 shows the baseline results. Due to the lack of a discourse connector vocabulary with annotated discourse relations for Russian, no marker- or relation-related features were used in the multilingual experiments (-Cues). The same applies to Brown clusters, which are not available for Russian (-BC). Excluding these features from the original model for English results on average in a 2.5% decrease in F1 for central unit detection and role assignment, 2.9% for function classification, and a 1.6% decrease in LAS. Regardless, excluding them is necessary to standardize experiments with multilingual data.

Does machine translation violate reasoning?
In Table 3, we additionally report the results on the paraphrases (En→Ru, Ru→En). The results on the machine translations are marginally better than on the original handcrafted data, except for identifying the central claim in English data and functions in Russian. The F1 scores for role and function identification, however, do not represent the quality of argumentation tree construction, because the role and function classes are imbalanced. Attachment scores are higher on paraphrases. A likely reason is that every translation step simplifies the argumentative markers. We conclude that the collected additional data is nearly as useful as the original.

Segmentation
In end-to-end argument parsing, the elementary discourse units (EDUs) are considered leaves of the argument tree. Whenever an ADU matches a subtree of multiple EDUs, we preserve the discourse relation structure by assigning a "same-arg" argumentative function to every intra-ADU relation (Figure 3). Adding the third function class did not change the architecture of the model.
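A simplified sketch of mapping an ADU-level tree onto EDU leaves (a hypothetical helper: for brevity it attaches every non-head EDU to its ADU's first EDU, whereas the parser keeps the intra-ADU RST arcs themselves and only relabels them "same-arg"):

```python
def edu_level_arcs(adu_of_edu, adu_parents, adu_funcs):
    """adu_of_edu[e]: ADU index of EDU e; adu_parents/adu_funcs: ADU-level tree
    (parent -1 marks the central claim). Returns (edu, parent_edu, function)."""
    head_edu = {}
    for e, a in enumerate(adu_of_edu):  # first EDU of each ADU acts as its head
        head_edu.setdefault(a, e)
    arcs = []
    for e, a in enumerate(adu_of_edu):
        if e == head_edu[a]:            # head EDU inherits the ADU's arc
            p = adu_parents[a]
            arcs.append((e, -1 if p == -1 else head_edu[p], adu_funcs[a]))
        else:                           # intra-ADU link
            arcs.append((e, head_edu[a], "same-arg"))
    return arcs

# ADU 0 spans EDUs 0-1 (central claim); ADU 1 = EDU 2 and supports it
arcs = edu_level_arcs([0, 0, 1], [-1, 0], ["cc", "support"])
# → [(0, -1, "cc"), (1, 0, "same-arg"), (2, 0, "support")]
```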

Results and Discussion
Gold Segmentation. The results are shown in Table 4. The models with discourse perform significantly better than those without, while the ones without discourse perform better than the baselines (Table 3). Although performance increases with textual paraphrases of the same arguments, adding discourse structure variations over the same segmentation can hinder performance by introducing noise into the training data.
Appendix B presents the interpretation of the DBAP models trained on rhetorical structures.
Joint Segmentation and Parsing. Table 5 shows the end-to-end parsing performance on the same test data when considering EDUs as terminal nodes. In order to compare the BAP and DBAP models consistently, the "same-arg" function is excluded from evaluation. While comparing BAP with DBAP in this setting is still not entirely fair, the BAP results can be viewed as a non-structural baseline. When the second discourse structure variant is added, the training data provides variability in the representation of the connections between the same leaf nodes. This helps to find general discourse patterns, resulting in better performance on the original test data in English. No improvement has been observed on the Russian data, which highlights the differences between the nuclearity interpretations in the two RST corpora. As a result of merging several relations into one, the nuclearity definition for some relations in the RST corpus for Russian differs from the original theoretical definition. In the CAUSE-EFFECT and PURPOSE relations, the nucleus always implies the logical effect, regardless of the author's intention. This affects the adequacy of the converted dependency discourse tree.

A Examples of RST Predictions
Figure 4 illustrates the predictions of the RST parsers with EDUs intact for four variants of a text example.

B Learned Discourse Coefficients by Relation
We visualize the discourse coefficients C (RST) in the trained DBAP models in Figure 5.In light of the results, we divide rhetorical relations into four categories:

B.1 The Argument's Companion
The RST relations whose presence multiplies the likelihood of an argumentative function.

Figure 2 :
Figure 2: Four RST structure variants predicted for the document micro_k002, reduced to the relations between argumentative discourse units.

Figure 3 :
Figure 3: Argument tree representation in the end-to-end parser, micro_k002:En. See Figures 1 and 4a for reference.

Figure 4 :
Figure 4: Four full discourse structure variants collected for the text micro_k002.

Table 1 :
Example of a text paraphrase by argumentative discourse unit (ADU), micro_k002.

Table 2 :
Agreement of discourse parsing across different versions of a single text in the same language (mean ± std). The EDUs are reduced according to the gold ADU segmentation. Fleiss' Kappa measures are computed following Iruskieta et al. (2015)'s method.

Table 4 :
Performance of the biaffine argument parsers on the original and augmented data (gold segmentation). Results that differ significantly from those of the non-augmented BAP are marked with * (p < 0.05) or ** (p < 0.005).

Table 5 :
Test results of the end-to-end biaffine argument parsers. Results that differ significantly from those of the non-augmented BAP are marked with * (p < 0.05) or ** (p < 0.005).

In the parsers trained on RST-DT, there are 17 coarse-grained relations which correspond to 78 different types of fine-grained RST relations. Contrast, Concession, and Antithesis are treated by them as a single relation CONTRAST.

• CAUSE⁻¹, CAUSE. Despite the fact that causal relations directly reflect the argumentation, the predicted rhetorical nucleus often contradicts the direction of the argument.
Original text in Russian (Ru): [In fact, it would be justified]1 [if all German universities charged tuition fees.]2 [As long as it is guaranteed that the funds really benefit the universities directly,]3 [we can continue to regard this as social justice.]4 [In any case,]5 [the question of further training must be decided in advance.]6 [You can always take a student loan]7 [or get a scholarship.]8 [However, it is unfair to oblige people who do not belong to scientific circles to pay for someone else's education by collecting additional taxes.]9