Exploring Discourse Structures for Argument Impact Classification

Discourse relations among arguments reveal the logical structure of a debate conversation. However, no prior work has explicitly studied how the sequence of discourse relations influences a claim’s impact. This paper empirically shows that the discourse relations between two arguments along the context path are essential factors for identifying the persuasive power of an argument. We further propose DISCOC to inject and fuse sentence-level structural discourse information with contextualized features derived from large-scale language models. Experimental results and extensive analysis show that attention and gate mechanisms that explicitly model contexts and texts help the argument impact classification task defined by Durmus et al. (2019), and that discourse structures along the context path of the claim to be classified further boost performance.


Introduction
Identifying the impact and persuasiveness of an argument in a conversation is an interesting natural language understanding problem. Previous work has shown that many factors affect persuasiveness prediction, ranging from textual and argumentation features (Wei et al., 2016) and style factors (Baff et al., 2020) to the traits of the source or audience (Durmus and Cardie, 2018, 2019; Shmueli-Scheuer et al., 2019). Discourse relations among arguments, such as Restatement and Instantiation, reveal the logical structure of a debate conversation, so it is natural to use discourse structure to study argument impact. Durmus et al. (2019) initiated a new study of the influence of discourse contexts on determining argument quality by constructing a new dataset, Kialo. As shown in Figure 1, it consists of arguments, impact labels, and stances, where every argument is located in an argument tree for a controversial topic. They argue that contexts reflect the discourse of arguments and conduct experiments that utilize historical arguments. They find that BERT with flat context concatenation performs best, but discourse structures are not easily captured by this method because it is difficult to reflect implicit discourse relations through the surface form of two arguments (Prasad et al., 2008; Lin et al., 2009; Xue et al., 2015; Lan et al., 2017; Varia et al., 2019). Therefore, there remains a gap in studying how discourse relations and their sequential structures or patterns affect argument impact and persuasiveness prediction.
Figure 1 (partial caption): Stances, impact labels, and discourse relations are annotated in orange, red, and violet, respectively.
In this paper, we acquire discourse relations for argument pairs with the state-of-the-art classifier for implicit discourse relations. We then train a BiLSTM whose input is the sequence of discourse relations between adjacent arguments to predict the last argument's impact, and its performance is comparable to that of a BiLSTM on raw text. This indicates that the sequence of discourse relations is one of the essential factors for identifying the persuasive power of an argument. Based on this intuition, we further propose a new model called DISCOC (Discourse Context Oriented Classifier) to explicitly produce discourse-dependent contextualized representations, fuse context representations over long distances, and make predictions. With simple finetuning, our model beats its backbone RoBERTa by 1.67% and the previous best model BERT by 2.38%. Extensive experiments show that DISCOC improves steadily when longer context paths with discourse structures, e.g., stances and discourse relations, are provided. In contrast, encoders with full-range attention struggle to capture such interactions, and narrow-range attention cannot handle complex contexts and can even be poisoned by them.
Our contributions can be highlighted as follows:
1. To the best of our knowledge, we are the first to explicitly analyze the effect of discourse among contexts and an argument on persuasiveness.
2. We propose a new model called DISCOC that uses attention to imitate recurrent networks for sentence-level contextual representation learning.
3. Fair and extensive experiments demonstrate significant improvements, and detailed ablation studies demonstrate the necessity of each module.
4. Finally, we discover distinct discourse relation path patterns using machine learning and conduct consistent case studies.

Overview
The Kialo dataset was collected by Durmus et al. (2019) and consists of 47,219 argument claim texts from kialo.com for 741 controversial topics, together with the corresponding impact votes. Arguments are organized as tree structures, where a tree is rooted in an argument thesis, and each node corresponds to an argument claim. Along a path of an argument tree, every claim except the thesis was made to either support or oppose its parent claim and propose a viewpoint. As shown in Figure 1, an argument tree is rooted at the thesis "Physical torture of prisoners is an acceptable interrogation tool." There is one claim that supports this thesis (S1, in green) and one that opposes it (O2, in fuchsia). Moreover, S1 is supported by its child claim S2 and opposed by O1, and S3 holds the same viewpoint as O2.

Claim and Context Path
As each claim was put in view of all its ancestral claims and surrounding siblings, the audience evaluated the claim based on how timely and appropriate it is. Therefore, the context information is of most interest in the Kialo dataset. We define a claim, denoted as $C$, as the argumentative and persuasive text that expresses an idea to the audience, and a context path of a claim of length $l$ as the path from an ancestor claim to its parent claim, denoted as $(C_0, C_1, \dots, C_{l-1})$, where $C_{l-1}$ is the parent of $C$. For simplicity, we may use $C_l$ instead of $C$ when this causes no ambiguity. The longest path of $C$ starts from the thesis. Statistically, the average length of the longest paths is 3.5.
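The definition of a context path can be illustrated with a minimal sketch. The `parent_of` mapping, the `context_path` helper, and the `max_length` parameter are hypothetical names introduced here for illustration; the toy tree only mirrors the claim ids mentioned for Figure 1 and is not the dataset's actual format.

```python
from typing import Dict, List, Optional


def context_path(
    claim: str,
    parent_of: Dict[str, Optional[str]],
    max_length: Optional[int] = None,
) -> List[str]:
    """Return the context path (C_0, ..., C_{l-1}) of `claim`, oldest ancestor first."""
    path: List[str] = []
    node = parent_of.get(claim)
    # Walk upwards until the thesis (which has no parent) has been added.
    while node is not None:
        path.append(node)
        node = parent_of.get(node)
    path.reverse()  # oldest ancestor first, parent last
    if max_length is not None:
        # Keep only the `max_length` nearest ancestors, so the parent stays last.
        path = path[-max_length:] if max_length > 0 else []
    return path


# Hypothetical toy tree: each claim id maps to its parent claim id.
parent_of = {"thesis": None, "S1": "thesis", "O2": "thesis", "S2": "S1", "O1": "S1"}

print(context_path("S2", parent_of))                # longest path, starts at the thesis
print(context_path("S2", parent_of, max_length=1))  # only the parent claim
```

With this representation, the longest context path of a claim always starts at the thesis, and shorter paths are suffixes of it that end at the parent.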

Argument Stance
In a controversial topic, each argument claim except the thesis has a stance: it either supports or opposes the thesis or its parent claim. In Kialo, users must directly add a stance tag (Pro or Con) to show their agreement or disagreement with the chosen parent argument when they post their arguments. We use $s_i$ to denote the stance indicating whether $C_i$ supports or opposes its parent $C_{i-1}$ for $i \geq 1$. The statistics of these stances are shown in Table 1.

Impact Label
After reading claims as well as the contexts, users may agree or disagree with these claims. The impact vote for each argument claim is provided by users, who can choose from 1 to 5. Durmus et al. (2019) categorize votes into three impact classes (Not Impactful, Medium Impact, and Impactful) based on the agreement and the number of valid votes to reduce noise. The overall distribution is shown in Table 1. The argument impact classification task is to predict the impact label $y$ of $C$ given the claim text $C$ and its corresponding context path $(C_0, C_1, \dots, C_{l-1})$.

3 Discourse Structure Analysis

3.1 Argument Impact from the Perspective of Discourse
As paths under a controversial topic are strongly related to Comparison (e.g., Contrast), Contingency (e.g., Reason), Expansion (e.g., Restatement), and Temporal (e.g., Succession) discourse relations (Prasad et al., 2008), we model the discourse structures from the view of discourse relations. The first step is to acquire discourse relation annotations. BMGF-RoBERTa (Liu et al., 2020) is the state-of-the-art model for detecting implicit discourse relations from raw text. In the following experiments, we use it as our annotation model to predict discourse relation distributions for each adjacent claim pair. Specifically, for a given argument claim $C_l$ and its context path $(C_0, C_1, \dots, C_{l-1})$, we denote $p_{\mathrm{disco}}(C_l) = (r_1, r_2, \dots, r_l)$ as a discourse relation path such that $r_i \in R$ indicates the discourse relation between $C_{i-1}$ and $C_i$ for $i \geq 1$. In this work, we adopt the 14 discourse relation senses in the CoNLL-2015 Shared Task (Xue et al., 2015) as $R$. We also define the corresponding distributed discourse relation path as $p_{\mathrm{dist}}(C_l) = (d_1, d_2, \dots, d_l)$ such that $d_i = F(C_{i-1}, C_i)$ is the discourse relation distribution between claims $C_{i-1}$ and $C_i$ ($i \geq 1$) predicted by a model $F$. In our experiments, $F$ is BMGF-RoBERTa.
Eight of the 14 relations appear in the predictions, and the statistics of the 7 frequent predictions are shown in Table 2.
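The two path notions can be made concrete with a small sketch. It assumes, as an illustration only, that the discrete relation $r_i$ is taken as the most probable sense under $d_i$; the function names, the three-relation inventory, and the `toy_predict` stand-in for the classifier $F$ are hypothetical and do not reproduce BMGF-RoBERTa.

```python
from typing import Callable, List, Sequence

Distribution = List[float]


def distributed_relation_path(
    claims: Sequence[str],
    predict: Callable[[str, str], Distribution],
) -> List[Distribution]:
    """p_dist: one predicted distribution d_i per adjacent pair (C_{i-1}, C_i)."""
    return [predict(claims[i - 1], claims[i]) for i in range(1, len(claims))]


def discrete_relation_path(
    p_dist: Sequence[Distribution],
    relations: Sequence[str],
) -> List[str]:
    """p_disco under the assumption that r_i is the arg max of d_i."""
    return [relations[max(range(len(d)), key=d.__getitem__)] for d in p_dist]


# Hypothetical toy inventory (the paper uses 14 senses) and toy classifier output.
TOY_RELATIONS = ["Reason", "Contrast", "Restatement"]


def toy_predict(parent: str, child: str) -> Distribution:
    table = {("C0", "C1"): [0.7, 0.2, 0.1], ("C1", "C2"): [0.1, 0.6, 0.3]}
    return table[(parent, child)]


p_dist = distributed_relation_path(["C0", "C1", "C2"], toy_predict)
p_disco = discrete_relation_path(p_dist, TOY_RELATIONS)
print(p_disco)
```

A path with $l$ context claims plus the target claim yields exactly $l$ distributions, one per parent-child pair.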
As discourse contexts would affect the persuasive power of claims, we first examine the correlations between impacts and stances as well as the correlations between impacts and discourse relations, illustrated in Figure 2. From the label distribution and correlations, we find some clear trends: 1) Stances have little influence on argument impact, but discourse relations do. The correlations indicate that it is the content rather than the standpoint that contributes to potential impact; 2) It is a smart choice to show examples to convince others because Instantiation is more relevant to Impactful than any other relation; 3) Similarly, explaining is also helpful for making a voice stand out; 4) Restatement is also positively correlated with Impactful, so we can also share our opinions by paraphrasing others' viewpoints to command more attention. On the contrary, Chosen Alternative is a risky method because the audience may object.
Figure 2: Impact label distributions, the correlations between labels and stances, and the correlations between labels and discourse relations. Normalization is applied to the columns.
To investigate the role of discourse relations in impact analysis, we design a simple experiment in which a single-layer BiLSTM followed by a 2-layer MLP with batch normalization predicts the impact from the distributed discourse relation path $p_{\mathrm{dist}}(C_l)$. For comparison and analysis, we build another BiLSTM on the raw text. Each claim has [BOS] and [EOS] tokens to mark its boundaries, and we use 300-dimensional pretrained GloVe word embeddings (Pennington et al., 2014), which are kept fixed. We set different thresholds for context path lengths to control how many discourse relations or contexts are provided. As Figure 3 shows, discourse features yield comparable performance, especially when longer discourse paths are provided. In contrast, the model with raw text gets stuck in complex contexts.

Discourse Context Oriented Classifier
It is generally agreed that informative context can help understand the text to be classified. However, it is still unclear how to determine whether a context is helpful. One drawback of a broader context is increasing ambiguity, especially for argument context paths written by different users, as the results in Figure 3 show. Take the claims in Figure 1 as an example: S1 and O2 give two different consequences to support or oppose the thesis, and O1 objects to S1 with a contrasting conclusion. It is hard to build a connection between the thesis and O1 if S1 is not given, because it is challenging to connect "reveal desired information" with "interrogation tool" without the precondition "Torture can help force prisoners to reveal information". On the contrary, the thesis and S2 are still compatible, as S2 is also a kind of result. Hence, a recurrent model with a gating mechanism that depicts pair-wise relations and passes them on to the following texts makes more sense.
LSTM has gates that decide whether to remember or forget during encoding, but it cannot handle long-range information with limited memory. Recently, transformer-based encoders have shown remarkable performance in various complicated tasks. These models regard sequences as fully connected graphs to learn the correlations and representations for each token. It is assumed that transformers can learn, through back-propagation, whether two tokens are relevant and how strong the correlation is. Table 3 illustrates different possible ways to aggregate context information. Transformer (Vaswani et al., 2017) and BERT (Devlin et al., 2019) adopt full-range attention, while TransformerXL and XLNet regard historical encoded representations as memories to reuse hidden states. SparseTransformer (Child et al., 2019), in the opposite direction, stacks hundreds of layers by narrowing the attention scope through sparse factorization; information can still spread after propagation through several layers. Inspired by these observations, we design DISCOC (Discourse Context Oriented Classifier) to capture contextualized features with localized attention and to imitate recurrent models to reduce noise from long-distance context. As shown in Figure 4, DISCOC predicts the argument impact in three steps.

Table 3: Different attention mechanisms. The Memory attention freezes the historical representations so that gradients of $C_i$ would not propagate to the memory $(C_0, \dots, C_{i-1})$.

Adjacent Claim Pair Encoding
A difficult problem in such an argument claim tree is the noise in irrelevant contexts. A claim is connected to its parent claim through a supporting or opposing stance, but claims at long distances are not highly correlated. Based on this observation, DISCOC produces word-level representations by encoding claim pairs instead of the whole context.
Given a claim $C_l$ and its context path $(C_0, C_1, \dots, C_{l-1})$, all adjacent pairs are coupled together, i.e., $(C_0, C_1), \dots, (C_{l-1}, C_l)$. Each claim appears twice except the first and the last. Next, each pair $(C_{i-1}, C_i)$ is fed into the RoBERTa encoder to get contextualized word representations. $C_0$ and $C_l$ are also encoded separately so that each claim is encoded twice. We use $\overrightarrow{H}_i$ to denote the encoded word representations of $C_i$ when this claim is encoded with its parent $C_{i-1}$, or when it is computed alone as $C_0$. Similarly, $\overleftarrow{H}_i$ denotes the representations obtained when encoding $(C_i, C_{i+1})$, or when it is fed alone as $C_l$.
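The pairing scheme can be sketched as follows. The helper name `encoder_inputs` and the string claim ids are hypothetical; the sketch only enumerates which claims would be passed to the encoder together and does not call RoBERTa.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def encoder_inputs(claims: Sequence[str]) -> List[Tuple[str, ...]]:
    """C_0 alone, every adjacent pair (C_{i-1}, C_i), then C_l alone."""
    pairs = list(zip(claims[:-1], claims[1:]))
    return [(claims[0],)] + pairs + [(claims[-1],)]


claims = ["C0", "C1", "C2", "C3"]  # context path C0..C2 plus the target claim C3
inputs = encoder_inputs(claims)
# Count how many encoder inputs each claim takes part in.
counts = Counter(claim for group in inputs for claim in group)
print(inputs)
print(counts)
```

Every claim takes part in exactly two encoder inputs, which is what gives each $C_i$ both a forward and a backward representation.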
The encoding runs in parallel, but we still use the term phase for ease of exposition. In the 0-th phase, RoBERTa outputs $\overrightarrow{H}_0$. One particular relationship between a parent-child pair is the stance, so we insert a special token, [Pro] or [Con], between them. This makes the sentiment and viewpoint of the child claim more accurate. Discourse relations can also influence impact prediction, as reported in Section 3.1. However, discourse relations are not mutually exclusive, and the predictions from BMGF-RoBERTa are not perfectly precise. Thus, we use the relation distributions as weights to get sense-related embeddings over the 14 relations. In addition to the position embeddings and segment embeddings, we add $W_1 d_i$ for the parent and $W_2 d_i$ for the child, where $d_i$ is the predicted discourse relation distribution for $(C_{i-1}, C_i)$, and $W_1$ and $W_2$ are trainable transformations for parents and children. Hence, for $i \in \{1, 2, \dots, l\}$, RoBERTa outputs $\overleftarrow{H}_{i-1}$ and $\overrightarrow{H}_i$ from the concatenation of the two claims, where [CTX] is a special token that indicates the parent claim and distinguishes it from [CLS]. Its embedding is initialized as a copy of the [CLS] embedding but is updated independently. $\overleftarrow{H}_l$ is computed by self-attention with no context in the last phase. In the end, each claim $C_i$ has two contextualized representations, $\overleftarrow{H}_i$ and $\overrightarrow{H}_i$, with limited surrounding context information.
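To illustrate the sense-related embeddings, the sketch below computes $W d_i$ as a plain matrix-vector product, which amounts to a probability-weighted sum of one embedding column per relation. The helper name, the tiny 2-by-3 matrix, and the distributions are toy assumptions; in the model, $W_1$ and $W_2$ are trainable and there are 14 relations.

```python
from typing import List, Sequence


def sense_embedding(W: Sequence[Sequence[float]], d: Sequence[float]) -> List[float]:
    """Return W d, where W has shape (hidden_dim, num_relations)."""
    return [sum(w * p for w, p in zip(row, d)) for row in W]


# Hypothetical toy transformation: hidden_dim = 2, num_relations = 3.
# Column k of W_1 plays the role of the embedding of relation k.
W_1 = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 2.0],
]

soft = sense_embedding(W_1, [0.5, 0.25, 0.25])  # uncertain prediction: a mixture
hard = sense_embedding(W_1, [0.0, 0.0, 1.0])    # one-hot prediction: a single column
print(soft, hard)
```

Using the full distribution rather than a single predicted label lets an uncertain prediction contribute a blend of relation embeddings, which matches the motivation that relations are not mutually exclusive.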

Bidirectional Representation Fusion
As the claim representations $\{\overleftarrow{H}_i\}$ and $\{\overrightarrow{H}_i\}$ from RoBERTa are not bidirectional, we need to combine them and control which of them matters more. Gated fusion (Liu et al., 2020) has been shown to produce a better mixture than the combination of multi-head attention and layer normalization. We use it to maintain the powerful representative features and carry useful historical context information. In this fusion, MultHead is the multi-head attention operation (Vaswani et al., 2017) whose query is $\overleftarrow{H}_i$ and whose key and value are $\overrightarrow{H}_i$, $A_j$ is the fusion gate for the $j$-th word embedding, $[\cdots]$ is concatenation, $\odot$ is the element-wise product, and $W_a$ and $b_a$ are the trainable matrix and bias for fusion gating.
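A gated fusion consistent with these definitions could take the following form. This is a sketch under stated assumptions rather than the authors' exact formulation: the intermediate symbol $\hat{H}_i$, the output symbol $U_i$, the sigmoid $\sigma$, and the choice of which terms are concatenated and gated are all assumed here.

```latex
\begin{align}
\hat{H}_i &= \mathrm{MultHead}\big(\overleftarrow{H}_i,\ \overrightarrow{H}_i,\ \overrightarrow{H}_i\big) \\
A_j &= \sigma\big(W_a\,[\hat{H}_i,\ \overleftarrow{H}_i]_j + b_a\big) \\
U_i &= A \odot \hat{H}_i + (1 - A) \odot \overleftarrow{H}_i
\end{align}
```

Under this reading, the gate $A$ decides per word how much attended context from $\overrightarrow{H}_i$ replaces the original features in $\overleftarrow{H}_i$, so the output stays on the same scale as its inputs.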

Context Path Information Gathering
After extracting the sentence-level claim representations $u_0, u_1, \dots, u_l$, a transformer layer is used to gather longer-range context representations. The transformer layer includes a position embedding layer that provides sinusoidal positional embeddings, a gated multi-head attention layer, a feed-forward network, and a layer normalization. The position embedding layer in DISCOC differs from that in the vanilla Transformer because it generates position ids in reversed order, i.e., $l, l-1, \dots, 0$. The reversed order helps model contexts of variable length because the claim to be classified always has the same position embedding. We also use a gate to maintain the scale instead of a residual connection. The gated transformer can generate meaningful representations because each claim can attend to any other claim and to itself. It also fits the pair-wise encoding, which imitates recurrent networks to reduce the noise in irrelevant contexts and strengthen the correlations with the nearest context. For example, in Figure 1, S2 is predicted as a result of S1 (with a probability of 39.17%) and as a restatement (with a probability of 19.81%), and S1 is also a result of the thesis (with a probability of 70.57%). Consequently, S2 is highly relevant to the thesis as a potential result if "physical torture is acceptable", which DISCOC can capture. Finally, a 2-layer MLP with batch normalization is applied to $v_l$ of the last claim to predict its impact.
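The effect of reversed position ids can be seen in a short sketch. The function names are hypothetical, and the sinusoidal formula follows the common formulation from Vaswani et al. (2017); the embedding dimension used in the model is not assumed here.

```python
import math
from typing import List


def reversed_position_ids(num_claims: int) -> List[int]:
    """Position ids l, l-1, ..., 0 for claims C_0, ..., C_l."""
    return list(range(num_claims - 1, -1, -1))


def sinusoid_embedding(position: int, dim: int) -> List[float]:
    """Sinusoidal position embedding: sin on even indices, cos on odd indices."""
    embedding = []
    for k in range(dim):
        angle = position / (10000 ** (2 * (k // 2) / dim))
        embedding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return embedding


# The claim to be classified is always last and always receives position 0,
# regardless of how many context claims precede it.
for num_claims in (2, 4):
    ids = reversed_position_ids(num_claims)
    print(ids, sinusoid_embedding(ids[-1], dim=4))
```

Because the target claim is anchored at position 0, its position embedding is identical for short and long context paths, while each context claim is indexed by its distance to the target.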

Baseline Models
Majority. The baseline simply returns Impactful.

SVM. Durmus et al. (2019) created linguistic features for an SVM classifier, such as named entity types, POS tags, special marks, tf-idf scores for n-grams, etc. We report the result from their paper.
HAN. HAN computes document vectors hierarchically through encoding and aggregation. We replace its BiGRU with a BiLSTM for the sake of comparison. We also extend it with pretrained encoders and transformer layers.
Flat-MLMs. Pretrained masked language models, e.g., RoBERTa, learn word representations and predict masked words by self-attention. We use these encoders to encode the flat context concatenation.
Interval-MLMs. Flat-MLMs regard the context path as a whole segment and ignore the real discourse structures except for adjacency; e.g., the distances between two claims are missing. We borrow the idea from BERT-SUM (Liu and Lapata, 2019): segment embeddings of $C_i$ are assigned depending on whether the distance to $C_l$ is odd or even.

Context-MLMs. We also compare pretrained encoders with context masks. A context mask restricts the attention scope to the span from the previous claim to the next claim. That is, $C_i$ can attend to words in $C_{i-1}$ and $C_{i+1}$ besides its own if $1 \leq i < l$; $C_0$ can only attend to $C_0$ and $C_1$, and $C_l$ can only attend to $C_{l-1}$ and $C_l$.
Memory-MLMs. XLNet utilizes memory to extend the capability of self-attention to learn very long historical text information. We also extend Flat-MLMs under this setting.
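The segment assignment of Interval-MLMs and the attention scope of Context-MLMs can be sketched at the claim level. Both helper names are hypothetical, and two details are assumptions made for illustration: which parity maps to which segment id, and that every claim may also attend to its own words.

```python
from typing import List


def interval_segment_ids(num_claims: int) -> List[int]:
    """Segment id of C_i is the parity of its distance to the last claim C_l."""
    last = num_claims - 1
    return [(last - i) % 2 for i in range(num_claims)]


def context_mask(num_claims: int) -> List[List[bool]]:
    """mask[i][j] is True when words of C_i may attend to words of C_j."""
    return [
        [abs(i - j) <= 1 for j in range(num_claims)]
        for i in range(num_claims)
    ]


print(interval_segment_ids(4))
for row in context_mask(3):
    print(row)
```

In an actual encoder the claim-level mask would be expanded to the token level, and stacking several masked layers still lets information travel further than one claim, which is the side effect discussed in the results.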

Model Configuration and Settings
We use pretrained base models in DISCOC and the baselines. We follow the same finetuning setting: classifiers are optimized by Adam (Kingma and Ba, 2015) with a scheduler and a maximum learning rate of 2e-5. The learning rate scheduler consists of a linear warmup for the first 6% of steps and a linear decay for the remaining steps. For BiLSTM and HAN, the maximum learning rate is 1e-3. The hidden state dimension of linear layers, the number of hidden units of LSTM layers, and the projected dimension for attention are all 128. The number of attention heads is set to 8. Dropout is applied after each layer with a probability of 0.1. We pick the best context path length $l$ for each model by grid search from 0 to 5 on validation data, with a batch size of 32 and 10 epochs. Each model is run five times.

Argument Impact Classification
Table 4 shows the experimental results of different models. It is not surprising that neural models easily beat traditional feature engineering methods in overall performance, but linguistic features still bring the highest precision. We also observe a significant 3.49% improvement from aggregating context vectors in HAN-BiLSTM compared with the simple BiLSTM. This indicates that it is necessary to model contexts with higher-level sentence features. Models with pretrained encoders benefit from representative embeddings, and HAN-RoBERTa achieves a gain of 5.49%. Flat context paths contain useful information to help detect the argument impact, but they also involve some noise from unrelated standpoints. Interval segment embeddings do not reduce noise but confuse BERT. Making the segment embeddings depend on whether the distance is odd or even is counterintuitive for BERT, which uses them for next sentence prediction. Since XLNet uses relative segment encodings instead of segment embeddings, Interval-XLNet is better than Flat-XLNet in all three metrics. On the other hand, context masks bring another side effect for BERT, RoBERTa, and XLNet. Although these masks limit the attention scope at first sight, distant word information is still able to flow to words as the number of transformer layers increases. As a result, the uncertainty and attention bias increase after adding context masks. The memory storing context representations is also not helpful. The main reason is that the last claim's update signal cannot be used to update previous context representations. That is, Memory-models degenerate to models with frozen path features or even worse. The proposed DISCOC can capture useful contexts and fuse them in a comprehensive manner. Finally, DISCOC outperforms the second-best model Flat-BERT by 1.61%, its backbone Flat-RoBERTa by 1.67%, and the previous best model BERT by 2.38%.
Table 4: The averages and standard deviations of different models on the argument impact classification. The marker * refers to p-value < 0.05 and the marker ** refers to p-value < 0.001 in t-test compared with DISCOC.

Influence of the Context Path Length
Different claims have different contexts. In Table 4, we only report the best performance with a fixed maximum context path length. Figure 5 shows the F1 scores of models with different hyper-parameters. DISCOC always benefits from longer discourse contexts, while other models fluctuate in performance. Most models can handle one context claim, which is consistent with our idea of pair-wise encoding. DISCOC has consistent performance gains, whereas other models fail to benefit from long-distance structures. Each token in Flat-RoBERTa and Interval-RoBERTa can attend to all other tokens, and the two are the most competitive baselines. However, Context-RoBERTa and Memory-RoBERTa limit the attention scope to the tokens of one previous claim, making these models unable to make use of long-distance context information.

RoBERTa vs. BERT
As shown in Table 4, there is little difference between the performance of RoBERTa variants and that of BERT variants. We conduct an experiment with DISCOC (E-BERT), which uses BERT as the encoder, and report it in Table 5. It achieves a significant boost of 1.29%, and the gap between it and DISCOC is small.

Are Stances and Discourse Senses Helpful?
We also remove either the stance token embedding or the discourse sense embeddings from DISCOC. The results in Table 5 suggest that both kinds of structure are essential for modeling the correlation between the parent claim and the child claim. By comparison, the discourse sense embeddings are more vital.

Are Gated Transformers Necessary?
We add a gated transformer layer to gather sentence-level vectors. Such gathering is necessary for the proposed framework because each claim can only attend to limited contexts. BiLSTM and convolutions can also be used for this purpose, so we replace the gated transformer layer with a BiLSTM or a convolutional layer. Moreover, we also remove it and make predictions from $u_l$ directly. The results in Table 5 show that the gated transformer is an irreplaceable part of DISCOC because it retains the contextualized representations and maintains their scales. Simply removing it hurts recall enormously.

High-coefficient Discourse Relation Patterns
We use Logistic Regression to mine several interesting discourse relation patterns. Detailed settings are described in Appendix A, and the results, including the highest-coefficient patterns, are listed in Table 6. We observe that some discourse relation path patterns are distinctive for classifying individual impact labels. Instantiation is a typical relation that only occurs in the top patterns of Impactful. Restatement is also relatively frequent for Impactful (5 of the top 10), but it is the relation between the grandparent and the parent. [...] That indicates some views are usually considered ordinary in complex structures. Conjunction is the dominant relation (8 of the top 10), which suggests avoiding simply going along with others. The case of Not Impactful is a little clearer, in the sense that it has a unique relation, Chosen Alternative, as one of the most significant patterns. Restatement also appears frequently, showing that neither generalization, nor specification, nor paraphrasing of others' views helps make claims stand out.
Table 7: F1 score differences between two best models on top 9 discourse relation patterns and all patterns.

Case Study
In Appendix A, we define $\mathrm{Pr}(r_1, \dots, r_l)$ as the joint probability of generating the discourse relation path $(r_1, \dots, r_l)$ given the context $(C_0, C_1, \dots, C_{l-1})$ and the claim $C_l$. For example, $\mathrm{Pr}(\text{Reason}, \text{Contrast})$ is 56.59% for the Impactful claim "There is no evidence for this" with its parent claim "Our bodies know how to recognise and process current foods; changing them through genetic modification will create health issues". Furthermore, after sorting by $\mathrm{Pr}(\text{Reason}, \text{Contrast})$, we find that 5 of the top 5 and 8 of the top 10 claims are voted Impactful. For the complex pattern Restatement-Restatement, which appears in the top patterns of both Impactful and Not Impactful, the 3 cases with the highest probabilities are Not Impactful while the following 7 cases are Impactful. Interestingly, the top 3 claims share the same thesis, a discussion about an American politician. There are 25 Impactful claims and 22 Not Impactful claims in this topic, 24 of which are restatements of their parent claims. As for Restatement-Reason, the top pattern of Not Impactful, we find that 7 of the top 10 claims are relevant to politics, 2 are about globalization, and one is food-related. Therefore, there is no perfect answer in these quite controversial topics, and that is why Restatement and Reason appear frequently.
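As a rough illustration of how such a path probability could be computed from the predicted distributions, the sketch below multiplies the per-pair probabilities. Treating the adjacent pairs as independent is an assumption made here; the exact definition is the one given in Appendix A. The helper name, the three-relation inventory, and the numbers are hypothetical.

```python
import math
from typing import List, Sequence


def path_probability(
    p_dist: Sequence[Sequence[float]],
    relation_path: Sequence[str],
    relations: Sequence[str],
) -> float:
    """Product of d_i[r_i] over the path, assuming independent adjacent pairs."""
    if len(p_dist) != len(relation_path):
        raise ValueError("need exactly one distribution per relation in the path")
    return math.prod(
        d[relations.index(r)] for d, r in zip(p_dist, relation_path)
    )


# Hypothetical toy inventory and distributions for a path of length 2.
TOY_RELATIONS: List[str] = ["Reason", "Contrast", "Restatement"]
toy_p_dist = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]

score = path_probability(toy_p_dist, ["Reason", "Contrast"], TOY_RELATIONS)
print(round(score, 4))
```

Claims can then be ranked by this score for a fixed pattern, which is the kind of sorting used in the case study above.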

Empirical Results
We also check the performance on testing examples to verify the effectiveness of these discourse relation patterns. We choose the best DISCOC model, whose F1 score is 59.04%, and the best DISCOC (w/o DiscoE) model, whose F1 score is 58.06%. We select testing examples with specific discourse patterns, and the performance differences are shown in Table 7. DISCOC benefits from 7 of the top 9 patterns, and the performance margins are even larger than the improvement in the overall results. Without discourse relation patterns, the model still has trouble capturing such implicit context influences. These empirical results support our idea that implicit discourse relations can affect persuasiveness.

Related Work
There is increasing interest in computational argumentation for evaluating the qualitative impact of arguments based on corpora extracted from Web argumentation sources such as the CMV sub-forum of Reddit (Tan et al., 2016). Studies have explored the importance and effectiveness of various factors in determining the persuasiveness and convincingness of arguments, such as surface texture, social interaction and argumentation-related features (Wei et al., 2016), characteristics of the source and audience (Shmueli-Scheuer et al., 2019; Durmus and Cardie, 2018), sequence ordering of arguments (Hidey and McKeown, 2018), and argument structure features. The style feature has also proved significant in evaluating the persuasiveness of news editorial argumentation (Baff et al., 2020). Habernal and Gurevych (2016) conducted experiments in an entirely empirical manner, constructing a corpus for argument quality label classification and proposing several neural network models.
In addition to the features mentioned above, the role of pragmatic and discourse contexts has been shown to be crucial but is not yet fully explored. Zeng et al. (2020) examined how the contexts and the dynamic progress of argumentative conversations influence the comparative persuasiveness of an argumentation process. Durmus et al. (2019) created a new dataset based on argument claims and impact votes from the debate platform kialo.com, and experiments showed that incorporating contexts is useful for classifying the argument impact.
Understanding discourse relations is one of the fundamental tasks of natural language understanding, and it is beneficial for various downstream tasks such as sentiment analysis (Nejat et al., 2017; Bhatia et al., 2015), machine translation (Li et al., 2014), and text generation (Bosselut et al., 2018). Discourse information is also considered indicative for various tasks of computational argumentation. Eckle-Kohler et al. (2015) analyzed the role of discourse markers in discriminating claims and premises in argumentative discourse and found that particular semantic groups of discourse markers are highly predictive features. Hidey and McKeown (2018) concatenated sentence vectors with discourse relation embeddings as sentence features for persuasiveness prediction and showed that discourse embeddings helped improve performance.

Conclusion
In this paper, we explicitly investigate how discourse structures influence the impact and the persuasiveness of an argument claim. We present DISCOC to produce discourse-dependent contextualized representations. Experiments and ablation studies show that our model improves over its backbone RoBERTa by around 1.67%. In contrast, HAN and other attention mechanisms bring side effects. We discover distinct discourse relation path patterns and analyze representative cases. In the future, we plan to explore discourse structures in other NLU tasks.