How to disagree well: Investigating the dispute tactics used on Wikipedia

Disagreements are frequently studied from the perspective of either detecting toxicity or analysing argument structure. We propose a framework of dispute tactics which unifies these two perspectives, as well as other dialogue acts which play a role in resolving disputes, such as asking questions and providing clarification. This framework includes a preferential ordering among rebuttal-type tactics, ranging from ad hominem attacks to refuting the central argument. Using this framework, we annotate 213 disagreements (3,865 utterances) from Wikipedia Talk pages. This allows us to investigate research questions around the tactics used in disagreements; for instance, we provide empirical validation of the approach to disagreement recommended by Wikipedia. We develop models for multilabel prediction of dispute tactics in an utterance, achieving the best performance with a transformer-based label powerset model. Adding an auxiliary task to incorporate the ordering of rebuttal tactics further yields a statistically significant increase. Finally, we show that these annotations can be used to provide useful additional signals to improve performance on the task of predicting escalation.


Introduction
Disagreements are pervasive in online communication. While usually perceived as a negative phenomenon, research in psychology has shown that debate and disagreement can promote beliefs that are better supported by evidence (Landemore, 2012; Mercier, 2018). Hallsson and Kappel (2020) argue that, in the ideal case, a range of arguments are considered for both sides, increasing the participants' ability to find good arguments in defence of a view and to critique reasons against it, and thereby forcing one to interact with reasoning and evidence which may otherwise be discarded due to confirmation bias.
Prior work in NLP has investigated detecting negative artefacts of online disagreements such as personal attacks or hate speech, e.g. Wulczyn et al. (2017) and Waseem and Hovy (2016). Research in argumentation mining instead looks at identifying argument structures (Lawrence and Reed, 2020) and inferring the quality of arguments (Habernal and Gurevych, 2016), often focusing on classical theories of argumentation (e.g. Aristotelian) which do not include less desirable aspects such as personal attacks. However, real-world disagreements often contain both well-structured arguments and attacks, in addition to other dialogue acts, such as asking for and providing clarification.
In this work, we propose a framework of dispute tactics consisting of rebuttal and coordination strategies, denoting the role of a particular utterance in the context of a disagreement discussion. We build on the disagreement hierarchy proposed by Graham (2008), which includes a preferential ordering between different rebuttal tactics. We introduce WikiTactics: a set of 213 disputes (comprising 3,865 utterances) on Wikipedia Talk pages, manually annotated with the dispute tactics employed in the process of resolving a disagreement between editors. These multiturn, multiparty conversations are sourced from the WikiDisputes dataset (De Kock and Vlachos, 2021), which is annotated according to whether the dispute was resolved without the need for a moderator. An example of such a conversation is shown in Fig 1.
Using this framework and data, we investigate a number of research questions related to disagreements. Firstly, we find that a lower mean rebuttal level in a disagreement is correlated with less constructive dispute resolutions, providing empirical validation of the ordering proposed by Graham (2008) and recommended by Wikipedia to its editors. Individual users are found to utilise a range of rebuttal levels more often than adhering to only the top or bottom of the rebuttal hierarchy. We quantify the use of mirroring in disagreements by observing how users deviate from their own mean rebuttal level depending on the rebuttal level used in a conversation, finding that mirroring takes place in 57% of cases. We further develop models for predicting the dispute tactics used in an utterance as a multilabel classification task, experimenting with both the binary relevance and label powerset approaches, as well as models with and without conversation context. Our best model is a transformer-based model which uses a label powerset approach and context. A statistically significant improvement is gained by taking into account the preferential ordering of the rebuttal tactics defined in our scheme using multi-task training. Finally, we illustrate that these annotations can be leveraged to improve performance on the task of predicting whether a dispute will be resolved without escalating to a moderator.

[Figure 1: An example dispute from WikiTactics, "Fenugreek: A herb / an herb", in which editors debate the article's wording under the WP:ENGVAR style policy.]

Online disagreements
Wikipedia Talk pages are a popular source for NLP studies of goal-oriented discussions (e.g. Niculae and Danescu-Niculescu-Mizil, 2016; Kittur et al., 2009). The Talk pages are linked to specific articles and are used to coordinate edits and, in some cases, to resolve disputes. WikiDisputes (De Kock and Vlachos, 2021) is a dataset of Talk page discussions tagged as "disputes" by editors. The dataset provides an "escalation" label for each dispute: under the Wikipedia dispute resolution policy, disputes which cannot be resolved by editors themselves are escalated to mediation, which is considered to be a proxy for constructiveness.
Wikipedia recommends the hierarchy of disagreement formulated by Graham (2008) as a guide for constructive dispute resolution (Wikipedia, 2020). Graham's hierarchy posits that there are seven levels of disagreement, ranging from name-calling (at the bottom) to refuting the central point. Descriptions for these levels are provided in Fig 3 in App A. Graham's hierarchy has been used as a framework for research in various fields, including healthcare (Cope, 2015), education (Phelps et al., 2019) and sociology (Neven et al., 2019). Within NLP, Tang and Wang (2017) have used this taxonomy in combination with LDA topic models to distantly annotate and analyse the rationality of online discussions. Despite its popularity, this hierarchy has not been verified empirically.
Individual levels of this hierarchy have been considered in prior work. For instance, name-calling and ad hominem attacks (levels 0 and 1) can be considered subsets of personal attacks (Waseem and Hovy, 2016; Chang and Danescu-Niculescu-Mizil, 2019). Counterarguments (level 4) and refutations (level 5) are described in terms of arguments and evidence, a topic that has received extensive attention in the field of argumentation mining, which seeks to categorise different elements of an argument as claim or premise (see, e.g., Lawrence and Reed (2020)). Contradiction (level 3), described as stating the opposing case, is related to the task of stance detection (Küçük and Can, 2020). Graham's hierarchy combines these different concepts and proposes that there is a preferential ordering among them which correlates with more favourable disagreement outcomes. Walker et al. (2012) also focus on online disagreements and introduced the Internet Arguments Corpus (IAC). This corpus contains utterance-response pairs annotated for the degree of agreement as well as whether the response is emotional versus factual, respectful versus insulting, asking questions versus asserting, and negotiating versus attacking. Some of these markers can be mapped to Graham's hierarchy; however, Walker et al. (2012) consider pairs of utterances, whereas our aim is to understand disagreements as multiturn conversations. Notably, their taxonomy illustrates that different disagreement markers can be used in combination: an utterance may be insulting and attacking while still being factual. It also captures the fact that, within a disagreement, not all discussion is necessarily aimed at countering a proposition by an opponent; negotiating compromises and asking questions play an important part in how disputes are resolved.
Another taxonomy that considers concepts related to Graham's hierarchy is that of Benesch et al. (2016), on the topic of counterspeech. For instance, "presentation of facts to correct misstatements or misperceptions" is similar to refutation, the top level of the rebuttal hierarchy. "Denouncing hateful speech" is similar to contradiction, and "hostile denouncing" is similar to name-calling and ad hominem attacks. This taxonomy does not provide the explicit arrangement in levels that Graham (2008) does. Graham (2008) further describes disagreements more abstractly, whereas counterspeech can be thought of as a specific form of disagreement.
Recent work in argumentation has explored the notion of argument quality. For example, Lukin et al. (2017) investigate the effect of personality factors in an audience on the convincingness of an argument. Habernal and Gurevych (2016) compare the convincingness of arguments across a range of topics, and propose a decision tree for motivating why an argument is more convincing. This includes aspects such as providing an explanation with facts versus attacking the opponent, which aligns with some of the rebuttal levels described above. However, the aforementioned work on argumentation quality does not consider the conversation context but rather looks at monologic arguments.

Disagreement annotation schema
In this work, we distinguish between two broad categories of dispute tactics: utterances aimed at countering the point of an opponent (which we term rebuttal tactics) and attempts to promote understanding and consensus (referred to as coordination tactics).
For rebuttal tactics, Graham's hierarchy provides a useful basis. We expand the original taxonomy to include categories for "repeated argument" and "attempted derailing or off-topic comments". The proposed levels and the preferential ordering are shown in Table 1, with complete definitions in Table 3 in App B.
For the coordination tactics, we draw on previous work on Wikipedia Talk page discussions. The taxonomy of Ferschke et al. (2012) contains the "information content" class, which encapsulates providing, seeking or correcting information; similarly, we annotate both asking questions and providing information. As conversations on this platform are inherently goal-oriented, some discussion is often expended on coordinating edits (Viegas et al., 2007; Schneider et al., 2011); we include a class to capture this. We further use contextualisation to describe the "opening statement" of a dispute on Wikipedia. As discussed in Sec 2, the IAC also accounts for negotiating compromises, which is encouraged by the platform's dispute resolution policy; we also annotate for this category. Finally, we annotate three "retreat" moves: saying I don't know, conceding or recanting of a position, and giving up or bailing out. Full definitions for these classes are contained in Table 4 in App C. Unlike with the disagreement levels used for rebuttal moves, there is no ordering among these classes.
Per the work of Walker et al. (2012) and Viegas et al. (2007), we use multilabel annotation.We allow for up to three rebuttal strategies and two resolution strategies per utterance.

Data annotation
Using the taxonomy described above, a set of 213 disputes (3,865 utterances) was annotated.The disputes were sampled from WikiDisputes to contain approximately the same number of escalated and non-escalated disputes.The median conversation length is 21 utterances (minimum 5, maximum 44), with an average utterance length of 54 tokens.The average number of speakers is 4.
Three initial rounds of pilot annotation were carried out to determine the appropriateness of the taxonomy for the data. During each round, five disputes (totalling 194 utterances) were annotated independently by an experienced professional annotator and one of the authors. Following each round, the annotations were compared and definitions expanded to reduce uncertainty. The agreement scores after each round are shown in Table 5 in App D. The Cohen's κ improved after each pilot round, as did the Pearson's r. The Pearson correlation takes the ordinality of the different levels into account, while this is ignored by the κ score.
An initial source of disparity between the annotators was the fact that Graham's taxonomy describes characteristics of individual responses that express disagreement, as opposed to a thread of utterances in which a topic is discussed.For instance, the notion of the "central point" (DH6/7) can take on different meanings when a full thread is considered; referring to either the root of the dispute, or the essence of an argument.In the latter case, the central point drifts as various arguments are evaluated, whereas it would remain static in the former case.In our annotation, we opted to define the central point as relative to the utterance to which one is responding (i.e. the central point drifts).
The main annotation round was executed by the expert annotator, using the definitions set out in App B and C. An example from the dataset is shown in Fig 1. The most common label identified is "counterargument" (DH5) with 996 uses, followed by "coordinating edits" (a coordination tactic) with 972 uses. Approximately a quarter of utterances were assigned more than one label during our annotation. Frequent label combinations are discussed in Sec 4.2.

Analysing dispute tactics
In this section, we investigate four research questions regarding theories of disagreement:
1. Does the mean rebuttal level in a conversation correlate with escalation?
2. Which tactics co-occur with personal attacks?
3. What effect do personal attacks have on a conversation?
4. Do individual users adhere to certain levels of the rebuttal hierarchy?

Correlation with escalation
We are interested in how informative the dispute tactics are in predicting the outcome of a dispute. We consider the escalation tags provided in WikiDisputes and calculate their Spearman correlation with the mean observed rebuttal level in a conversation. We use the micro-averaged mean over all rebuttal scores assigned during the conversation (i.e. excluding coordination utterances). Let C = [(u_1, r_1, c_1), ..., (u_n, r_n, c_n)] denote a disagreement consisting of utterances u, rebuttal labelsets r and coordination labelsets c. l represents a mapping of rebuttal labelsets to the numeric values of their levels, as set out in Table 1 (RH0-RH7).
We then calculate the micro-averaged mean rebuttal level,

r̄_micro(C) = ( Σ_{i=1}^{n} Σ_{v ∈ l(r_i)} v ) / ( Σ_{i=1}^{n} |l(r_i)| ).

This yields a weak negative correlation (ρ = −0.19, P = 0.005). We also experimented with a macro-averaged mean, which averages first at the utterance level and then at the disagreement level, over the utterances with at least one rebuttal label:

r̄_macro(C) = (1/n) Σ_{i=1}^{n} ( Σ_{v ∈ l(r_i)} v ) / |l(r_i)|.

This yielded similar results (ρ = −0.24, P = 0.001). Using the mean rebuttal level is an imperfect metric as it ignores temporal order: a dispute starting on a high level but ending with name-calling would have the same mean rebuttal level as the inverse. Nevertheless, this result affirms the intuition that using higher-level rebuttals correlates with more constructive outcomes to a disagreement.

[Figure 2: An example dispute from WikiTactics over whether Tamils as a whole constitute a "stateless nation", with the rebuttal level annotated for each utterance.]
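As a minimal sketch, the two averaging schemes above can be computed as follows. The conversation is an invented example, with each utterance represented by its set of numeric rebuttal levels (empty for coordination-only utterances).

```python
# Micro- vs macro-averaged mean rebuttal level, as defined above.
# Coordination-only utterances (empty labelsets) are excluded.

def micro_mean(conversation):
    # Pool every rebuttal label across the conversation, then average.
    levels = [v for labelset in conversation for v in labelset]
    return sum(levels) / len(levels)

def macro_mean(conversation):
    # Average within each utterance first, then across utterances.
    per_utt = [sum(ls) / len(ls) for ls in conversation if ls]
    return sum(per_utt) / len(per_utt)

convo = [{5}, {1, 4}, set(), {2}]  # hypothetical rebuttal labelsets
print(micro_mean(convo))  # (5 + 1 + 4 + 2) / 4 = 3.0
print(macro_mean(convo))  # (5 + 2.5 + 2) / 3 ≈ 3.167
```

The two means differ whenever multilabel utterances exist, since the micro mean weights each label equally while the macro mean weights each utterance equally.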

The context of personal attacks
Our annotation scheme defines two types of personal attacks: name calling, insults and hostility (level 0; hereafter referred to as "name calling") and attacks on the credibility of the person or the argument (level 1; hereafter referred to as "credibility attacks"). We refer to these levels jointly as personal attacks, though level 1 also covers attacks on an opponent's argument. Level 1 attacks are the third most common label in the dataset with 575 cases, whereas level 0 attacks occur in only 65 cases. This is aligned with the low levels of toxicity associated with Wikipedia Talk pages (Wulczyn et al., 2017), but provides the additional insight that members of the Wikipedia community may still be disparaging while avoiding direct insults.
To investigate the context in which such attacks occur, we evaluate co-occurrences in our multilabel annotation. The most common multilabel combination was credibility attacks with counterargument (119 occurrences), which is observed more often than credibility attacks alone. The second most common combination is credibility attacks with repeated argument. To evaluate how much more often than chance the classes co-occur, we calculate the pointwise mutual information (PMI; Jurafsky and Martin, 2008) of each label x with personal attacks y:

PMI(x; y) = log ( p(x, y) / (p(x) p(y)) ).

The PMI values for counterargument and repeated argument with credibility attacks are positive, which indicates that they indeed have a larger than chance co-occurrence with attacks. The coordination labels (excluding bailing out) have the largest negative PMI values, indicating a less than chance co-occurrence. The full results are shown in Table 6 of App E.
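A minimal sketch of this PMI computation, assuming probabilities are estimated by simple counts over utterances (the counts below are invented, not taken from WikiTactics):

```python
# PMI(x; y) = log p(x, y) / (p(x) p(y)), with count-based estimates.
import math

def pmi(n_xy, n_x, n_y, n_total):
    """n_xy: utterances with both labels; n_x, n_y: marginal counts."""
    p_xy = n_xy / n_total
    p_x, p_y = n_x / n_total, n_y / n_total
    return math.log(p_xy / (p_x * p_y))

# Labels that co-occur more often than chance give positive PMI...
print(pmi(n_xy=30, n_x=100, n_y=60, n_total=1000))  # log(5) ≈ 1.609
# ...and less often than chance, negative PMI.
print(pmi(n_xy=3, n_x=100, n_y=60, n_total=1000))   # log(0.5) < 0
```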

The effects of personal attacks
Previous work has framed personal attacks as a sign of conversational failure (e.g. Zhang et al., 2018). Given the above observation that they may co-occur with higher-level rebuttals, we are interested in their overall effect on a conversation, and therefore calculate the probability of a conversation recovering after a personal attack occurs. We define recovery in terms of having a subsequent utterance labelled as rebuttal level 5 or higher and no further personal attacks. By this definition, half of the disputes were found to recover after a personal attack, indicating that personal attacks do not necessarily result in conversational failure.
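The recovery criterion can be sketched as follows, with each utterance represented by its set of rebuttal levels; the example conversations are invented.

```python
# After the first personal attack (levels 0-1), a dispute "recovers" if
# some later utterance reaches rebuttal level 5 or higher and no further
# personal attacks occur.

ATTACK_LEVELS = {0, 1}

def recovers(levels):
    """levels: list of per-utterance rebuttal-level sets, in order."""
    attacks = [i for i, ls in enumerate(levels) if ls & ATTACK_LEVELS]
    if not attacks:
        return None  # no personal attack to recover from
    tail = levels[attacks[0] + 1:]
    no_more_attacks = not any(ls & ATTACK_LEVELS for ls in tail)
    high_rebuttal = any(max(ls, default=-1) >= 5 for ls in tail)
    return no_more_attacks and high_rebuttal

print(recovers([{4}, {1}, {5}, {6}]))  # True: attack, then level >= 5 only
print(recovers([{4}, {1}, {3}, {0}]))  # False: another attack follows
```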
Personal attacks are more frequently found in escalated conversations: 60.7% of personal attacks are in this category (with the total number of escalated and non-escalated disputes being balanced). Of the escalated disputes with personal attacks, only 44.3% are found to recover, whereas 59.2% of resolved disputes recover post attack. This indicates that although personal attacks also occur in non-escalated disputes, their participants are more adept at moving beyond them.
We further find that immediate retaliation (i.e. a personal attack being followed by another personal attack) occurred in 25.7% of cases.In disputes where at least one personal attack had occurred, the probability that the initial offender will re-offend in the same conversation is 53%, while the probability of another user using a personal attack at some point subsequently is 64%.

Individual user rebuttal levels
WikiTactics contains utterances by 734 individuals.For users with more than 1 utterance (535 users), the median difference between the highest and lowest rebuttal level employed is 4, indicating that speakers utilise a range of strategies.167 users were found to only use levels 4 and higher whereas 18 used only levels 3 and below.
Mirroring (Meltzoff and Prinz, 2002) refers to a social phenomenon where speakers reflect the behaviour of others in a conversation. To identify whether this occurs in our data, we identify users who participated in more than one dispute (66 users). To determine whether a user u is mirroring the rebuttal levels of others in a dispute c, we calculate

m_{u,c} = (r̄_{u,c} − r̄_u) · (r̄_c − r̄_u),

where r̄_u represents the user's mean rebuttal level overall, r̄_{u,c} denotes the user's mean in dispute c, and r̄_c is dispute c's mean rebuttal level excluding contributions from user u. m_{u,c} represents the extent to which an individual's deviation from their own mean in a conversation is explained by the setting, and takes on a positive value if the changes are in the same direction. In our data, a positive m_{u,c} is observed in 57% of cases, indicating that mirroring does take place.
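A sketch of this mirroring statistic: it is positive when the user's deviation from their own overall mean moves in the same direction as the rest of the conversation. The product form and the toy values are illustrative, matching the sign behaviour described in the text.

```python
# Mirroring statistic: product of the user's deviation and the
# conversation's deviation from the user's overall mean rebuttal level.

def mirroring(user_mean, user_mean_in_c, others_mean_in_c):
    return (user_mean_in_c - user_mean) * (others_mean_in_c - user_mean)

# User dropped below their usual level in a lower-level conversation:
print(mirroring(4.0, 3.0, 2.5) > 0)  # True: deviations agree (mirroring)
# User rose above their usual level against a lower-level setting:
print(mirroring(4.0, 4.5, 2.5) > 0)  # False: deviations disagree
```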

Multilabel classification
Predicting the dispute tactics used in an utterance is a multilabel classification task, meaning that for each utterance, any subset of L classes (referred to as a labelset) can be assigned as its label. A vector representation for a labelset can be constructed as y_k = [y_{k1}, y_{k2}, ..., y_{kL}], where y_{kj} = 1 if and only if the j-th of the L labels is relevant to the k-th example x_k, and 0 otherwise. A simple method for multilabel modelling is binary relevance (BR) classification, under which a set of L binary classifiers are trained to each predict independently whether a label applies or not (Boutell et al., 2004). Another variation of the problem, referred to as the label powerset (LP) method (Tsoumakas et al., 2010), frames it as a multiclass classification problem over the powerset of all possible label combinations (2^L combinations). Both of these approaches have been found to perform well in practice (see, e.g., Ferreira and Vlachos, 2019).
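The two problem transformations can be contrasted on a toy label space (L = 3); the labelsets below are invented examples, not drawn from WikiTactics.

```python
# Binary relevance vs label powerset target construction.

L = 3
labelsets = [{0}, {0, 2}, {1}, {0, 2}]

# Binary relevance: one L-dimensional 0/1 vector per example,
# each coordinate handled by an independent binary classifier.
br_targets = [[int(j in ls) for j in range(L)] for ls in labelsets]

# Label powerset: each distinct label combination becomes one class
# of a single multiclass problem (at most 2**L classes).
classes = sorted({frozenset(ls) for ls in labelsets}, key=sorted)
lp_targets = [classes.index(frozenset(ls)) for ls in labelsets]

print(br_targets)  # [[1, 0, 0], [1, 0, 1], [0, 1, 0], [1, 0, 1]]
print(lp_targets)  # [0, 1, 2, 1]: only 3 of the 8 possible classes occur
```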
With the advent of deep learning, Nam et al. (2014) proposed an alternative solution whereby the labelset vectors are directly predicted by a neural network model. This is achieved by modifying the output layer of a traditional multiclass neural network such that the softmax activation is replaced by a sigmoid function over every label in the output vector, and training with cross-entropy loss. Since the sigmoid is applied to every output independently to determine its relevance to an utterance, we refer to this as a BR method; however, unlike other BR methods, the model is able to capture dependencies between classes.
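The essential difference from a softmax output can be sketched without any deep learning framework: each of the L logits is squashed independently and thresholded, rather than normalised jointly. The logits below are invented.

```python
# Sigmoid output layer for multilabel prediction: each label is
# judged independently, so any number of labels may be assigned.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_labelset(logits, threshold=0.5):
    return [int(sigmoid(z) >= threshold) for z in logits]

print(predict_labelset([2.0, -1.0, 0.3]))  # [1, 0, 1]: two labels assigned
```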

Metrics
Evaluating multilabel models requires some consideration. Simple accuracy, also referred to as exact match ratio (EMR) in this setting, calculates the fraction of samples for which the full labelset is predicted correctly (Sorower, 2010). This metric can be overly harsh, as it assigns no credit to partially correct predictions. A popular alternative is the Hamming loss, which measures the fraction of incorrectly assigned labels (or, alternatively, the Hamming score measures the proportion of labels predicted correctly). For the k-th example with a predicted label vector ŷ_k, this would be

HammingLoss(y_k, ŷ_k) = (1/L) Σ_{j=1}^{L} 1[y_{kj} ≠ ŷ_{kj}].

This metric can be overly generous in cases where the label assignment matrix is sparse, as is the case with our data: 75% of utterances are assigned a single label out of 18 classes. The Jaccard score examines only the proportion of correctly predicted positive labels out of the potential positive set, and is seen as a midpoint between Hamming and EMR as two extremes (Park and Read, 2018):

Jaccard(y_k, ŷ_k) = |{j : y_{kj} = 1} ∩ {j : ŷ_{kj} = 1}| / |{j : y_{kj} = 1} ∪ {j : ŷ_{kj} = 1}|.

In our evaluation, we report all three metrics as they capture different perspectives, but prioritise the Jaccard score.
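The three metrics can be sketched as follows for 0/1 label vectors; the example predictions are invented.

```python
# Exact match ratio, Hamming score and Jaccard score for multilabel
# predictions, averaged over examples.

def emr(Y, Yhat):
    return sum(y == yh for y, yh in zip(Y, Yhat)) / len(Y)

def hamming_score(Y, Yhat):
    per_ex = [sum(a == b for a, b in zip(y, yh)) / len(y)
              for y, yh in zip(Y, Yhat)]
    return sum(per_ex) / len(per_ex)

def jaccard(Y, Yhat):
    def j(y, yh):
        inter = sum(a and b for a, b in zip(y, yh))
        union = sum(a or b for a, b in zip(y, yh))
        return inter / union if union else 1.0
    return sum(j(y, yh) for y, yh in zip(Y, Yhat)) / len(Y)

Y    = [[1, 0, 0, 1], [0, 1, 0, 0]]
Yhat = [[1, 0, 0, 0], [0, 1, 0, 0]]
print(emr(Y, Yhat))            # 0.5: only the second labelset matches exactly
print(hamming_score(Y, Yhat))  # (3/4 + 4/4) / 2 = 0.875
print(jaccard(Y, Yhat))        # (1/2 + 1/1) / 2 = 0.75
```

Note how a single missed positive label costs the first example everything under EMR, almost nothing under the Hamming score, and half its credit under Jaccard.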

Experimental setting
Truncated LP As mentioned in Sec 5, the LP method is a popular modelling choice for multilabel classification. In our case, a large number of labels are being considered relative to the number of samples; thus, we do not use the full set of all label combinations, but instead consider only the 20 most commonly applied labelsets in the training set. This method provides coverage of 85% of samples in the dataset. For samples whose labelsets are not in this category, we find the largest subset of their labels which does fall in the top 20 labelsets. If none of their labels qualify, which is the case for 175 utterances, we ignore these samples during training, but keep them in the test set.
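The truncation can be sketched as follows: keep the K most frequent labelsets, and map every other example to the largest kept subset of its labels, if one exists. K and the data below are toy values (the paper uses the top 20 labelsets).

```python
# Truncated label powerset: restrict LP classes to the K most common
# labelsets; examples with no kept subset are ignored during training.
from collections import Counter

def truncate_lp(labelsets, k):
    top = [ls for ls, _ in Counter(map(frozenset, labelsets)).most_common(k)]
    kept = []
    for ls in map(frozenset, labelsets):
        # largest kept subset of this example's labels, if any
        subsets = [t for t in top if t <= ls]
        kept.append(max(subsets, key=len) if subsets else None)
    return top, kept

data = [{0}, {0}, {0, 2}, {1, 3}, {0, 2}, {4}]
top, kept = truncate_lp(data, k=2)
print(sorted(map(sorted, top)))  # [[0], [0, 2]]: the two most common sets
print(kept[3], kept[5])          # None None: no kept subset, dropped
```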
To allow for comparison between model types, we cast the predictions made by the LP model back to the multilabel setting for evaluation.
Incorporating conversation context While labels are assigned at the utterance level, some of the classes we aim to predict require knowledge of the preceding conversation context; for instance, "repeated argument" and "refutation" both occur in relation to another utterance.We experiment with both the context-agnostic method (to provide a baseline) and with models that incorporate the preceding context.

Predicting ordinality
The abovementioned models do not incorporate knowledge of the preferential ordering of the rebuttal tactics. To include this signal, we add an auxiliary task which predicts the direction of the rebuttal level of the current utterance relative to the previous level, if a rebuttal tactic is used. Using Fig 2 as an example, for utterance 2 the model would need to predict both level 4 (main task) and a downward transition (auxiliary task) relative to utterance 1. Further labels are added for upward and same-level transitions, as well as a separate class for coordination strategies, which have no ordering. For the first utterance, level 3 is used as the reference level. If an utterance has multiple rebuttal labels, as indicated by the vertical lines in the figure, the maximum value was used as the reference point.
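Deriving the auxiliary targets can be sketched as follows, with each utterance again represented by its set of rebuttal levels (empty for coordination-only utterances); the conversation is invented.

```python
# Auxiliary transition labels: "up", "down" or "same" relative to the
# previous reference level (the maximum rebuttal level of an utterance);
# coordination-only utterances get a separate "coord" class. Level 3 is
# the reference for the first utterance, as described in the text.

def transition_labels(rebuttal_sets, start_ref=3):
    out, ref = [], start_ref
    for ls in rebuttal_sets:
        if not ls:
            out.append("coord")
            continue
        level = max(ls)
        out.append("up" if level > ref else "down" if level < ref else "same")
        ref = level
    return out

print(transition_labels([{5}, {1, 4}, set(), {4}]))
# ['up', 'down', 'coord', 'same']
```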

Text encoders
The following models are evaluated in our experiments:
• BoW: A bag-of-words encoding of an utterance is processed by a two-layer multilayer perceptron. For the context-aware version, the preceding utterances are combined and encoded in the same manner, and the two vectors are concatenated as input to the model.
• LSTM: A recurrent neural network with long short-term memory (Hochreiter and Schmidhuber, 1997) is used to process word embeddings (GloVe; Pennington et al., 2014). To incorporate conversation context, a hierarchical attention network (HAN; Yang et al., 2016) is used, which incorporates dialogue structure by encoding word embeddings using an LSTM layer followed by an attention mechanism to build up utterance embeddings. Utterance embeddings are similarly combined to form a context embedding.
• BERT: Two fully-connected layers are added to a pretrained transformer-based language model (Devlin et al., 2018) and finetuned for our classification task. We use the "BERT-BASE-CASED" model from HUGGINGFACE.
Implementation We split the data into train, test and validation sets with a ratio of 70-20-10, and employ early stopping based on the validation loss.
In the case of LP classification, the label with the largest score can be considered as the predicted class.In the BR setting a threshold must be set to determine whether a given label should be assigned.This is calibrated using a development set.
All models are implemented in Keras. All models use dropout (Srivastava et al., 2014) with p = 0.5 and the Adam optimiser (Kingma and Ba, 2014) with learning rate 0.001, except for the transformer-based models, which use a learning rate of 2e-5.

Results
Our experimental results are shown in Table 2.We report results for the models described in Sec 6, using both context-agnostic and context-aware settings, with binary relevance and the truncated LP formulations.As per Sec 5.2, we report the Jaccard score, the Hamming loss and the EMR, but we prioritise the Jaccard score.
As expected, the LSTM and BERT models tend to outperform the BoW models.Adding conversation context improves performance in all models on the Jaccard score.The best performing model on the Jaccard metric is the BERT model with context.
The truncated LP method achieves better results for the majority of models and metrics compared to the binary relevance formulation, despite truncating the label powerset and ignoring 15% of the training data. This can be attributed to the fact that there are certain classes with stronger co-occurrence relations (e.g. personal attacks with higher-level arguments), which benefits the LP method.
To gain a better understanding of model performance, we calculate the proportion of the test set with at least one label correctly predicted, yielding 0.395 on the best model (an increase of 0.11 over the EMR). The label most frequently correctly predicted is coordinating edits (111 of 137 cases), which is also the most common label in the training set. The next most correctly predicted label, proportionally, is contextualisation (75%, or 24 of 32 cases), despite not being a commonly used label. This is likely due to the additional positional information available to the model, since this label is often applied to the first utterance in a conversation. On the other hand, refutation and refuting the central point are never correctly predicted (out of 44 cases), with counterargument often mistakenly predicted instead. This is likely because of similarities between the classes and the latter being the second most common label in the training set.
We build on the best performing model to evaluate the effect of including the ordinality prediction auxiliary task.Using this model, we obtain a statistically significant improvement (P = 0.03, using the permutation test on the Jaccard score) over the best model that is not aware of the ordering of the rebuttal tactics, providing further evidence of the usefulness of the rebuttal hierarchy.We also train models using the median and minimum rebuttal levels of each utterance, and find that while these do provide an increase in the Jaccard score, the difference is not significant (P = 0.32 and P = 0.13).
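The text does not specify which permutation-test variant is used; one common choice for paired per-example scores is a sign-flipping test over the difference in Jaccard scores, sketched below with invented scores.

```python
# Paired permutation (sign-flip) test on per-example scores of two
# models: under the null, each example's score difference is equally
# likely to have either sign.
import random

def permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    rng = random.Random(seed)
    n = len(scores_a)
    observed = sum(x - y for x, y in zip(scores_a, scores_b)) / n
    count = 0
    for _ in range(n_perm):
        diff = sum((x - y) if rng.random() < 0.5 else (y - x)
                   for x, y in zip(scores_a, scores_b)) / n
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_perm  # two-sided p-value estimate

a = [0.9, 0.8, 0.85, 0.95, 0.7, 0.9]   # hypothetical Jaccard scores, model A
b = [0.6, 0.65, 0.7, 0.6, 0.55, 0.6]   # hypothetical Jaccard scores, model B
print(permutation_test(a, b) < 0.05)   # True: the gap is consistent
```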

Predicting escalation
Based on the findings of Sec 4, we believe that the dispute tactic annotations provided in the WikiTactics dataset can provide useful additional learning signals for the task of predicting escalation, as formulated by De Kock and Vlachos (2021). We therefore perform multitask training with escalation as the main task and tactics as the auxiliary task, such that the features that are predictive of dispute tactics are incorporated in the escalation predictions.
De Kock and Vlachos (2021) found that a HAN network achieved the best results on this task; thus, we reproduce their experiment and achieve a PR-AUC score of 0.40. To add the dispute tactic classification as an auxiliary task, we modify the context-aware LSTM model, which uses a HAN model to encode the context preceding an utterance (as described in Sec 6). To predict escalation for a given conversation, this HAN model is used to obtain a conversation embedding. Two further fully connected layers are added to allow for feature specialisation on the escalation prediction task. Using this model, a PR-AUC score of 0.487 is obtained, indicating that knowledge of these dispute tactics is useful for tasks beyond classifying the tactics employed.
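The PR-AUC metric reported above summarises the precision-recall curve. One standard estimate of it is average precision, sketched below in pure Python as our own illustration (the paper does not specify which estimator was used, and the function name and toy data are hypothetical):

```python
def average_precision(y_true, y_score):
    """Average precision: mean of the precision values at each rank
    where a true positive is retrieved, a common PR-AUC estimate.

    y_true: 0/1 labels (e.g. whether a conversation escalated);
    y_score: model scores, higher meaning more likely positive.
    """
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    tp = 0
    ap = 0.0
    n_pos = sum(y_true)
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / n_pos if n_pos else 0.0

# a perfect ranking scores 1.0; a weaker one scores lower
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Unlike accuracy, this metric is insensitive to the (typically large) proportion of non-escalating conversations, which is why it is preferred for imbalanced escalation prediction.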

Conclusion
In this work, we introduced a framework and dataset for dispute tactics, consisting of rebuttal and coordination strategies. We used these annotations to analyse how different tactics are used in disagreements and by different individuals, providing insights into the contexts in which personal attacks occur and the extent to which users alter their own tactics to mirror the level of rebuttal used in a conversation. We further developed multilabel models for classifying the dispute tactics used in an utterance, experimenting with both binary relevance and label powerset methods. Finally, we showed that knowledge of these tactics increases accuracy on the task of predicting escalation, indicating that the framework and dataset can be of use to researchers working on other aspects of online disagreements.
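The label powerset transformation mentioned above reduces multilabel classification to multiclass classification by treating each observed combination of labels as a single class. A minimal sketch of the transformation (our illustration, with hypothetical tactic names):

```python
def label_powerset(label_sets):
    """Map each distinct combination of labels to one class id,
    so a multiclass classifier can be trained on multilabel data."""
    mapping = {}   # frozenset of labels -> class id
    classes = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in mapping:
            mapping[key] = len(mapping)
        classes.append(mapping[key])
    return classes, mapping

# three utterances, two distinct label combinations -> two classes
classes, mapping = label_powerset([
    {"counterargument"},
    {"counterargument", "ad hominem"},
    {"counterargument"},
])
```

In contrast, binary relevance trains one independent binary classifier per label; the powerset approach can capture label co-occurrence at the cost of a larger, sparser class space.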

Limitations
Conversations on Wikipedia have been recognised as being goal-oriented (Niculae and Danescu-Niculescu-Mizil, 2016), constructive (De Kock and Vlachos, 2021) and exhibiting low levels of toxicity (Wulczyn et al., 2017). Findings and models from Talk pages may therefore not transfer well to other domains.
Another limitation of this work is the size of the dataset, which is in part due to the difficulty of the task. Future work may look into creating larger datasets using this framework; however, as we have illustrated, the annotations can provide useful additional training signals for other conversation-based tasks such as predicting escalation.
The dataset was annotated by only one annotator, which may introduce biases. For instance, cultural norms may influence what is considered to be a personal attack. To mitigate this issue, an experienced annotator was consulted in the development of the framework, and three rounds of pilot annotation and discussion were carried out between the authors and the annotator, with increasing agreement scores.
For ease of annotation, we only work with English data, despite conversation data being available in many languages on the platform. Our insights, including the preferential rebuttal ordering, might not hold for other languages and contexts.
DH0: Name-calling

DH1: Ad hominem
Attacks personal characteristics of the author rather than the argument.

DH2: Responding to tone
Criticises the writing without addressing the argument.

DH3: Contradiction
States the opposing case, with little or no reasoning / supporting evidence.

DH4: Counterargument
Contradiction with new evidence / reasoning, which does not respond to the original argument.

DH5: Refutation
Responding in part to the argument, without addressing the central point.

Bailing out
An indication that an editor is giving up on a conversation and will no longer engage.

Contextualisation
In the first utterance, an editor "sets the stage" by describing which aspect of the article they are challenging. This does not directly disagree with anyone, and is therefore a non-disagreement move.

Asking questions
Seeking to understand another editor's opinion better. This does not include rhetorical questions, which are generally disagreement moves.

Providing clarification
Answering questions or providing information which seeks to create understanding, rather than only furthering a point.

Suggesting a compromise
An attempt to find a midway between one's own point and the opposer's. This is explicitly encouraged by Wikipedia.

Coordinating edits
Wikipedia Talk pages are primarily used for goal-oriented discussions, to coordinate edits to a page. As part of disagreement threads, there is often also some discussion of these edits. This can signal that a compromise has been found.

Conceding / recanting
An explicit admission that an interlocutor is willing to relinquish their point.

I don't know
Admitting that one is uncertain. This signals that an editor is receptive to the idea that there are unknowns which may impact their argument.

Other
For utterances not covered by any other class, for instance social niceties.

Figure 1: An example from the WikiTactics dataset. Different speakers are indicated by colour. WP:ENGVAR refers to the Wikipedia English Variants policy. RfC and DR refer to Wikipedia dispute resolution procedures.

Figure 2: An example of an annotated dispute from WikiTactics. Different speakers are represented by colours. Some utterances are assigned multiple labels, which are shown as vertical lines.

Table 1: Dispute tactics used to annotate WikiTactics. For rebuttal tactics, the proposed ordering is indicated by level numbers. Levels marked with an asterisk were added to Graham's disagreement hierarchy.