On Classifying whether Two Texts are on the Same Side of an Argument

To ease the difficulty of argument stance classification, the task of same side stance classification (S3C) has been proposed. In contrast to conventional stance classification, which requires a substantial amount of domain knowledge to identify whether an argument is in favor of or against a certain issue, it has been argued that S3C only requires learning argument similarity within stances to be solved successfully. We evaluate several transformer-based approaches on the dataset of the recent S3C shared task, followed by an in-depth evaluation and error analysis of our model and of the task's hypothesis. We show that, although we achieve state-of-the-art results, our model fails to generalize both within and across topics and domains when the sampling strategy of the training and test sets is adjusted to a more adversarial scenario. Our evaluation shows that current state-of-the-art approaches cannot determine same side stance from domain-independent linguistic similarity features alone, but appear to require domain knowledge and semantic inference as well.


Introduction
Same side stance classification (S3C) is the task of predicting, for a given pair of arguments, whether both argue for the same stance (Stein et al., 2021). It abstracts from conventional stance classification, which, for an individual argument, predicts whether it argues for or against a corresponding issue. The hypothesis underlying S3C is that it can "probably be solved independently of a topic or a domain, so to speak, in a topic-agnostic fashion". Successful S3C can, for instance, help to quickly identify coherent posts in social media debates, or to quantify groups of posts with opposing stances. To advance S3C as a task in the argument mining community, this paper makes three main contributions: (1) Development of new transformer-based approaches which improve upon the state of the art. (2) Renewed assessment of the original S3C shared task dataset, and compilation of new training and test sets that enable a more realistic evaluation scenario.
(3) Compilation of an additional, hand-crafted test set consisting of adversarial cases, such as negations and references to contrary positions within single arguments, to investigate the hypothesis underlying S3C in particular. Our results indicate that current state-of-the-art models are not able to solve such cases. We conclude with recommendations on how datasets and evaluation scenarios for the S3C task could be further developed.


Related Work
Stance Classification S3C has been introduced as a shared task by Stein et al. (2021). Prior work on stance classification, such as that of Somasundaran and Wiebe (2010), Gottipati et al. (2013), and Sridhar et al. (2015), focuses more on detecting the stance towards a certain topic and only marginally on the direct comparison of two arguments. Sridhar et al. (2014) describe a collective stance classification approach using both linguistic and structural features to predict the stance of many posts in an online debate forum. It models author and post relations with a weighted graph and predicts the stance with a set of logic rules. Rosenthal and McKeown (2015) use the conversational structure of online discussion forums to detect agreement and disagreement, and Walker et al. (2012) exploit the dialogic structure in online debates to outperform content-based models. Since opinionated language in social media typically expresses a stance towards a topic, there is a close connection between stance classification and target-dependent sentiment classification, as demonstrated by Wang and Cardie (2014) and Ebrahimi et al. (2016). Stance classification in tweets was also the target of SemEval-2016 Task 6 (Mohammad et al., 2016), where most teams used n-gram features or word embeddings. Further, it gained recognition in argument mining, as demonstrated by Sobhani et al. (2015). Xu et al. (2019) introduce reason comparing networks (RCN) that identify agreement and disagreement between utterances towards a topic, leveraging reason information to cope with non-dialogic utterances. Since the S3C task authors hypothesize that textual similarity between arguments may be sufficient, the task bears structural similarity to semantic textual similarity, which has often been the topic of shared tasks (Agirre et al., 2013; Xu et al., 2015; Cer et al., 2017), and for which many datasets are available (Dolan and Brockett, 2005; Ganitkevitch et al., 2013).

S3C Shared Task
The S3C dataset (Stein et al., 2021) is derived from the args.me corpus and comprises pairs of arguments from several thousand debates about one of two topics, namely abortion and gay marriage. The arguments have been retrieved from online debate portals. Argument pairs were sampled from single arguments that occurred within the same debate context, and binary labels were assigned according to whether or not the two arguments take the same stance. Two tasks have been defined based on this data: within, where training and test sets contain pairs from both topics, and cross, where the training set is composed of arguments from the abortion topic and the test set only contains gay marriage-related argument pairs. Table 1 (Exp. 1) shows the statistics of our resampled dataset. Individual arguments recur in different pairings, and, for the within task, the training and the test set overlap significantly, although the pairings differ. In the official S3C shared task, the winning models, including that of Ollinger et al. (2021), used BERT-based sequence pair classification. They find that longer sequences yield better results, and that truncating arguments longer than BERT's maximum sequence length has a negative impact.

Experimental Setup
Following the results of the shared task, transformer-based language models such as BERT are currently the most successful approach to S3C. Based on this previous work, we experiment with more recent transformers, carrying out the following three experiments.

Experiment 1: Optimization We reproduce the shared task in its original form as well as the best-performing approach of the S3C shared task by Ollinger et al. (2021), which serves as a baseline for comparison and represents the state of the art. The approach is based on BERT with pre-trained weights for the English language. Argument pairs are fed as a sequence pair into the model, and the pooled output of the last layer is used for binary classification. This architecture is fine-tuned with binary cross-entropy loss for three epochs and a learning rate of 5e−5. In addition, we experiment with newer transformer-based pre-trained networks: RoBERTa (Liu et al., 2019), which improved BERT by using larger and cleaner datasets for pre-training; XLNet (Yang et al., 2019), which employs autoregressive pre-training; DistilBERT (Sanh et al., 2019), which utilizes knowledge distillation during pre-training; and ALBERT (Lan et al., 2020), which, among other things, uses embedding matrix compression and sentence order prediction as a pre-training task.
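For illustration, the following is a minimal sketch of such a sequence-pair fine-tuning setup with the Hugging Face transformers library (Wolf et al., 2020); the model name, the toy training pair, and the single-batch training loop are illustrative assumptions, not the exact training code used in our experiments.

    # Minimal sketch of sequence-pair fine-tuning for S3C.
    # Model name, example pair, and single-batch loop are illustrative only.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "bert-base-uncased"   # swap for roberta-base, albert-base-v2, etc.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

    def encode(pairs, max_len=512):
        # pairs: list of (argument_a, argument_b, same_side_label) triples
        enc = tokenizer(
            [a for a, _, _ in pairs],
            [b for _, b, _ in pairs],
            truncation=True, padding="max_length", max_length=max_len,
            return_tensors="pt",
        )
        enc["labels"] = torch.tensor([y for _, _, y in pairs])
        return enc

    # hypothetical training data: (argument_a, argument_b, 1 = same side / 0 = different)
    train_pairs = [("Banning X is wrong.", "X should be allowed.", 1)]
    batch = encode(train_pairs)

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    for epoch in range(3):          # three epochs, as described above
        outputs = model(**batch)    # cross-entropy loss over the pooled [CLS] output
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()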
Experiment 2: Bias Control We are not only interested in determining how well current transformer models solve the S3C task, but also, in particular, in the task's setup. During our first experiments, we noticed certain properties of the official dataset which may lead to unrealistically optimistic results. The S3C dataset is derived from arguments scraped from public debate pages and categorized as either pro or con stance towards a certain issue. Pairs for S3C were sampled from all possible pairs of the n unique arguments within a single debate, and then randomly split into separate training and test sets.

Claim: The gay marriage ban goes against human rights.

Type                 Variation                                                                        Same side?
Negation             Banning gay marriage is not a violation of the human rights.                     false
Paraphrase           Basic rights, including the right to marry, apply to homosexual couples, too.    true
Paraphrase-Negation  Denying gays the right to marry does not violate their human rights.             false
Argument             Denying gays the right to adopt children violates their human rights.            true
Argument-Negation    Denying gays the right to adopt children does not violate their human rights.    false
Citation             Some say banning gay marriage goes against their human rights. And it sure is.   true
Citation-Negation    Some say banning gay marriage goes against their human rights. But it is not.    false

Table 2: An example claim along with hand-crafted variations.
While this procedure ensures that pairs do not overlap between training and test for the within task, there is a severe overlap of the individual arguments. Also, the single debates from which pairs are sampled vary greatly in size. To test the influence of these two properties on overfitting, we first create an extended set containing all n(n − 1)/2 possible argument pairs per debate, and then sample three new dataset splits of roughly comparable size, but with varying degrees of overlap of single arguments (cf. Table 1). The random split replicates the sampling strategy of the original S3C task. The two disjoint splits ensure that (almost) no single argument seen during training recurs in a test set pair; this is achieved by splitting either across distinct debates (within) or across topics (cross). The last split (single) creates a test set in which exactly one argument of each pair is also contained in the training set.
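A sketch of this resampling strategy is shown below: enumerate all n(n − 1)/2 argument pairs per debate and build a split in which no individual argument seen during training reappears in a test pair (the disjoint within variant). The debate data structure and helper names are illustrative assumptions, not our actual preprocessing code.

    # Sketch of pair generation and a debate-disjoint split; names are illustrative.
    from itertools import combinations
    import random

    def make_pairs(debate):
        # debate: list of (argument_text, stance) tuples from one debate
        return [
            (a, b, int(sa == sb))              # 1 = same side, 0 = different side
            for (a, sa), (b, sb) in combinations(debate, 2)
        ]

    def disjoint_within_split(debates, test_ratio=0.2, seed=0):
        # Hold out entire debates, so no training argument recurs in any test pair.
        rng = random.Random(seed)
        debates = list(debates)
        rng.shuffle(debates)
        n_test = max(1, int(len(debates) * test_ratio))
        test_debates, train_debates = debates[:n_test], debates[n_test:]
        train = [p for d in train_debates for p in make_pairs(d)]
        test = [p for d in test_debates for p in make_pairs(d)]
        return train, test

    # Toy usage with two hypothetical debates:
    debates = [
        [("Abortion should be legal.", "pro"), ("Abortion is wrong.", "con")],
        [("Gay marriage is a right.", "pro"), ("Marriage is between man and woman.", "con"),
         ("Love is love.", "pro")],
    ]
    train_pairs, test_pairs = disjoint_within_split(debates, test_ratio=0.5)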

Experiment 3: Adversarial Examples
In the third experiment, we manually create an artificial test set (Hakimi et al., 2021) to systematically probe the ability of our best model to solve different types of "adversarial" cases of same side stance prediction. We select 25 distinct arguments from the "gay marriage" topic that are short and express their stance clearly. For each selected argument, we construct new arguments of four distinct types to obtain pairs with the same and with an opposing stance. The first type, Negation, is a simple negation of the argument. Paraphrase alters important words of the argument into synonymous expressions with the same stance. The third type, Argument, uses an argument from the same topic and stance that is semantically completely different from the first one. Citation repeats or summarizes the first argument and then expresses agreement or rejection (a case frequently occurring in the dataset). The last three types are also formulated in a negated version to create additional test instances for the opposite stance. This results in a test set of 175 cases (see Table 2).
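As a small illustration, adversarial test pairs for one claim can be assembled from its hand-crafted variations as sketched below; the texts are taken from Table 2, while the data layout and pairing loop are our own illustrative reading of the procedure, not the construction script itself.

    # Illustrative construction of adversarial test pairs from one claim and its
    # hand-crafted variations (texts from Table 2); layout and loop are assumptions.
    claim = "The gay marriage ban goes against human rights."

    variations = {
        "Negation": ("Banning gay marriage is not a violation of the human rights.", False),
        "Paraphrase": ("Basic rights, including the right to marry, apply to homosexual couples, too.", True),
        "Paraphrase-Negation": ("Denying gays the right to marry does not violate their human rights.", False),
        "Argument": ("Denying gays the right to adopt children violates their human rights.", True),
        "Argument-Negation": ("Denying gays the right to adopt children does not violate their human rights.", False),
        "Citation": ("Some say banning gay marriage goes against their human rights. And it sure is.", True),
        "Citation-Negation": ("Some say banning gay marriage goes against their human rights. But it is not.", False),
    }

    # One (claim, variation, same_side) test instance per variation type:
    test_cases = [(claim, text, same_side) for text, same_side in variations.values()]
    assert len(test_cases) == 7   # 25 claims x 7 variations = 175 cases in total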

Evaluation
We report accuracy (A) and macro-F1 scores (F1) as experiment results.
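Both metrics can be computed with standard scikit-learn calls, as in the small sketch below; the toy labels are made up for illustration and this is not our actual evaluation script.

    # Accuracy and macro-F1 over binary same-side predictions; labels are illustrative.
    from sklearn.metrics import accuracy_score, f1_score

    y_true = [1, 0, 1, 1, 0]   # gold: 1 = same side, 0 = different side
    y_pred = [1, 0, 0, 1, 1]   # model predictions
    print("A  =", accuracy_score(y_true, y_pred))
    print("F1 =", f1_score(y_true, y_pred, average="macro"))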
Experiment 1: For the within task, we randomly split the official within training dataset into 90% for training and 10% for testing. For the cross task, we select all within pairs of the official training dataset assigned to the abortion topic as training data, and all gay marriage pairs for testing. For both tasks, another 10% of the sampled training sets is used as validation set during our experiments. This strategy creates an evaluation scenario equivalent to the official S3C shared task, but with slightly less training data.

Experiment 2: Since the first two splitting strategies involve a random selection, we repeat the selection process five times and report average results. All tested scenarios surpass the majority baseline, confirming that the model actually learns to recognize (dis-)agreement of arguments. In accordance with the results from Experiment 1, S3C works accurately (86.6% F1) for the randomly composed test set. However, for the two disjoint datasets with no overlap of individual arguments, the performance drops severely (to about 62% F1). The performance for within does not even surpass the cross performance, although the latter model is trained on a completely different topic. And in the single scenario, where one argument of each test pair has been seen during training, the performance remains rather low at 65% F1.

Experiment 3: A close inspection of misclassified pairs from the disjoint test set of the previous experiment reveals typical cases which require certain logical inference capabilities to obtain a correct same side stance prediction. Based on this, we manually crafted the test set for the third experiment. For these adversarial cases, even our best model achieves only 43.4% accuracy (41.7% F1-score). The confusion matrices in Figure 1 show that the model successfully captures shallow semantic similarity between arguments (paraphrase). In contrast, it is not capable of predicting the semantically more challenging types (argument and citation), and negation, which leads to an opposing stance, is overlooked completely.

Discussion
The experiments show that S3C performance decreases drastically for unseen arguments (Experiment 2) and for difficult, adversarial cases (Experiment 3), which undermines our confidence in the results from Experiment 1. Considering argument pairs composed of previously unseen individual arguments as the common scenario for S3C, the high performance on the official shared task dataset appears too optimistic. How can these differences between the original and our new scenarios be explained? Let us recall: pairs of the original S3C dataset originated from single debates. It must be noted that debate size, i.e., the number of argument pairs sampled from a single debate, follows a power-law distribution (e.g., the Experiment 1 training set contains 17,187 pairs from combinations of 251 arguments from the largest debate alone). Fine-tuning a transformer model then causes recurring arguments of the same stance, presented in different combinations, to be attracted to each other in the embedding space, whereas arguments of opposing stance from one debate are repelled. If enough combinations of argument pairs from one debate are presented to the network, the embeddings of the pro and the con stance eventually form clearly separable clusters. This results in a task-specific overfitting to certain debates. Each of the n unique arguments from one debate occurs up to n − 1 times in the training pairs. The model's performance thus correlates with the size of a debate when test pairs are sampled from the same debates as the training pairs. In fact, slicing the results from Experiment 1 across different debate sizes reveals that test pairs originating from the five largest debates are predicted with nearly 100% accuracy. For smaller debates, the accuracy drops to the level of the non-overlapping dataset splits.
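A sketch of the per-debate-size slicing mentioned above is given below: test pairs are grouped by the size of the debate they were sampled from, and accuracy is computed per group. The record fields ('debate_id', 'label', 'pred') are illustrative assumptions, not the actual analysis code.

    # Group test pairs by originating debate size and compute accuracy per group.
    from collections import defaultdict

    def accuracy_by_debate_size(test_records, debate_sizes):
        # test_records: dicts with 'debate_id', 'label', 'pred'
        # debate_sizes: dict mapping debate_id -> number of unique arguments
        buckets = defaultdict(lambda: [0, 0])             # size -> [correct, total]
        for r in test_records:
            size = debate_sizes[r["debate_id"]]
            buckets[size][0] += int(r["pred"] == r["label"])
            buckets[size][1] += 1
        return {size: correct / total for size, (correct, total) in sorted(buckets.items())}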

Conclusion
We carry out experiments to investigate the same side stance classification task. Our results show that recent transformer models improve over the state of the art in the recent S3C shared task. With 73.7% F1-score, the best performance is achieved by the ALBERTv2 model. We find, however, that the shared task's experimental setup suffers from overfitting, yielding overly optimistic results. A manually crafted test set shows that all models fail on adversarial cases involving negation and citation of opposing arguments. From these results, three conclusions can be drawn for the improvement of the same side stance classification task: (1) For a more realistic evaluation scenario, training and test set pairs should be sampled from distinct sets of arguments. (2) When the training set involves recurring arguments in different pairings, machine learning models should pay particular attention to measures against overfitting. For instance, a validation set should not be randomly sampled from the training set. (3) The hypothesis underlying the S3C task was that it can be solved in a topic-agnostic fashion. However, even our best model struggles to accurately predict the cross-topic scenario, or complex cases involving different arguments expressing the same stance. This finding suggests that the basic S3C hypothesis is not entirely true. For such cases, topic-specific knowledge and a deeper semantic representation of individual arguments than that encoded by current transformer models would be needed.


Acknowledgments
This work was funded by the project "A framework for argument mining and evaluation" (project no. 406289255) within the priority program "RATIO: Robust Argumentation Machines" (SPP 1999).

Ethics Statement
We have used the S3C dataset (Ajjour et al., 2020; Stein et al., 2021) without any major modifications to the data contained. The dataset is a collection of opinionated texts obtained from publicly available and appropriately acknowledged sources, respecting their terms and conditions. We did not employ any author-specific features in our approaches and instead process only the corresponding arguments, even though these represent personal views of anonymous authors. Our artificial dataset is based on a manually selected small subset of the S3C dataset, which we used to formulate custom arguments for testing different argument pair types. Our aim was solely to generate semantically well-formed arguments with the specific characteristics of the types we introduced, without expressing our own stance on the underlying issues. By reusing pre-trained models from the Huggingface.co transformers library (Wolf et al., 2020), our approach might have inherited some forms of bias; we did not perform any evaluation of this potential problem. It is worth noting that our experiments show that our approach is far from ready to be used in a product; our goal is to advance research on this task. In terms of computational resources, we restricted ourselves to small variants of pre-trained models that can be fine-tuned with relatively few resources and are accessible to the majority of researchers.
The proposed technology is applicable to an English-speaking audience.

A.1 Experimental Setup
All experiments were performed on a single GeForce RTX 2080 Ti with 11 GB of memory. The time per epoch of fine-tuning depends largely on the batch size, which in turn depends on the sequence length and model architecture, and averages about 30-90 minutes for the models and datasets we tested. We kept most hyperparameters at their default values and focused on different settings for the maximum (input) sequence length, batch size / gradient accumulation steps, and number of fine-tuning epochs. Another important factor for prediction performance and fine-tuning duration is the composition and amount of the training data, which is already factored into the times mentioned above.
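For illustration, the kind of settings varied here can be expressed as a Hugging Face Trainer configuration, as sketched below; the concrete values are examples consistent with the ranges discussed, not the exact settings used in our runs, and the output directory name is hypothetical.

    # Example fine-tuning configuration; values are illustrative, not the paper's exact settings.
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="s3c-albert-base-v2",    # hypothetical output directory
        num_train_epochs=3,                 # epochs of fine-tuning
        learning_rate=5e-5,
        per_device_train_batch_size=8,      # limited by 11 GB of GPU memory at length 512
        gradient_accumulation_steps=4,      # effective batch size of 32
        save_strategy="epoch",
    )
    # The maximum input sequence length (e.g., 256 or 512) is set at tokenization time.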

A.2 Tables
Statistics Table 5 shows statistics about the number of argument pairs and unique arguments, which illustrate that argument pairs have to reuse single arguments multiple times. Details about tokenization and sentence segmentation can be seen in Table 6. We suspect that models overfit faster in this setting, as can be seen in Tables 7 and 9. We included results for different sequence lengths and architectures; ALBERT and BERT consistently outperformed the other architectures. The full listing of models trained with various sequence lengths can be found in Table 7. Results are reported on our recompiled test set, not on the currently unpublished S3C task gold labels. A sequence length of 256 was used for experimentation and 512 for the final results. We observed that the performance differences between models are relatively similar and transfer across sequence lengths. For the cross task, we observed a drop of 20% accuracy (F1) between validation and test sets, whereas there was almost no drop for the within subtask. The drop for cross can be explained by the test samples stemming from a completely unseen topic, while for within the topic, and thus much of the vocabulary, is known and only the test samples themselves are unseen. This also suggests that models trained on a spread of different topics are more generic and more robust to unseen samples.

Official Test Set
We were able to use the yet-to-be-published S3C task test set to evaluate our models. The test data used by Ollinger et al. (2021) was no longer available, so we could only compare results based on similar experimental setups. Evaluating the same fine-tuned models on the shared task's (hidden) test labels reveals more distinct differences in their performance. A full listing of the same metrics used in the shared task leaderboard can be seen in Table 9. We also included the official leaderboard results of the best-performing model by Ollinger et al. (2021). Similar to Table 7, the ALBERT-base-v2 models perform best, followed by the BERT-base models. Other architectures such as the Distil* variants, Electra, etc. show drastically worse results. The most probable cause for this difference compared to the results on our own test data split (10% of the training data) is overfitting while fine-tuning the models.