Argument Undermining: Counter-Argument Generation by Attacking Weak Premises

Text generation has recently received much attention in computational argumentation research. A particularly challenging task is the generation of counter-arguments. So far, approaches have primarily focused on rebutting a given conclusion, yet other ways to counter an argument exist. In this work, we go beyond previous research by exploring argument undermining, that is, countering an argument by attacking one of its premises. We hypothesize that identifying the argument's weak premises is key to effective countering. Accordingly, we propose a pipeline approach that first assesses the premises' strength and then generates a counter-argument targeting the weak ones. On the one hand, both manual and automatic evaluation provide evidence for the importance of identifying weak premises in counter-argument generation. On the other hand, when considering correctness and content richness, human annotators favored our approach over state-of-the-art counter-argument generation.


Introduction
Following Walton (2009), a counter-argument can be defined as an attack on a specific argument by arguing against either its claim (called rebuttal), the validity of reasoning of its premises toward its claim (undercut), or the validity of one of its premises (undermining). Not only have the mining and retrieval of counter-arguments been studied (Peldszus and Stede, 2015; Wachsmuth et al., 2018), but recent works have also tackled the generation of counter-arguments. Among these, the task of contrastive claim generation has been studied both in a partly rule-based manner (Bilu et al., 2015) and in a data-driven manner. Moreover, Hua and Wang (2019) proposed a neural approach. So far, however, research has focused only on rebutting a given argument, ignoring the other aforementioned types. We expand this research by studying to what extent argument undermining can be utilized in counter-argument generation.
Claim (title): Feminism is in the third wave. In countries, such as America, it has caused nothing but trouble.
Premises (sentences): I'm going to be bringing up several feminist arguments I have heard myself. First off, the all-dreaded wage gap. It has in fact been illegal to pay women less than men since the early 1960s. [...] Secondly the pink tax. Women's products are of course going to cost more than men's. They use entirely different chemicals specifically made to cater to softer skin [...] and for the fourth point, although I could go on for much longer, the feminst movement is needed elsewhere. In countries, such as Iraq, India, and Saudia Arabia. The feminst movement being in a country where women aren't being forced to cover their entire bodies, aren't being sold off with doweries, and aren't being oppressed, is downright absurd.
Counter-argument: The fact that other women have it worse doesn't mean that women don't have it bad elsewhere. For example, I can be fired for being gay in 29 out of 50 states in the US. The fact that people are stoned for being gay in Brunei doesn't mean that isn't an example of homophobia...
Table 1: An example argument (claim + premises) and a counter-argument in response to it, taken from Reddit changemyview. The italicized premise part was quoted by the user who stated the counter-argument.
In argument undermining, the validity of some premises is questioned. Such a phenomenon can often be observed in online discussions on social media. For example, in the discussion excerpt in Table 1, taken from the Reddit forum changemyview, a user contesting the whole stated argument (claim and premises) refers to the specific premise highlighted in the table (on Reddit, it is the quoted part of the text). This implies two steps: first, to identify a potentially weak and thus attackable premise in the argument, and second, to counter it.
Figure 1 (content): Argument: "Human life is invaluable. A fetus is a human. To abort means to kill. Abortion is murder." Weak-premise ranking (argument with highlighted premises): #1 "A fetus is a human." #2 "To abort means to kill." Premise attack generation (counter-argument): "A fetus is not a human. It is just cells in the womb, so abortion is not killing."
Figure 1: Argument undermining: Instead of countering a given argument directly, our approach first ranks the argument's premises by weakness. Then, an attack focused on the weakest premises is generated.
In this work, we propose to tackle the task of counter-argument generation by attacking one of the weak premises of an argument. We hypothesize that identifying a weak premise is key to effective counter-argument generation, especially when the argument is of high complexity, comprising multiple interlinked claims and premises, making it hard to comprehend the argument as a single unit. Figure 1 illustrates our two-step pipeline approach: it detects premises that may be attackable and then generates a counter-argument addressing one or more of these premises. To identify weak premises, we build on the work of Jo et al. (2020), who classify attackable sentences using BERT. In contrast, we rank premises based on their attackability concerning the argument's main claim, utilizing the learning-to-rank approach of Han et al. (2020). For the second step, following previous work, we fine-tune a pre-trained transformer-based language model (Radford et al., 2018) in a multi-task learning setting: next-token prediction and counter-argument classification.
In our experiments, we make use of the changemyview (CMV) dataset of Jo et al. (2020), where each instance is a post consisting of a title (say, an argument's claim) and a text (the argument's premises). Some of the sentences in the text are quoted by comments to the post. These sentences are considered to be weak/attackable premises. We further extend the dataset by collecting texts from comments, defining counter-arguments.
To analyze our approach, we evaluate both of its steps individually as well as in combination. In particular, we first compare our ranking model for detecting attackable premises to Jo et al. (2020), observing significant improvements in effectiveness. Second, given the ground-truth attackable premise (the quoted sentences), we evaluate our counter-argument generation model against several baselines. Our automatic evaluation provides evidence that training the model with the weak premise annotated significantly boosts the scores across all metrics. We additionally confirm these results by a manual evaluation, indicating that our approach is better than the baseline in 56% of the cases. Finally, we apply our generation model based on the automatically detected weak premises and compare it to the approach of Hua and Wang (2019), which generates counter-arguments with opposing stance to the argument (i.e., rebuttals). While the automatic evaluation here was not in favor of our approach, the manual evaluation gave evidence of the favorability of our approach on all three tested quality dimensions. To summarize, our contributions are:
• A model for detecting premise attackability, achieving state-of-the-art effectiveness.
• A new approach to counter-argument generation that identifies and attacks weak premises.
• Empirical evidence of the importance of considering a specific attackable premise in the argument when generating a counter-argument.

Related Work
Recently, text generation has gained much interest in computational argumentation, both for single claims and complete arguments. Bilu et al. (2015) composed opposing claims combining rules with classifiers, whereas an analogous task has also been tackled with neural methods. Alshomary et al. (2020) reconstructed implicit claims from argument premises using triplet neural networks, and Gretz et al. (2020) explored the ability of GPT-2 to generate claims on topics. Recently, Alshomary et al. (2021) studied how to encode specific beliefs into generated claims. Sato et al. (2015) generated full arguments in a largely rule-based way. El Baff et al. (2019) modeled argument synthesis as a language modeling task, and Schiller et al. (2020) studied the neural generation of arguments on a topic with controlled aspects and stance. Unlike all these, we deal with counter-arguments.
Research exists for mining attack relations (Cocarascu and Toni, 2017; Chakrabarty et al., 2019; Orbach et al., 2020), mining counter-considerations from text (Peldszus and Stede, 2015), and retrieving counter-arguments (Wachsmuth et al., 2018; Orbach et al., 2019). However, only the consecutive works of Hua and Wang (2018, 2019) addressed the generation task. Their latest neural approach takes an argument or claim as input and generates a counter-argument rebutting it. In contrast, we consider countering an argument by attacking one of its premises, known as undermining (Walton, 2009).
Part of our approach is to identify attackable premises, which can be studied from an argument quality perspective. That is, a premise is attackable when it lacks specific quality criteria. A significant body of research has studied argument quality assessment, with a comprehensive survey of quality criteria presented in Wachsmuth et al. (2017). Implicitly, we target criteria such as a premise's acceptability or relevance. Still, we follow Jo et al. (2020) in deriving attackability from the sentences of posts that users in the Reddit forum CMV attack. These sentences represent premises supporting the claim encoded in the post's title. The authors experimented with different features that potentially reflect weaknesses in the premises. Their best model for identifying attackable premises is a BERT-based classifier. We use their data to learn weak premise identification, but we address it as a learning-to-rank task.
As for text generation, significant advances have been made through fine-tuning large pre-trained language models (Solaiman et al., 2019) on target tasks. We also benefit from this by utilizing a pre-trained transformer-based language model (Devlin et al., 2018), which we fine-tune in a multi-task fashion similar to previous work.

Approach
As sketched in Figure 1, our pipeline approach counters an argument by attacking the validity of one of its potentially weak premises. This section presents the two main steps of our approach: first, the ranking of weak and thus attackable premises, and second, the generation of the weak premises' attack.

Weak-Premise Ranking
Given an argument in the form of a claim and a set of premises, the task is to identify the argument's attackable premises. Unlike previous work (Jo et al., 2020), we model the task as a ranking task instead of a classification task: for each argument, we learn to rank its premises by their weakness with respect to the claim. Our hypothesis here is that the attackability of a premise can be better learned when considering both the claim and the other premises of the argument.
We operationalize the weak-premise ranking similar to the ranking approach of Han et al. (2020). In particular, given a set of premises and the claim, we first represent each premise by concatenating its tokens with the claim's tokens, separated by special tokens [cls] and [sep]:

[cls] claim_tokens [sep] premise_tokens [sep]
Next, the resulting sequences are passed through a BERT model to obtain a vector representation for every premise. Each vector is then projected through a dense layer to get a score ŷ that reflects the weakness of the premise. Finally, a list-wise objective function (we use a Softmax loss) is optimized jointly on all premises of an argument as follows:
\[ \mathcal{L}(y, \hat{y}) = - \sum_{i=1}^{n} y_i \log \frac{\exp(\hat{y}_i)}{\sum_{j=1}^{n} \exp(\hat{y}_j)} \]
where n is the number of premises in the argument, ŷ_i is the score of the i-th premise, and y_i is a binary ground-truth label reflecting whether the given premise is attackable (y_i = 1) or not (y_i = 0). Given training data, we can thus learn to rank premises by weakness.
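The ranking step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the scores stand in for the outputs of the dense layer on top of BERT, and the loss is the standard softmax cross-entropy over the premises of one argument, which we assume matches the list-wise Softmax loss mentioned in the text.

```python
import math


def build_input(claim_tokens, premise_tokens):
    """Lay out one premise as [cls] claim_tokens [sep] premise_tokens [sep]."""
    return ["[cls]", *claim_tokens, "[sep]", *premise_tokens, "[sep]"]


def softmax_listwise_loss(scores, labels):
    """Softmax cross-entropy over all premises of a single argument.

    scores: one real-valued weakness score per premise (assumed model output).
    labels: binary labels, 1 for an attackable premise and 0 otherwise.
    """
    max_score = max(scores)  # subtract the maximum for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return -sum(y * (s - log_norm) for s, y in zip(scores, labels))


def rank_premises(scores):
    """Return premise indices ordered from weakest (highest score) to strongest."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Because the normalization runs over all premises of the same argument, raising the score of one premise lowers the probability mass of the others, which is what makes the objective list-wise rather than point-wise.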

Premise Attack Generation
Given the ranking step's output, we identify the k highest-ranked premises in an argument to be attackable (in our experiments, we test k = 1 and k = 3). Then we generate a counter-argument putting the identified attackable premises into focus. To this end, we use transfer learning and fine-tune a pre-trained transformer-based generation model on our task.
Figure 2: Architecture of our approach: Given an argument, a weak premise, and a counter, three embedding representations are generated and fed to the transformer to obtain hidden states from which the language model and classification heads learn the next-token prediction and counter-argument classification tasks, respectively.
In our fine-tuning process, the input is a sequence of tokens created from two segments, the argument and the counter-argument:

[bos] arg_tokens [counter] counter_tokens [eos]
The final token embedding is then a result of concatenating three embeddings: word and positional embeddings learned in the pre-training process, as well as a token-type embedding learned in the fine-tuning process. Here, the token type reflects whether the token belongs to the argument in general, to a weak premise, or to the counter-argument. We train our model jointly on two tasks:
• Next-token prediction. Given a sequence of tokens, predict the next one.
• Counter-argument classification. Given two concatenated segments, decide whether the second is a counter-argument to the first.
The second task is similar to the next-sentence prediction task introduced by Devlin et al. (2018), which was shown to be beneficial for multiple representation-learning tasks. Figure 2 shows the architecture of our generation model. For training, we augment a given set of training sequences D by adding distracting sequences, in which we use, for each argument and its weak premise, a non-relevant text instead of the counter-argument. Given a sequence of tokens d = (t_1, t_2, ..., t_n) ∈ D, we then optimize the following two loss functions jointly with equal weighting:
\[ \mathcal{L}_1(\Theta) = \sum_{i} \log P(t_i \mid t_{i-k}, \dots, t_{i-1}; \Theta) \]
\[ \mathcal{L}_2(\Theta) = \sum_{j} \log P(y_j \mid t_1, \dots, t_n; \Theta) \]
where Θ denotes the weights of the model, k is the number of previous tokens, and y_j is the ground-truth label of the sequence, indicating if the second segment of the sequence is a counter or not.
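To make the input layout and the joint training signal more concrete, the following Python sketch builds the token sequence with token types and combines the two task losses with equal weight. It is only an illustration under stated assumptions: the token-type ids, the type assigned to the special tokens, the averaging of the language-modeling loss over tokens, and both function names are hypothetical choices, and the probabilities would come from the fine-tuned model in practice.

```python
import math

# Hypothetical token-type ids: argument in general, weak premise, counter-argument.
ARG, WEAK, COUNTER = 0, 1, 2


def build_sequence(premises, weak_indices, counter_tokens):
    """Build [bos] arg_tokens [counter] counter_tokens [eos] with token types.

    premises: list of token lists, one per sentence of the argument.
    weak_indices: set of positions in `premises` that are weak premises.
    """
    tokens, types = ["[bos]"], [ARG]
    for i, sentence in enumerate(premises):
        tokens.extend(sentence)
        types.extend([WEAK if i in weak_indices else ARG] * len(sentence))
    tokens.append("[counter]")
    types.append(COUNTER)
    tokens.extend(counter_tokens)
    types.extend([COUNTER] * len(counter_tokens))
    tokens.append("[eos]")
    types.append(COUNTER)
    return tokens, types


def joint_loss(next_token_probs, counter_prob, is_counter):
    """Equally weighted sum of the two negative log-likelihoods.

    next_token_probs: probability the model assigns to each gold next token.
    counter_prob: probability that the second segment is a counter-argument.
    is_counter: True for a real counter, False for a distracting sequence.
    """
    lm_loss = -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)
    label_prob = counter_prob if is_counter else 1.0 - counter_prob
    return lm_loss + (-math.log(label_prob))
```

For a distracting sequence, the same argument and weak premise are paired with a non-relevant text and `is_counter` is set to False, so the classification head is pushed to separate real counters from distractors.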

Data
As proposed, the presented approach models the task of counter-argument generation as an attack on a potentially attackable premise. Such behavior is widely observed on the Reddit forum changemyview (CMV). In particular, a user writes a new post that presents reasons supporting the pro or con stance towards a given topic (captured in the title of the post), asking the CMV community to challenge the presented view. In turn, other users quote specific segments of the post (usually a few sentences) and seek to counter them in their comment. An example has already been given in Table 1.
The structure induced by CMV defines a suitable data source for our study. Specifically, we create the following distantly supervised mapping:
• The title of the post denotes the claim of the user's argument;
• the text of the post denotes the concatenated set of the argument's premises;
• the quoted sentence(s) denote the attackable (weak) premises; and
• the quoting sentences from the comment denote the counter-argument.
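The distantly supervised mapping above can be sketched in Python as follows. The record layout is an assumption made for illustration: the `Triple` class, the `post_to_triples` function, and the `quoted` and `reply` field names are hypothetical and do not reflect the actual format of the CMV dataset.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Triple:
    claim: str                # title of the post
    premises: List[str]       # sentences of the post text
    weak_premises: List[str]  # sentences quoted by a comment
    counter: List[str]        # sentences of the comment that respond to the quote


def post_to_triples(title: str, sentences: List[str],
                    comments: List[Dict[str, List[str]]]) -> List[Triple]:
    """Create one triple per comment that quotes at least one sentence of the post."""
    triples = []
    for comment in comments:
        # Keep only quotes that actually occur in the post.
        quoted = [s for s in comment["quoted"] if s in sentences]
        if quoted and comment["reply"]:
            triples.append(Triple(title, list(sentences), quoted, list(comment["reply"])))
    return triples
```

A post with several quoting comments yields several triples that share the same claim and premises, which is one plausible reason why the number of triples exceeds the number of posts.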
In our work, we build on the CMV dataset of Jo et al. (2020), where each instance contains a post, a title, and a set of attackable sentences (those quoted in the comments). We use the same split as the authors, consisting of 25.8k posts for training, 8.7k for validation, and 8.5k for testing. We extend their dataset by further collecting the quoting sentences from the comments (i.e., the counter-arguments). The final dataset comprises 111.9k triples of argument (claim and premises), weak premise (one sentence or more), and counter-argument (a set of sentences), split into 67.6k training, 23k validation, and 22.3k testing instances.

Evaluation
In the following, we present the experiments we carried out to evaluate both steps of our approach individually as well as in a pipelined approach. On the one hand, we aim to assess the applicability of identifying weak premises in an argument and the impact of targeting them in the process of counter-argument generation. On the other hand, our goal is to assess how well counter-argument generation via undermining performs compared to other known counter-argument generation approaches.

Weak-Premise Ranking
As presented, we approached the task of finding attackable premises by learning to rank premises by their weakness with respect to the main claim.
Approach Based on the code of Han et al. (2020) available in the TensorFlow learning-to-rank framework (Pasumarthi et al., 2019), we used a list-wise optimization technique that considers the order of all premises in the same argument. We trained our ranking approach on the CMV dataset's training split and refer to it as bert-ltr below.
Baselines We compare our approach to the BERT-based classifier introduced by Jo et al. (2020), trained on the same training split using the authors' code. We use their trained model to score each premise and then rank all premises in an argument accordingly. We call this bert-classifier. Like Jo et al. (2020), we also consider a random baseline as well as a baseline that ranks premises based on sentence length.
Measures To assess the effectiveness, we followed Jo et al. (2020) in computing the precision of putting a weak premise in the first rank (P@1), as well as the accuracy of having at least one weak premise ranked in the top three (A@3).
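The two measures can be computed as in the following Python sketch. The function names and the input layout (one ranked list of premise indices and one set of gold attackable indices per argument) are assumptions made for illustration.

```python
def precision_at_1(rankings, gold):
    """Share of arguments whose top-ranked premise is a weak premise.

    rankings: per argument, premise indices ordered from weakest to strongest.
    gold: per argument, the set of indices of ground-truth attackable premises.
    """
    hits = sum(1 for ranked, weak in zip(rankings, gold) if ranked and ranked[0] in weak)
    return hits / len(rankings)


def accuracy_at_3(rankings, gold):
    """Share of arguments with at least one weak premise among the top three."""
    hits = sum(1 for ranked, weak in zip(rankings, gold)
               if any(i in weak for i in ranked[:3]))
    return hits / len(rankings)
```

By construction, A@3 is never lower than P@1 on the same rankings, since a hit at rank one is also a hit within the top three.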
Results Table 2 shows the weak-premise ranking results. We managed to almost exactly reproduce the values of Jo et al. (2020) for all three baselines. Our approach, bert-ltr, achieves the best scores according to both measures. According to a one-tailed dependent Student's t-test, the differences between the bert-ltr and bert-classifier results are significant with at least 95% confidence. These results support our hypothesis of the importance of tackling the task as a ranking task with respect to the main claim. Below, we will use our weak-premise ranking model in the overall approach, i.e., to automatically select attackable premises in an argument.

Premise Attack Generation
Next, we evaluate our hypothesis of the importance of identifying weak premises in the process of counter-argument generation. To focus on this step, we use the ground-truth weak premises in our data. These are the quoted sentences in the post, considered potentially attackable premises.
Approach We used OpenAI GPT as a pre-trained language model. We trained two versions of our generation model: our-model-w/, with an extra special token ([weak]) surrounding the attackable sentences to give an extra signal to our model, and our-model-w/o, without it. We fine-tuned both versions with the same settings using the transformers library (Wolf et al., 2020) for six epochs. We left all other hyperparameters at their default values. As mentioned, the model's input is a sequence of tokens constructed from the argument (with weak premises highlighted) and either the correct counter or a distracting sequence. We randomly select one sentence from the original post to be the distracting sequence for each input instance.
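The two data-preparation details mentioned above, surrounding attackable sentences with the [weak] token and drawing a distracting sequence from the post, can be sketched in Python as follows. The function names are hypothetical, and the sketch works on whole sentences as strings rather than on model tokens.

```python
import random


def mark_weak(sentences, weak_indices, marker="[weak]"):
    """Surround each attackable sentence with the special marker token."""
    return [f"{marker} {s} {marker}" if i in weak_indices else s
            for i, s in enumerate(sentences)]


def sample_distractor(post_sentences, rng):
    """Pick one random sentence of the original post as the distracting sequence."""
    return rng.choice(post_sentences)
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the selection of distractors reproducible across runs.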
Baseline We compare our model to a GPT-based model fine-tuned on a sequence of tokens representing a pair of an argument (title and post) and a counter-argument. We consider this as a general counter-argument generation model, trained without any consideration of weak premises. We train the baseline using the same setting as our model. We refer to it as counter-baseline.
Table 3: Premise attack generation: METEOR and BLEU scores of the output of each evaluated approach compared to the ground-truth counter sentences and the full comment (argument). Values marked with * are significantly better than counter-baseline at p < .05.
Automatic Evaluation To assess the importance of selecting attackable sentences, we evaluated the effectiveness of our model in different inference settings in terms of what is being attacked: (1) the claim of the argument, (2) a random premise, or (3) a weak premise given in the ground-truth data.
In the random setting, we randomly selected three premises from the argument, and we generated one counter for each. The final result is the average of the results for each.
We computed METEOR and BLEU scores, comparing the generated counters to (a) the exact counter sentences of the quoted weak premise and (b) the full argument. We performed this automatic evaluation on 1k posts from the test split.
Results As shown in Table 3, encoding the knowledge about weak premises as token types is sufficient, and adding an extra special token does not help. Although the differences between our best model and the baseline are not big, they are significant according to a one-tailed dependent Student's t-test with a confidence level of 95%. For both versions of our model, the best scores are achieved when considering the weak premises as the target (except for the first METEOR column). However, not all these differences are significant. This gives evidence that exploiting information about weak premises in the training of counter-argument generation approaches can improve their effectiveness.
To further assess the relationship between the generated counters and the attacked premises, we computed the proportion of covered content tokens in the weak premise for the two versions of our model and the baseline. Figure ?? shows a histogram of the percentages. Clearly, both versions of our model have higher coverage of the annotated weak premise than the baseline.
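The coverage measure described above can be approximated with the following Python sketch. The text does not specify how content tokens are determined, so the small stopword list, the regular-expression tokenizer, and the function names are assumptions made for illustration.

```python
import re

# Assumed, deliberately small stopword list; the original setup is not specified.
STOPWORDS = frozenset({"a", "an", "the", "is", "are", "to", "of", "in", "and", "or",
                       "it", "that", "this", "not"})


def content_tokens(text):
    """Lowercase the text and keep the set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS}


def premise_coverage(weak_premise, counter):
    """Proportion of the weak premise's content tokens that occur in the counter."""
    premise = content_tokens(weak_premise)
    if not premise:
        return 0.0
    return len(premise & content_tokens(counter)) / len(premise)
```

The measure is normalized by the premise rather than by the counter, so a long counter is not penalized for adding material beyond the attacked premise.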

Manual Evaluation
To analyze the generated counter-arguments more thoroughly, we carried out a manual evaluation study on a sample of 50 random examples. Two authors of the paper inspected the sample, comparing the two versions of our model. The results were in favor of our-model-w/o. Therefore, we compared only our-model-w/o against the counter-baseline. In particular, we assessed the relevance and appropriateness of the output of the two for each example. Given an argument, the highlighted premise to be attacked, and the two counters, we asked three annotators who hold an academic degree and are fluent in English (no author of this paper) to answer two questions:
1. Which text is more relevant to the highlighted premise?
2. Which text is more appropriate for being used as a counter-argument?
Results As shown in Table 4, considering the majority vote, annotators favored our model in 56% of the cases in both tasks. These results give further evidence supporting our hypothesis of the importance of identifying weak premises. Considering the given task as a ranking task, we used Kendall's τ to compute the annotators' agreement. The mean pairwise agreement was 0.41 for the relevance assessment and 0.23 for appropriateness. Clearly, assessing the text's appropriateness of being a counter-argument is more subjective and more challenging to judge than the relevance task.
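The majority-vote aggregation behind the reported preference share can be sketched in Python as follows. The labels "ours" and "baseline" and the function names are hypothetical, and the sketch assumes an odd number of annotators choosing between two systems, so that no ties occur.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most annotators for one example."""
    return Counter(votes).most_common(1)[0][0]


def preference_share(all_votes, system):
    """Share of examples in which the majority of annotators preferred `system`."""
    winners = [majority_vote(votes) for votes in all_votes]
    return winners.count(system) / len(winners)
```

With three annotators and two candidate texts, every example has a clear majority, which is why the share can be reported as a single percentage per question.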

Overall Approach
Finally, we assess the overall effectiveness of our counter-argument generation approach, that is, when we identify weak premises automatically using bert-ltr and then generate a counter-argument using our-model-w/o, focusing on the selected weak premises.
Approach Due to the limited P@1 value of our ranking model (see Table 2), we evaluate two variations of our overall approach that differ in terms of what premises to attack. The first variant attacks the weakest premise. In the second, we first generate three counters considering each of the top three weak premises. Then, we select the counter that has the most content-token overlap with the corresponding weak premise.
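The second variant, which generates one counter per top-ranked premise and keeps the one with the highest content-token overlap, can be sketched in Python as follows. The `generate` callable is a hypothetical stand-in for the fine-tuned generation model, and the stopword list and tokenizer are assumptions, since the text does not specify how content tokens are extracted.

```python
import re

# Assumed stopword list used to approximate "content tokens".
STOPWORDS = frozenset({"a", "an", "the", "is", "are", "to", "of", "and", "not"})


def content_overlap(premise, counter):
    """Number of content tokens shared by a premise and a generated counter."""
    def tokenize(text):
        return {t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS}
    return len(tokenize(premise) & tokenize(counter))


def select_counter(ranked_premises, generate, k=3):
    """Generate a counter for each of the top-k premises and keep the best one.

    ranked_premises: premises ordered from weakest to strongest.
    generate: callable mapping a premise to a counter string (stand-in for the model).
    """
    candidates = [(p, generate(p)) for p in ranked_premises[:k]]
    # max() returns the first maximal element, so ties favor the weaker premise.
    return max(candidates, key=lambda pair: content_overlap(*pair))
```

Setting `k=1` reduces the procedure to the first variant, which always attacks the single weakest premise.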
Baselines On the one hand, we compare our approach to the counter-baseline from the previous section. On the other hand, we consider the state-of-the-art counter-argument generation approach of Hua and Wang (2019), an LSTM-based Seq2seq model with two decoders, one for selecting talking points (phrases) and the other for generating the counter given the selection.
Automatic Evaluation While the approach of Hua and Wang (2019) learns from a dataset collected from the same source (CMV), it requires retrieving relevant argumentative texts with a stance opposite to the input argument. Due to the complexity of the data preparation, we decided instead to evaluate all approaches on the test split of Hua and Wang (2019). As a result, the approach of Hua and Wang (2019) is trained on their training split, while our approach is trained on our training split, and then both are evaluated on the same test split of Hua and Wang (2019). This can be considered a somewhat unfair setting for our approach due to certain domain differences, since the dataset of Hua and Wang (2019) comprises political topics only. As in the premise attack generation experiments, we generated counters for 1k examples and computed METEOR and BLEU scores of the generated counters with respect to the ground-truth counters, which are here full arguments (CMV comments).
Results Table 5 shows that our approach outperforms the counter-baseline in both settings, even with weak premises selected automatically. Considering the top-3 weak premises instead of the top-1 improves the results. The best scores are achieved by Hua and Wang, though. A reason for this may be the slight domain difference between our model's training data and the test data used for evaluation. Another observation is that the scores of both our approach and the baseline increase compared to Table 3. This is likely to be caused by the higher number of ground-truth references for each instance in the data of Hua and Wang (2019) compared to the test split of our data, making it more likely to have token overlaps.
Table 6: Overall approach: Average scores of the three annotators for the three evaluated quality dimensions of the counter-arguments generated by our approach and the one of Hua and Wang (2019). 1 is worst, 5 is best. The bottom line shows the inter-annotator agreement.
Manual Evaluation Given the known limited reliability of automatic generation evaluation, we conducted another user study to evaluate the quality of the counters generated by our model and by the approach of Hua and Wang. We evaluate the same quality dimensions used previously by Hua and Wang (2019):
• Content Richness. The diversity of aspects covered by a counter-argument.
• Correctness. The relevance of a counter-argument to the given argument and their degree of disagreement.
• Grammaticality. The grammatical correctness and fluency of a counter-argument.
We used the Upwork crowd working platform to recruit three annotators with English proficiency and experience in editorial work. 8 We asked each of them to evaluate a sample of 100 examples. Each contained an argument (claim and premises) and two counters (one of each approach). We asked the annotators to compare the counters and assess each with a score from 1 (worst) to 5 (best) for each quality dimension.

Results
The results are presented in Table 6. Unlike in the automatic evaluation, the annotators gave, on average, higher scores on all quality dimensions to our generated counters than to those of Hua and Wang. 9 Bringing knowledge from pre-trained language models (GPT) generally seems to contribute to the grammaticality and the richness of the generated counters. In terms of generating a correct counter, focusing the generation model on a specific weak premise in an argument seems to help (2.65 vs. 1.81), even though the results are far from perfect. Manual inspection revealed that far from all generated arguments are counters to exactly what is in the argument, indicating more room to work on this topic.
8 Upwork, http://upwork.com
9 We note that the scores of Hua and Wang in Table 6 are notably lower than those reported by Hua and Wang (2019). We believe this to be due to the comparison with our approach that affected the annotators' scores.
Claim: there's nothing wrong with income inequality.
Premises: billionaires like the rockefellers and trump worked hard to earn their money and provide their families with luxury. meanwhile, my grandpa didn't and because of that i am of lower-middle class status. just because i'm poorer doesn't mean i'm entitled to the cash that the rich spent years to accumulate. i simply have to swallow my pride and start at the bottom . to try and stump income inequality is to meddle with the very basis of pure , unadulterated capitalism and meritocracy
Our Counter: income inequality is precisely because of the fact that rich people have a vested interest in having a bunch of money in the first place, which leads to great wealth inequality. if you are poor but poor you have no incentive to live. by providing good quality goods and services, you are able to contribute to society better than most poor people. you're basically doing something that is expected of you, which is a wonderful trait. i think the problem is that you should at least be able to be generous towards someone without expecting to be rewarded.
Hua and Wang's Counter: this is a great example of how hard it is to invest in low income housing. it's not like it's going to end up being worse for everyone. if you don't like it, you're going to have to worry about it. the rich don't want to pay for it because they do n't have to pay taxes. they aren't going to be able to do anything about it, they just don't want.
Figure 4: Example counter-arguments generated by our approach and by the approach of Hua and Wang (2019). The italicized premise segment was identified as the weak premise by our approach.
Krippendorff's α values show that the annotators had a fair agreement on the grammaticality and correctness tasks (given the subjectiveness of the tasks), but only slight agreement on content richness. We, therefore, think that the results for the latter should not be overinterpreted.
In Figure 4, we show an example argument in favor of income inequality. Our approach considers the premise "being poor does not entitle someone to the cash of the rich people". It then generates a counter-argument on the topic of inequality, focusing on the fact that "being poor limits the ability to contribute to society". In contrast, the counter-argument generated by Hua and Wang diverges to address "low-income housing", which is less relevant to the topic. More examples of generated counters can be found in Figure 5 in the appendix.

Conclusion
In this work, we have proposed a new approach to counter-argument generation. The approach focuses on argument undermining rather than rebuttal, aiming to expand the research in this area. The underlying hypothesis is that identifying weak premises in an argument is essential for effective countering. To account for this hypothesis, our approach first ranks the argument's premises by weakness and then generates a counter-argument to attack the weakest ones.
In our experiments, we have first evaluated each step individually. We have observed state-of-the-art results in the weak-premise identification task. Our results also support the need for identifying weak premises to generate better attacks. We have also evaluated the overall approach against the state-of-the-art approach of Hua and Wang (2019). While we did not beat that approach in automatic evaluation scores, independent annotators favored the counter-arguments generated by our approach across all evaluated quality dimensions.
We conclude that our approach improves the state of the art in counter-argument generation in different respects, providing support for our hypothesis. Still, the limited manual evaluation scores imply notable room for improvement. Most importantly, controlling the stance of the generated counters is yet to be fully solved.

Ethical Statement
We acknowledge that ethical issues might arise from our work. First, we would like to ensure that we did not violate user privacy when using data from public platforms. By reusing pre-trained models, our approach might have inherited some forms of bias. Mitigating such bias is still ongoing research. It is worth noting that our experiments show that our approach is far from being ready to be used as an end technology. Our goal is to advance the research on this task.