Argue with Me Tersely: Towards Sentence-Level Counter-Argument Generation

Counter-argument generation -- a captivating area in computational linguistics -- seeks to craft statements that offer opposing views. While most research has ventured into paragraph-level generation, sentence-level counter-argument generation beckons with its unique constraints and brevity-focused challenges. Furthermore, the diverse nature of counter-arguments poses challenges for evaluating model performance solely with n-gram-based metrics. In this paper, we present the ArgTersely benchmark for sentence-level counter-argument generation, drawing on a manually annotated dataset from the ChangeMyView debate forum. We also propose Arg-LlaMA for generating high-quality counter-arguments. For better evaluation, we train a BERT-based evaluator, Arg-Judge, with human preference data. We conduct comparative experiments involving various baselines such as LlaMA, Alpaca, GPT-3, and others. The results show the competitiveness of our proposed framework and evaluator in counter-argument generation tasks. Code and data are available at https://github.com/amazingljy1206/ArgTersely.


Introduction
The counter-argument generation task aims to automatically generate a statement that has a different stance from the original argument (Toulmin, 2003; Damer, 2009). Existing works describe it as a paragraph-level generation task (Hua and Wang, 2018; Alshomary et al., 2021; Alshomary and Wachsmuth, 2023). However, sentence-level counter-argument generation can be quite different. The main challenge of sentence-level generation is to condense the counter-argument into a concise sentence. It requires identifying the key points and formulating a counter-argument in a limited space. An example of the difference between paragraph-level and sentence-level counter-argument generation is shown in Figure 1.

Topic: Can firearm restrictions reduce gun crime?
Original Argument: The government should impose stricter regulations on firearm ownership to reduce gun-related crimes.
Sentence-level: Stricter firearm regulations may limit the rights of law-abiding citizens without effectively addressing the root causes of gun-related crimes, such as mental health issues and socioeconomic disparities.
Paragraph-level: While I understand the concern for reducing gun-related crimes, it is important to consider the balance between individual rights and public safety. Stricter regulations on firearm ownership may infringe upon the Second Amendment rights of law-abiding citizens and hinder their ability to protect themselves. Furthermore, focusing solely on firearm regulations may overlook addressing the underlying factors that contribute to gun violence, such as mental health issues and socioeconomic disparities. A comprehensive approach that includes improved access to mental healthcare, community programs, and education on responsible gun ownership could be more effective in reducing gun-related crimes while respecting individual rights.

Figure 1: An example that elucidates the difference between paragraph-level and sentence-level counter-arguments. Topic words reflecting the discussion points are in bold. Words that are underlined and in the same color denote the key points shared between the two counter-arguments.
To address this challenge, we propose a benchmark, ArgTersely, for sentence-level counter-argument generation. The dataset is derived from ChangeMyView (CMV), an online debate forum, and has been annotated by humans.
Recently, large language models such as OpenAI ChatGPT and GPT-4 (Bubeck et al., 2023), PaLM (Chowdhery et al., 2023), and LlaMAs (Touvron et al., 2023a,b) have achieved great success and demonstrated remarkable performance in text generation tasks. By leveraging a pretrained language model, we propose a framework, Arg-LlaMA, to generate high-quality counter-arguments. Our framework is a pipeline comprising (1) an instruction component, (2) a language model, and (3) a filter component. The instruction component comprises multiple Chain-of-Thought (CoT; Wei et al., 2022) instructions addressing common errors in debates, along with their corresponding reasoning steps. As for the language model, we apply instruct-tuning (Wei et al., 2021) to LlaMA-7b (Touvron et al., 2023a) with the Low-Rank Adaptation (Hu et al., 2021) method. During inference, we employ multiple CoT instructions as input for the language model and utilize the filter component to select the best candidate counter-argument as the output of the system.
Previous work typically employed n-gram-based metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) to rapidly evaluate the quality of counter-argument generation (Alshomary et al., 2021; Schiller et al., 2021). However, we believe that these metrics do not effectively assess whether the generated sentences are pertinent and in line with human preferences. To this end, we propose incorporating a model-based metric, Arg-Judge, as a supplementary evaluation approach. Specifically, we trained a BERT-based (Devlin et al., 2019) model using the human preference data generated during the annotation process. In addition, we introduce a metric, ChatGPT Eval, which we obtain by using ChatGPT to score the generated sentence's stance and argument completeness. Moreover, we make the human evaluation more specific by asking human annotators to assess the outputs along five dimensions, which enables a comprehensive evaluation of model performance.
Our contributions are mainly as follows:
• We propose a benchmark, ArgTersely, for sentence-level counter-argument generation, together with a human-annotated dataset.
• We propose a counter-argument generation framework, Arg-LlaMA, which is capable of generating high-quality counter-arguments.
• We propose a novel, lightweight evaluator, Arg-Judge, which reflects the real ranking and is highly consistent with human evaluation.

Task Formulation
The task input consists of two components: a topic and an original argument.
(1) The topic, denoted as τ, explains the premise of the dialogue and the focus of the debate.
(2) The original argument, denoted as x, is a sentence containing the initial perspective or stance put forward. The objective of this task is to generate a sentence, y, that provides a coherent rebuttal response to x based on the given topic τ.
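For reference, the task can be stated compactly; the probabilistic notation below is ours and simply restates the definition above.

```latex
% Sentence-level counter-argument generation (notation ours, not the paper's):
% given a topic \tau and an original argument x, produce a single-sentence rebuttal y.
\[
  y^{\ast} \;=\; \operatorname*{arg\,max}_{y \in \mathcal{Y}_{\mathrm{sent}}}
                 P_{\theta}\!\left(y \mid \tau, x\right),
\]
% where \mathcal{Y}_{\mathrm{sent}} denotes the space of single sentences and
% P_{\theta} is the generation model.
```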

Dataset Creation
We based our dataset annotation on the CMV dataset (Tan et al., 2016). All annotators have substantial debate experience and at least a bachelor's degree. The annotation process spanned 42 days, yielding 31,197 argument-counter-argument pairs, each associated with the relevant topic. We highlight the ethical considerations during the annotation process, including potential risks, identifiable information, compensation, and annotation biases, in Section 9. The statistics of ArgTersely are shown in Table 1.

Arg-LlaMA
Figure 2 shows our proposed framework, Arg-LlaMA. It is mainly composed of two parts: 1) a language model (LM) with instruct-tuning, for generating counter-arguments, and 2) a filter, for selecting high-quality counter-arguments. We employ LoRA and instruct-tuning methods to obtain the LM. Additionally, we leverage human preference data to train the filter.
During inference, we use CoT instructions as inputs to the LM. After obtaining a series of outputs from the LM, the filter selects the best counter-argument as the output. The generation pipeline is detailed in Section 3.3.

Instruct-tuning the LlaMA
Instruction Set Creation In line with the selfinstruct (Wang et al., 2023) approach, we initially generated 148 instructions based on 10 seed instructions.Following a manual verification process, these instructions were expanded to form an Argumentation Instruction Set consisting of 2,772 instructions.Specifically, our specific implementation differs from the self-instruct method in the following aspects: 1.Our seed instructions focus on argument-related instructions, such as "Provide evidence to support the conclusion", "Point out its logical error", etc.A detailed list of specific instructions is shown in Appendix A and attached files.2. We use the ChatGPT3 to generate instructions, enabling us to generate more diverse and elaborate contexts.Low-Rank Tuning Using the above Argumentation Instruction Set and Alpaca instruction set (Taori et al., 2023), we fine-tuned LlaMA-7b model with LoRA method.LoRA maps the weight update of the self-attention module projection matrix in the Transformer (Vaswani et al., 2017) architecture to a lower dimension and then returns to the normal output dimension.In our work, we performed LoRA on all Query/Key/Value/Output projection matrices in the self-attention module.

Training the Filter
The filter component is also a language model. We designed this component to select high-quality counter-arguments from candidate sentences.

Ranking Data for Training: Our training data, named Ranking Data (RD), originates from the human preference data generated during the annotation process of the ArgTersely dataset. Given an original argument x, we assign ranking scores to candidate sentences based on the following rules:
1 = sentences selected by annotators that form a strong rebuttal relationship with x;
2 = sentences not selected by the annotators but belonging to the same conversation as x;
3 = safe replies, randomly selected from a predefined list, as listed in Appendix B;
4 = sentences sampled from other conversations.
We finally obtained 20,000 training samples and 800 testing samples; each sample consists of an original argument and four candidates. We denote the original argument as x, the candidate list as Y = [y_1, y_2, y_3, y_4], and the ranking score for y_i as s_i.

Training Task: The training task is learning to rank the candidates in the correct order. In this task, we use the ranking scores of the four candidates as the ground truth, with higher scores indicating lower quality.
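For concreteness, one Ranking Data sample could be assembled as follows; the field names and the candidate sources passed in are illustrative assumptions, not the released data schema. Only the 1-4 scoring rules come from the description above.

```python
# Illustrative construction of one Ranking Data (RD) sample. Field names and the
# candidate sources are assumptions; the 1-4 scores (higher = lower quality)
# follow the rules described above.
import random

SAFE_REPLIES = [  # placeholder stand-ins for the predefined list in Appendix B
    "I see your point, but I disagree.",
    "That is an interesting view, but I am not convinced.",
]

def build_rd_sample(original_argument, selected_rebuttal, unselected_same_thread, other_conversations):
    candidates = [
        (selected_rebuttal, 1),                   # 1 = annotator-selected strong rebuttal
        (unselected_same_thread, 2),              # 2 = same conversation, not selected
        (random.choice(SAFE_REPLIES), 3),         # 3 = generic safe reply
        (random.choice(other_conversations), 4),  # 4 = sampled from another conversation
    ]
    random.shuffle(candidates)
    return {
        "x": original_argument,
        "Y": [y for y, _ in candidates],
        "scores": [s for _, s in candidates],
    }
```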
To optimize the parameters θ of the filter, we first used the parameters of BERT-base (Devlin et al., 2019) to initialize it. The loss function we employed is the cross-entropy loss:
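The equation itself did not survive extraction; one plausible form, consistent with treating each candidate as a four-way classification over its ranking label, is sketched below and should not be read as the paper's exact objective.

```latex
% Plausible reconstruction (not necessarily the paper's exact equation):
% cross-entropy over the four ranking labels, with s_i the gold score of candidate y_i.
\[
  \mathcal{L}(\theta) \;=\; -\sum_{i=1}^{4}\sum_{c=1}^{4}
      \mathbb{1}\!\left[s_i = c\right]\,
      \log P_{\theta}\!\left(c \mid x, y_i\right)
\]
```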

Generation Pipeline
The generation pipeline consists of three steps: 1) provide CoT instructions to guide the LM, 2) use the LM to generate outputs based on the instructions, and 3) apply the filter to select the most appropriate counter-argument as the final result.
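The three steps can be summarized in a short sketch; the function names and instruction strings below are illustrative stand-ins (the actual CoT templates are in Appendix C), not the released implementation.

```python
# Sketch of the Arg-LlaMA generation pipeline. The instruction strings are hypothetical
# stand-ins for the Appendix C templates; `lm_generate` and `filter_score` are assumed
# wrappers around the instruct-tuned LM and the trained filter.
COT_INSTRUCTIONS = [
    "Step 1: identify a factual error in the argument. Step 2: rebut it in one sentence.",
    "Step 1: identify a logical fallacy in the argument. Step 2: rebut it in one sentence.",
    "Step 1: identify confirmation bias in the argument. Step 2: rebut it in one sentence.",
]

def generate_counter_argument(topic, original_argument, lm_generate, filter_score):
    # 1) build one CoT prompt per instruction
    prompts = [
        f"Topic: {topic}\nArgument: {original_argument}\n{instruction}"
        for instruction in COT_INSTRUCTIONS
    ]
    # 2) the instruct-tuned LM produces one candidate counter-argument per prompt
    candidates = [lm_generate(p) for p in prompts]
    # 3) the filter scores each (argument, candidate) pair; the best candidate is returned
    return max(candidates, key=lambda y: filter_score(original_argument, y))
```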
CoT Instruction: Our generation pipeline starts with a series of CoT instructions. We propose CoT instructions to guide the model in generating realistic and logical arguments through multi-step reasoning. Based on Kee's (2006) debating theory, we design few-shot and multi-step reasoning templates for several common errors in debate. We roughly divide common errors into the following categories:
• Factual Error: a mistake in the presentation of a fact.
• Logical Fallacy: errors in reasoning that undermine the validity of an argument.
• Confirmation Bias: errors of selectively interpreting information in a way that supports existing hypotheses.
The specific formats of these instructions are listed in Appendix C. During inference, we provide the LM with a set of instructions that correspond to the aforementioned errors for generating the counter-arguments.

LM: The LM serves two roles. First, it acts as an error identifier, tasked with identifying errors within the original argument. Second, it generates a candidate counter-argument for each instruction provided.
Filter: After running the LM with the various CoT instructions, we obtain a set of candidate counter-arguments Ŷ = [ŷ1, ŷ2, ..., ŷn]. The purpose of the filter is to select the candidate that maximizes the probability as the output of the system: y = argmax_{ŷi ∈ Ŷ} P_θ(ŷi | x).

Baselines
We compare our framework with the following baselines:
• DialoGPT (Zhang et al., 2020): a decoder-only language model trained on an online dialogue corpus.
• LlaMA (Touvron et al., 2023a): a collection of models trained on publicly available datasets; we use LlaMA-7b.
• Alpaca-LoRA (Taori et al., 2023): a model obtained by LoRA-tuning LlaMA-7b on the Alpaca instruction set.
• GPT-3 (Brown et al., 2020): a large language model without instruct-tuning.

Evaluation Metrics
Our evaluation metrics include automatic evaluation metrics and human evaluation.
Automatic Evaluation: First, we do not entirely disregard the commonly utilized n-gram-based automatic evaluation metrics, including BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Lavie and Agarwal, 2007). More importantly, however, we present two model-based evaluation metrics to assess performance differences among systems. A detailed explanation follows:
• ChatGPT Eval: We utilize two instructions to guide ChatGPT in generating the stance score (S_st) and the argument completeness score (S_com), both of which range from 0 to 100. The instructions we employed are outlined in Appendix D. The stance score assesses whether the original sentence and the generated sentence take opposing stances, while the completeness score gauges the caliber of the generated counter-argument, specifically whether it makes logical sense. We employ a weighted average of these two scores to get the final ChatGPT Eval score, λ S_st + (1 − λ) S_com, where λ is set to 0.5 in our experiments. To reduce the uncertainty introduced by ChatGPT, we set the temperature to 0.1. A minimal sketch of this scoring is given at the end of this subsection.
• Arg-Judge: In order to ascertain the degree of relevance and informativeness of the generated counter-arguments, we adopt a "reverse validation" approach using the filter model trained in Section 3.2. To this end, we establish Arg-Judge as the metric for evaluating the efficacy of this approach in identifying meaningful counter-arguments that are not mere nonsensical safe replies. Specifically, we normalize the average-pooled hidden states before the softmax layer of the filter model θ to get a continuous predicted score ŝ ∈ [0, 4]. We empirically define the Arg-Judge score as a monotonic transformation of ŝ; we selected its hyper-parameter setting based on our observation that large language models (such as Alpaca-LoRA and GPT-3) tend to generate sentences with scores concentrated between 0 and 0.8. Arg-Judge can thus enhance the distinguishability of these high-scoring sentences while still maintaining monotonicity.

Human Evaluation: Based on the work of Hua et al. (2019), we conducted a more detailed human evaluation. Five human judges are asked to rate arguments on a Likert scale of 1 (worst) to 5 (best) across five dimensions to evaluate the performance of the systems:
• Grammaticality: assesses whether the output adheres to the rules of grammar.
• Appropriateness: focuses on whether the output is contextually suitable.
• Content Richness: reflects the depth of information provided by the output.
• Logic: measures the rationality of the output.
• Persuasiveness: shows the extent to which readers are persuaded by the output.
Additionally, we use Top-1 to represent the proportion of outputs ranked best. We emphasize the importance of human evaluation, as it provides results that are more aligned with human values than automatic metrics based on n-grams or models.
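As referenced above, the ChatGPT Eval score could be computed roughly as in the sketch below. The prompts are hypothetical stand-ins for the Appendix D instructions, and the call shown assumes the pre-1.0 OpenAI Python SDK; only the temperature of 0.1 and λ = 0.5 come from the description above.

```python
# Illustrative ChatGPT Eval sketch (assumes the pre-1.0 `openai` Python SDK and that
# openai.api_key is set). The two prompts are hypothetical stand-ins for Appendix D.
import openai

def ask_score(prompt: str) -> float:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,  # low temperature, as in the paper
    )
    return float(resp["choices"][0]["message"]["content"].strip())

def chatgpt_eval(original: str, generated: str, lam: float = 0.5) -> float:
    s_st = ask_score(
        "Score from 0 to 100 how strongly these two sentences take opposing stances.\n"
        f"A: {original}\nB: {generated}\nAnswer with a number only."
    )
    s_com = ask_score(
        "Score from 0 to 100 how complete and logically sound this counter-argument is.\n"
        f"{generated}\nAnswer with a number only."
    )
    return lam * s_st + (1 - lam) * s_com  # weighted average with lambda = 0.5
```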

Human Evaluation
We report the result of human evaluation in Figure 3 and Table 3, and we have the following observations:
• Our system outperforms the baselines across multiple dimensions and on the Top-1 metric. This shows that the counter-arguments generated by our framework are more in line with human preferences.

• The result of human evaluation is consistent with the ChatGPT Eval and Arg-Judge results reported in Table 2, which corresponds to our hypothesis that LM-based evaluators may be suitable for counter-argument generation.
• Compared with other models, our system excels in appropriateness, logic, and persuasiveness. This achievement can be attributed to the CoT instructions, which effectively guide the language model through multi-step reasoning.
Table 3: Result of human evaluation. Top-1 means the proportion of system outputs ranked first: Arg-LlaMA 62%, Alpaca-LoRA 27%, LlaMA 10%. The counter-arguments generated by our proposed model were rated by human judges to be of higher quality.

Ablation Study
We perform ablation studies to explore the role of different components in both the training and generation processes. We explored four variants in addition to the overall framework: 1) instead of the argumentation instruction set, we use the Alpaca instruction set to instruct-tune the LM; 2) we replace the LM with LlaMA-7b, which has not been fine-tuned with instructions; 3) we remove the CoT instruction component and use a series of simple instructions, such as "Give me the counter-argument"; and 4) we remove the filter component and select the output from the candidates randomly.
The results of the ablation study are presented in Table 4. We have the following findings:
• Using argumentation instructions during training is very helpful for the model. This clearly demonstrates the effectiveness of our proposed argumentation instruction set. We make the argumentation instruction set publicly accessible to benefit the wider community.
• Instruct-tuning matters. Simply generating from LlaMA hurts performance, while instruction tuning helps the model better adapt to argumentation scenarios, respond to the instructions, and reason out correct rebuttals from CoT instructions.
• Compared with common instructions, CoT instructions produce higher-quality counter-arguments. This is because CoT instructions give a logical chain and a multi-step reasoning process, which improves the quality of the output.
• Multiple error templates improve the quality of generated counter-arguments. They help the LM discover potential errors from multiple perspectives, thus generating richer candidate sentences.
• The filter component plays a crucial role in our system. It enables us to select high-quality arguments from candidate sentences, while random selection fails to achieve similar performance.

Validation of Arg-Judge Metric
In order to explore the capability of our proposed Arg-Judge to reflect the actual ranking level and its consistency with human evaluation, we designed two corresponding tasks, each based on its own dataset.
Ranking Data (RD): We use the test set of this dataset to check whether Arg-Judge can reflect the real ranking. As mentioned in Section 3.2, it has 800 testing samples; each sample includes an original argument and four candidate counter-arguments.
Quality Selection Dataset (QSD): This dataset is used to check whether Arg-Judge is aligned with human evaluation. It consists of 500 triplets in the format of <original argument, better counter-argument, worse counter-argument>. Given an original argument from ArgTersely, we first used ChatGPT to generate two counter-arguments and then manually selected one as the better counter-argument and the other as the worse counter-argument.

Validation Tasks and Comparisons
RD: Can Arg-Judge reflect the real ranking? Given the original argument, the task on the RD dataset is to select the best counter-argument from the four candidates. We use precision at one (P@1) to measure the ability of Arg-Judge to reflect the real ranking.
QSD: Is Arg-Judge consistent with human evaluation? Given the original argument, the task on the QSD dataset is to select the better counter-argument from the two candidates. We use accuracy to reflect the consistency between Arg-Judge and human evaluation.
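Both validation metrics are simple selection accuracies; a minimal sketch follows (the data structures and variable names are ours, not the released code).

```python
# Minimal sketch of the two validation metrics; field and variable names are ours.
def precision_at_1(rd_samples, score_fn):
    """RD task: fraction of samples where the top-scored candidate has gold ranking score 1."""
    hits = 0
    for sample in rd_samples:  # each sample: original argument x, candidates Y, gold scores
        best = max(range(len(sample["Y"])),
                   key=lambda i: score_fn(sample["x"], sample["Y"][i]))
        hits += int(sample["scores"][best] == 1)
    return hits / len(rd_samples)

def qsd_accuracy(qsd_triplets, score_fn):
    """QSD task: fraction of triplets where the better counter-argument outscores the worse one."""
    correct = sum(
        score_fn(x, better) > score_fn(x, worse)
        for x, better, worse in qsd_triplets
    )
    return correct / len(qsd_triplets)
```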
Comparisons: We use BERT-base and ChatGPT as comparisons. To adapt ChatGPT to these tasks, we constructed two instructions. Specific information about the instructions is in Appendix E.

Validation Result
Arg-Judge can reflect the real ranking. The result in Table 5 shows that the performance of Arg-Judge is better than that of ChatGPT and BERT. It means that, after training, Arg-Judge demonstrates ranking capabilities that reflect real-world scenarios.
Arg-Judge is highly consistent with human evaluation. The result is in Table 6. Based on the result, consistency with human evaluation ranks, from high to low: Arg-Judge, ChatGPT, and BERT. This shows the high consistency between Arg-Judge and human evaluation.

Case Study
We illustrate the advantage of our model through a case study in Table 7. Alpaca-LoRA and LlaMA fail to adequately tap into the subtext of the original argument that the white upper middle class cannot help the working class when they are fighting each other. Without instruct-tuning, the outputs of GPT-3 and LlaMA mostly consist of extensions or additions to the original text, lacking a compelling rebuttal. In contrast, our model recognizes this implicit logic and formulates a counter-argument stating that it is not a plausible excuse.

Related Work
Counter-Argument Generation: Datasets for counter-argument generation (Ji et al., 2021; Yuan et al., 2021; Hua and Wang, 2018; Stab et al., 2018) mainly establish the rebuttal relationship in a conversation using automatic methods such as citation or reply detection. Tan et al. (2016) proposed the CMV dataset, including the citation relationship between original posts and their corresponding replies.

Wang et al. (2023) provide an almost annotation-free method for aligning pre-trained language models with instructions. The overall process is an iterative bootstrapping algorithm, which starts off with a limited seed set of manually written instructions that are used to guide the overall generation. We fine-tuned the Arg-LlaMA model using the self-instruct approach, including seed instructions for a variety of tasks related to counter-argument generation.

Adaptation: As the size of language models increases, the cost of fine-tuning also increases. A series of works (Shin et al., 2020; Li and Liang, 2021; Houlsby et al., 2019) proposed methods like prompt-tuning and adapter-tuning to alleviate this problem. However, it is difficult to directly optimize the prompt, and introducing adapter layers causes a delay at inference time. Considering that, Hu et al. (2021) proposed Low-Rank Adaptation (LoRA), which greatly reduces the number of trainable parameters for downstream tasks. Additionally, there have been advancements in extending Low-Rank Adaptation. Dettmers et al. (2023) proposed QLoRA, which fine-tunes large models on limited-memory GPUs through 4-bit quantization and Low-Rank Adapters. Chen et al. (2023) introduced LongLoRA, which leverages sparse local attention and achieves context extension with minimal computation. Since our work does not involve long contexts in generation and does not prioritize optimization techniques, we simply utilize LoRA to fine-tune our model for efficiency.

Conclusions
In this paper, we introduce a benchmark, ArgTersely, for sentence-level counter-argument generation. Specifically, we present a human-annotated dataset and develop a language model based on argumentation instructions. We further construct a framework, Arg-LlaMA, which leverages this language model. Additionally, we propose two model-based metrics, ChatGPT Eval and Arg-Judge, as complements to n-gram-based metrics. Experiments show that our framework competes well with mainstream models, and our metrics are effective and highly consistent with human evaluations.

Ethical Considerations
Since we propose a new dataset, ArgTersely, we address some possible ethical issues in this section.
Potential Risk: Our dataset is sourced from ChangeMyView (CMV), a subcommunity on Reddit. Users must adhere to the community rules, including restrictions on hate speech. We also formulate an ethical guideline and require annotators to follow it. We train annotators to mark and skip sentences violating the ethical guidelines. Annotators were informed about potential risks. Our annotation process respects intellectual property and privacy rights.
Identifiable Information: Our data is sourced from open platforms, safeguarding privacy. We also removed sensitive information such as emails, phone numbers, and usernames during data preprocessing.
Compensation: We employed 24 part-time annotators, compensating them at $0.25 per conversation (equivalent to at least $3.75 per hour, with a cap of 2 hours per day), which surpasses the local minimum wage.
Annotation Bias: We perform a series of methods to reduce bias during annotation, including annotator training, trial annotation, and cross-annotation.

Limitations
While the experimental results demonstrate the effectiveness of Arg-Judge, it is important to note that our exploration of the consistency between human evaluation and language model evaluators (including ChatGPT Eval and Arg-Judge) was limited to a specific set of scenarios. Furthermore, due to computational resource constraints, we were unable to train a larger-scale language model as an evaluator.
Moving forward, our future research will involve expanding the evaluation of the language model evaluator across a broader range of scenarios and utilizing a larger-scale language model to enhance its capabilities.

Figure 3: Result of human evaluation on grammaticality, appropriateness, content richness, logic, and persuasiveness.
Figure 2: The overview of our proposed framework, Arg-LlaMA. First, CoT instructions guide the language model to identify errors. Next, the LM generates candidate sentences based on those errors. Finally, a BERT-based filter selects the best counter-argument by scoring the concatenated original argument and candidate sentence.

Table 1: The statistics of ArgTersely.

Table 2: Automatic evaluation results for counter-argument generation. We report BLEU-1 (BLEU), ROUGE-L (ROUGE), METEOR, ChatGPT Eval, Arg-Judge, the average number of words per sentence (# Word), and the number of parameters of each model (# Param). The best results are in bold. Our proposed model performs well on most metrics (Wilcoxon signed-rank test (Kotz and Johnson, 2012), p < 0.05).

The ArgTersely test set consists of triplets in the format of <topic, original argument, counter-argument>.

Implementation Details: When training the base LM with instruct-tuning, we use LlaMA-7b as the base model. We set the learning rate to 3 × 10⁻⁴, the batch size to 256, and the gradient accumulation step to 16, and train the model for 5 epochs on 4 NVIDIA RTX 3090 GPUs. The α and r of the LoRA method are both set to 16. When training the filter, we use BERT-base as the base model. We set the learning rate to 1 × 10⁻⁵ and the batch size to 64, and train on an NVIDIA RTX 3090 GPU for 2 epochs. For training both models, we employed AdamW (Loshchilov and Hutter, 2018) as the optimizer.

Table 2 reports the automatic evaluation results.
• Comparing n-grams to reference sentences is often insufficient for argument generation tasks. Our proposed model-based metrics, ChatGPT Eval and Arg-Judge, not only demonstrate consistency with the n-gram metrics but also complement each other. For instance, BART and DialoGPT tend to generate stance-correct but logic-and-content-lacking "safe replies", which may receive acceptable scores in BLEU, ROUGE, or ChatGPT Eval, but low scores from Arg-Judge. This addresses the limitations of the traditional metrics used in the past.

Table 5: Validation result on the RD dataset. It demonstrates that the ranking produced by the Arg-Judge score is reliable.

Table 6: Validation result on the QSD dataset. Arg-Judge shows high consistency with human evaluation.

Table 7: Case study of an instance in the test set.
Topic: How should the struggles of white upper middle-class individuals be addressed in the USA?
Original Argument: The struggles faced by white upper middle-class individuals in the USA raise questions about the prospects and viability of investing in the white working-class, let alone providing support for communities of color in inner cities.
Ours: However, the struggles faced by white upper middle-class individuals in the USA should not be used as an excuse for not investing in the white working-class.
GPT-3: The white working-class is not a monolithic group. The white working-class is not a homogeneous group. The white working-class is not a static group.
Alpaca-LoRA: The struggles faced by white upper middle-class individuals in the USA raises questions about the prospects and viability of investing in the white working-class and providing support for communities of color in inner cities.
LlaMA: Let alone providing support for communities of color in inner cities.