Contextual Interaction for Argument Post Quality Assessment



Introduction
Given the highly subjective nature of argumentation, arriving at a standard answer for a contentious topic is often challenging, as diverse opinions exist. Consequently, assessing the quality of arguments is a complex task, as it necessitates evaluating not only the relationship between arguments and the topic at hand, but also the quality of the argument itself (Wachsmuth et al., 2017b). Nowadays, there are many debate websites on the Internet, such as Quora, Kialo, and Zhihu. Usually, these websites present multiple arguments in a masonry layout. People tend to view these arguments within a very short period of time, then give an overall estimate of their strength based on their impression of the relative quality of different arguments. In this regard, when confronted with a large volume of text on the internet, people's evaluations of argument quality often stem from quick impressions rather than careful consideration after thorough reading and thinking.
Argumentation is frequently perceived as a tool for facilitating various forms of reasoning, including decision-making and persuasion. However, these approaches often assume that the individuals involved will exhibit purely rational behavior. In contrast, human behavior is known to blend rational and emotional elements in guiding actions. It has been suggested that a substantial link exists between the process of argumentation and the emotions experienced by the participants in that process. According to studies in social science (Benlamine et al., 2015; D'Errico et al., 2018; Li and Xiao, 2020; Hilton, 2008), the strength of an argument relies much more on its emotional appeal than on its logical coherence, which means it is hard to capture the quality of arguments from their own context alone. The first impression formed by comparing different arguments plays an important role in assessing an argument's quality. However, how to model this comparison procedure remains uncharted. Indeed, most current AQ assessment approaches (Marro et al., 2022; Gurcke et al., 2021a; Wachsmuth et al., 2016; Persing and Ng, 2017) consider an argument's quality based on its own context under the restricted perspective of logical, rhetorical, or dialectical quality. Only a few works have tried to address the AQ assessment problem from a holistic view (Fromm et al., 2022). Some have tried to capture a feature that may affect the quality of an argument, such as argument structure (Li et al., 2020) or discourse structure (Liu et al., 2021). However, revealing those features requires extra annotation under the guidance of linguistic specialists, which makes those methods hard to apply to large-scale datasets. Since argument quality is a relative concept, it is hard to distinguish slight differences in quality, especially between arguments that differ only subtly.
This study presents a novel investigation into measuring the nuanced differences between arguments for quality assessment. Specifically, we examine the contextual interaction of content between various argument posts to enhance the quality assessment process. We explore two different approaches to simulate the comparison of different contexts as humans evaluate multiple arguments: 1) supervised contrastive learning for cross-argument interaction, which pulls together arguments with similar quality while pushing apart arguments whose quality lies at the two extremes; 2) utilization of Large Language Models (LLMs) with in-context examples for AQ assessment. Extensive evaluation on the standard IBM-Rank-30k dataset (Gretz et al., 2020) demonstrates that contrastive learning surpasses state-of-the-art baselines in terms of overall assessment accuracy across arguments of varying quality ranges, as well as in distinguishing arguments with similar quality. While LLMs with in-context examples are effective at recognizing arguments at the extreme ends of quality, they fall short of contrastive learning when quantifying argument quality and differentiating between arguments with close quality gaps.
The major contributions of this paper are: 1) an investigation of modeling contextual interaction for AQ assessment; 2) a contrastive interaction approach that outperforms state-of-the-art baselines on the IBM-Rank-30k dataset; 3) an exploration of LLMs with in-context examples for AQ assessment, underscoring the limited performance of LLMs in quantitatively recognizing subtle differences between arguments.

Related Work
Argument Quality (AQ) Assessment. The problem of creating a convincing argument has its origins in ancient Greece, where the persuasiveness of arguments was discussed through dialectic and rhetoric (Aristotle and Kennedy, 2006). Based on classical theories of argument (Johnson and Blair, 2006; Hamblin, 1970; Perelman and Olbrechts-Tyteca, 1969; Eemeren and Grootendorst, 2003), Wachsmuth et al. (2017b) identify the logical, rhetorical, and dialectical dimensions as the three aspects of AQ. With recent progress in natural language processing, AQ has been studied and applied in various domains, including student essays (Wachsmuth et al., 2016), news editorials (El Baff et al., 2020), and social media discussions (Wachsmuth et al., 2017c; Skitalinskaya et al., 2021).
Current research on AQ assessment has mainly studied it as a sub-task of Argument Mining (AM). However, due to the extreme subjectivity of AQ, there is no clear definition for it; as a result, various factors are believed to influence AQ. Wachsmuth et al. (2017b) summarized 15 factors that affect argument quality, categorizing them into logical, rhetorical, and dialectical aspects. Gurcke et al. (2021b) assessed argument quality specifically in terms of sufficiency with human effort, hypothesizing that the conclusion of a sufficient argument can be derived from its premises. Li et al. (2020) assessed the persuasiveness of arguments by analyzing their argument structure using a factor graph model. Singh et al. (2021) focused on explicating implicit reasoning (warrants) in arguments with the help of trained experts. Falk and Lapesa (2023) attempted to enhance AQ assessment performance by incorporating knowledge from other dimensions into the prediction process through multi-task learning. These studies consider each perspective of argument quality separately, which limits their holistic view of the concept. Additionally, some of these approaches require additional annotations (Marro et al., 2022), making them difficult to apply to large-scale datasets.
Cross-document Interaction. The idea of modeling cross-document interactions has been widely explored in machine learning and learning-to-rank tasks. Pang et al. (2019) proposed SetRank, which utilizes a self-attention mechanism to capture local context information from cross-document interactions and learn permutation-equivariant representations for the input documents. In a related area, van den Oord et al. (2018) introduced unsupervised contrastive learning, a method for extracting useful representations from high-dimensional data. This technique has been influential in various domains and has shown promising results in learning meaningful representations. Building on the concept of contrastive learning, Khosla et al. (2020) extended it to the supervised setting, enabling effective utilization of label information.
Large Language Models with In-context Examples. In-context learning (ICL) (Honovich et al., 2022) has achieved tremendous success with large language models (LLMs). The main concept behind in-context learning is to leverage analogies for learning: a small number of examples create a demonstration context, often expressed through natural language templates (Min et al., 2022). Specifically, a query question and the demonstration context are concatenated to form a prompt, which is then fed into an LLM for prediction. To enhance the efficiency and performance of existing LLMs, especially for non-open-source APIs like ChatGPT and GPT-4, OverPrompt (Li et al., 2023) enables parallel processing of multiple inputs within a single prompt using in-context learning. Liu et al. (2022) explore a more effective strategy for carefully selecting in-context examples, which can amplify GPT-3's in-context learning capabilities more efficiently than random sampling. Recently, some studies have employed LLMs for ranking and quality assessment tasks. Sun et al. (2023) examined the use of generative language models like ChatGPT and GPT-4 for relevance ranking in Information Retrieval (IR). Kocmi and Federmann (2023) utilized GPT-series models to assess the quality of machine translation and proposed the GEMBA metric.

Overview
To imitate how humans react when reading different arguments, we explore two alternative approaches to this comparison procedure, especially for differentiating arguments with a close quality gap. The first approach is based on supervised contrastive learning (CL) (Khosla et al., 2020), which aims to maximize the similarity between similar pairs of samples while simultaneously minimizing the similarity between dissimilar pairs. Additionally, a reasoning module based on graph neural networks is introduced to leverage the discourse relations and logical structure of arguments for quality assessment. The second approach leverages the capabilities of LLMs. Specifically, we utilize LLMs with in-context examples, which can be viewed as a means of incorporating contextual interaction as prompted demonstrations. In this approach, the models analyze and process text while considering the provided examples within a given prompt. To imbue the LLMs with knowledge of argument quality, we supply the models with example arguments of varying quality levels in the prompts. We then prompt the model to assess and rate the provided arguments based on the standards set by those examples.

Supervised Contrastive Learning for Cross-argument Interaction
The architecture of our proposed supervised contrastive learning method is shown in Figure 1. During the training phase, each original batch $B_o$ consists of $4n$ arguments from the training set, shuffled in random order. Each argument is concatenated with its corresponding topic using the [SEP] delimiter token, and a [CLS] token is prepended to the sequence. The argument representation that serves as input to the contrastive objective is produced by the backbone architecture, such as RoBERTa (Liu et al., 2019). After encoding, we extract the last layer's hidden state of the [CLS] token as the representation of each argument in the original batch $B_o$, obtaining a representation list $H_o$. To determine the anchor candidates, i.e., the arguments with median quality within the batch, the original batch $B_o$ is ranked in descending order of the true labels (human-annotated scores), yielding an ordered batch $B_r$ with an ordered representation list $H_r$. We then divide this ranked batch into four equal-size mini-batches, denoted $B_r^s$ and $H_r^s$ with $s \in \{1, 2, 3, 4\}$. To increase the distinction between median arguments and the two extreme ends, the second and third mini-batches are chosen as the anchor candidate sets, and the other two mini-batches serve only as negative sample sets. That is, we want the arguments within the second (or third) mini-batch to be as close as possible to each other, and as far as possible from the arguments in the remaining three mini-batches. The contrastive loss is defined as follows:

$$\mathcal{L}_{CL} = \sum_{s \in \{2,3\}} \sum_{i \in B_r^s} \frac{-1}{|B_r^s| - 1} \sum_{\substack{j \in B_r^s \\ j \neq i}} \log \frac{\exp(\mathrm{sim}(h_i, h_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(h_i, h_k)/\tau)} \quad (1)$$

wherein $s$ is 2 or 3, $\mathrm{sim}$ denotes the cosine similarity, and $\tau$ is the temperature parameter.
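The batch partitioning and contrastive objective described above can be sketched as follows. This is a minimal NumPy version under our own assumptions (variable names, L2 normalization for cosine similarity); it is not the paper's actual implementation.

```python
import numpy as np

def contrastive_loss(reps, scores, tau=0.1):
    """Supervised contrastive loss over one batch of 4n argument
    representations. Arguments are ranked by gold score (descending)
    and split into four equal mini-batches; the 2nd and 3rd serve as
    anchor candidate sets, as in the description above."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    reps = np.asarray(reps, dtype=float)[order]
    reps /= np.linalg.norm(reps, axis=1, keepdims=True)  # cosine sim via dot
    n = len(reps) // 4
    groups = [list(range(s * n, (s + 1) * n)) for s in range(4)]
    sim = reps @ reps.T / tau
    loss, count = 0.0, 0
    for s in (1, 2):                       # 2nd and 3rd mini-batches (0-based)
        for i in groups[s]:
            others = [k for k in range(len(reps)) if k != i]
            denom = np.sum(np.exp(sim[i, others]))
            for j in groups[s]:            # in-batch positives
                if j == i:
                    continue
                loss += -np.log(np.exp(sim[i, j]) / denom)
                count += 1
    return loss / max(count, 1)
```

Each anchor's positives are the other members of its own mini-batch; all remaining arguments in the batch act as negatives through the denominator.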
Apart from the contrastive alignment, the model is also trained against the ground-truth scores using a mean squared error (MSE) loss:

$$\mathcal{L}_{MSE} = \frac{1}{|B_o|} \sum_{i=1}^{|B_o|} \left(y_{pred}^{(i)} - y_{true}^{(i)}\right)^2$$

wherein $y_{pred}$ denotes the model's predictions and $y_{true}$ the ground-truth scores.
The final loss is a combination of the contrastive loss and the MSE loss. The contrastive loss encourages the model to learn discriminative representations for anchor candidates and in-batch positives while promoting separation from in-batch negatives. The MSE term measures the discrepancy between the model's predictions and the ground-truth labels.
$$\mathcal{L} = \alpha \mathcal{L}_{MSE} + \beta \mathcal{L}_{CL} \quad (4)$$

wherein $\alpha$ and $\beta$ are the weighting factors. By adjusting the value of $\beta$, we can control the importance of the contrastive loss relative to the MSE loss in the overall training objective.
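A minimal sketch of the combined objective, assuming the final loss takes the form α·L_MSE + β·L_CL with the weighting factors reported in the experimental setup (α = 0.5, β = 0.8); the exact term ordering is our assumption.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error between predicted and gold quality scores."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

def total_loss(l_cl, l_mse, alpha=0.5, beta=0.8):
    """Weighted combination of the MSE and contrastive terms; the default
    weights follow the values reported in the implementation details."""
    return alpha * l_mse + beta * l_cl
```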

Reasoning Using Discourse Relation
Similar to previous studies (Toledo et al., 2019; Lauscher et al., 2020; Habernal and Gurevych, 2016; Wachsmuth et al., 2017a; El Baff et al., 2018), we recognize the vital role played by the context and logical structure of an argument in the assessment of its quality.

Large Language Models with In-context Examples
In this work, we employ two large language models: ChatGPT (gpt-3.5-turbo) and Davinci-003 (text-davinci-003). Both are based on the GPT-3.5 architecture, with Davinci-003 being the earlier version. Both offer excellent performance and can generate coherent and contextually relevant responses in many text generation tasks. We therefore aim to explore whether these two models exhibit equally outstanding performance on the AQ assessment task.
Our approach aims to emulate the comparative evaluation carried out by human readers when browsing debate forums or websites. To achieve this, we incorporate knowledge of argument quality by designing the prompt with in-context examples. The examples within the prompt can be regarded as posts that humans have already seen, together with the evaluations formed in their minds while browsing different arguments. Figure 3 shows the detailed prompts for the AQ ranking and comparison tasks. In each prompt, we first provide example arguments along with their corresponding topic and score. We then provide the task instructions. Finally, we present the argument(s) to be evaluated or the argument pair to be distinguished, followed by "Score:" or "Better argument:", indicating that the model is required to complete the score or judgment.
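The prompt assembly described above can be sketched as follows. The template strings here are illustrative placeholders; the exact wording follows Figure 3 in the paper.

```python
def build_prompt(examples, task_instruction, query_arguments):
    """Assemble an in-context prompt for AQ assessment: demonstration
    examples first, then the task instruction, then the argument(s) to
    score, ending with the completion cue."""
    lines = []
    for ex in examples:                       # demonstration context
        lines.append(f"Topic: {ex['topic']}")
        lines.append(f"Argument: {ex['argument']}")
        lines.append(f"Score: {ex['score']}")
        lines.append("")
    lines.append(task_instruction)
    for i, (topic, arg) in enumerate(query_arguments, 1):
        lines.append(f"Topic: {topic}")
        lines.append(f"Argument {i}: {arg}")
    lines.append("Score:")                    # cue the model to complete
    return "\n".join(lines)
```

For the comparison task, the final cue would instead be "Better argument:" with two candidate arguments listed.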

Experimental Setup
Dataset. IBM-Rank-30k (Gretz et al., 2020) is a large-scale argument dataset containing 30k arguments over 71 topics. Each argument is annotated with a continuous quality score between 0 and 1; we use the weighted-average quality scores as ground-truth labels since they mitigate the influence of unreliable annotators. The training subset contains 49 topics with 20,974 arguments. The dev subset contains 7 topics with 3,208 arguments and is used for tuning hyper-parameters and determining early stopping. The test subset contains 15 topics with 6,315 arguments.
Task setup. We evaluate the effectiveness of incorporating contextual interaction for argument quality (AQ) assessment on two tasks: AQ ranking and AQ comparison.
The AQ ranking task follows the official setup of IBM-Rank-30k. Given a list of arguments A = [a_1, a_2, ..., a_n] and their corresponding topics T = [t_1, t_2, ..., t_n], the task is to rank the arguments in descending order of their quality scores S(a_i, t_i). The results of the AQ ranking task are evaluated with Pearson correlation (Cohen et al., 2009), Spearman correlation (Wissler, 1905), and NDCG (Normalized Discounted Cumulative Gain) (Järvelin and Kekäläinen, 2002) on all test samples.
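For illustration, the three ranking metrics can be computed as in the self-contained sketch below (the Spearman version omits tie correction; production code would typically use scipy.stats and sklearn.metrics instead).

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two score lists."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

def spearman(x, y):
    """Spearman correlation: Pearson over ranks (no tie correction)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))

def ndcg(pred_scores, true_gains, k=None):
    """NDCG@k: discounted gains of the predicted ordering, normalized
    by the ideal (gold-sorted) ordering."""
    order = np.argsort(-np.asarray(pred_scores, dtype=float))
    gains = np.asarray(true_gains, dtype=float)[order][:k]
    disc = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    ideal = np.sort(np.asarray(true_gains, dtype=float))[::-1][:k]
    return float((gains @ disc) / (ideal @ disc))
```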
The AQ comparison task involves predicting the argument of higher quality in a given argument pair, denoted as (A_1, A_2). To evaluate model performance on this task, we conduct experiments using three sets of argument pairs, each consisting of 2,000 pairs. The pairs are categorized by the difference in argument quality scores: less than 0.25, from 0.25 to 0.5, and exceeding 0.5. The objective is to assess how well a model differentiates between argument posts at different levels of quality gap. Accuracy, the percentage of correct predictions over all argument pairs, is used to evaluate this task.
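A sketch of how the three evaluation sets could be constructed from scored pairs. The bucket boundaries follow the description above; the inclusive/exclusive handling of the 0.25 and 0.5 boundaries is our assumption.

```python
def bucket_pairs(pairs):
    """Group argument pairs by the absolute gap in gold quality scores,
    mirroring the three evaluation sets (<0.25, 0.25-0.5, >0.5).
    Each pair is (arg1, score1, arg2, score2)."""
    buckets = {"<0.25": [], "0.25-0.5": [], ">0.5": []}
    for a1, s1, a2, s2 in pairs:
        d = abs(s1 - s2)
        if d < 0.25:
            buckets["<0.25"].append((a1, a2))
        elif d <= 0.5:
            buckets["0.25-0.5"].append((a1, a2))
        else:
            buckets[">0.5"].append((a1, a2))
    return buckets
```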
Implementation details. We use RoBERTa-base (Liu et al., 2019) and BERT-base (Devlin et al., 2019) as our backbone models. The hyperparameters α and β in Eq. 4 are set to 0.5 and 0.8, and the temperature in Eq. 1 is set to 0.1. We fine-tuned the RoBERTa-base model for 5 epochs with a learning rate of 2e-5 and a batch size of 32, which aligns with the settings of the BERT model in (Gretz et al., 2020). The vanilla model and the models with DAGN are trained with the MSE loss, while the models with CL are trained with the loss in Eq. 4.
As demonstrations for the LLMs, we employ two settings for the number of examples N. The first setting includes three examples: a high-quality argument (1 point), a medium-quality argument (0.5 points), and a low-quality argument (0.1 points). The second setting provides ten examples in the prompt, with scores ranging from 0.1 to 1; the smallest score difference between any pair of examples is approximately 0.1 points. We experiment with both settings on the AQ ranking and comparison tasks.
In the AQ ranking task, we instruct the LLMs to give a quality score for the input argument, and we test two settings for the number of candidate arguments S: 1) S = 1, rating an individual argument, and 2) S = 10, rating groups of ten arguments together, to evaluate the impact of the number of candidate arguments on AQ ranking performance. In the AQ comparison task, to determine whether the LLMs can distinguish argument quality, we ask them to select the argument of better quality between the two options; that is, we instruct them in the prompts to return the number (1 or 2) corresponding to the higher-quality argument in the specified format. The prompts used can be found in Figure 3. Due to the randomness of scoring arguments with LLMs, we run five experiments separately and report the average score or the majority vote over the five runs as the final result. In the naming of model variants, kE and kS stand for the number of example arguments presented to the LLM and the number of arguments the LLM is asked to score at once, respectively. For instance, ChatGPT-3E-10S stands for ChatGPT prompted to score a batch of 10 arguments while being shown 3 examples.
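The aggregation over the five runs can be sketched as follows; the function and argument names are illustrative, not from the paper.

```python
from collections import Counter
from statistics import mean

def aggregate_runs(outputs, task):
    """Aggregate independent LLM runs: average the scores for the
    ranking task, take the majority vote ('1' or '2') for the
    comparison task."""
    if task == "ranking":                 # outputs: list of float scores
        return mean(outputs)
    votes = Counter(outputs)              # outputs: list of '1'/'2' choices
    return votes.most_common(1)[0][0]
```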
Comparison models. We employ the following baselines in our evaluation. SVM BOW is a support vector regression ranker (Gretz et al., 2020), trained on the 1,000 most frequent tokens with an RBF kernel and bag-of-words features. Arg Length (Gretz et al., 2020) evaluates argument quality based on its length in characters, following the intuition that longer texts may provide more detailed explanations. RoBERTa and BERT (Favreau et al., 2022) concatenate the argument with its corresponding topic and generate a quality score. TFR-BERT with ensemble losses achieves state-of-the-art effectiveness on IBM-Rank-30k as shown in (Favreau et al., 2022). This approach builds upon the work of (Han et al., 2020), which incorporates several ranking losses in TFR-BERT; Favreau et al. (2022) apply a similar technique to evaluate argument quality by combining the outputs of multiple TFR-BERT models, each trained with a distinct ranking loss. RoBERTa w/ own adpt (Falk and Lapesa, 2023) is a recent approach that integrates knowledge from different dimensions into the prediction process using multi-task learning, with RoBERTa-base as the backbone model for all dimensions. Dual BERT w/ spark (ZS) (Deshpande et al., 2023) involves four types of augmentations for AQ prediction: offering feedback, deducing hidden assumptions, providing a similar-quality argument, or presenting a counterargument. Dual BERT is used as the backbone model.
The outcomes of SVM BOW and Arg Length are referenced from (Gretz et al., 2020). The results of BERT and TFR-BERT are cited directly from (Favreau et al., 2022). Additionally, the results of RoBERTa w/ own adpt and Dual BERT w/ spark (ZS) are cited from (Falk and Lapesa, 2023) and (Deshpande et al., 2023), respectively.

Experimental Results
Tables 1 & 2 present the results obtained for the AQ ranking task and the AQ comparison task, respectively.
Our proposed contrastive argument interaction outperforms state-of-the-art baselines. The results in Table 1 show that RoBERTa's performance is superior to that of BERT: RoBERTa achieves a Pearson correlation of 0.5283 and a Spearman correlation of 0.4858. Our proposed contrastive learning (CL) for argument interaction outperforms the baselines with a Pearson correlation of 0.5580 and a Spearman correlation of 0.5186. According to NDCG@K, our methods outperform the baselines by a significant margin. Moreover, RoBERTa + DAGN and RoBERTa + CL achieve prominent performance with a Pearson correlation of 0.55 and a Spearman correlation of 0.51; the strong performance of these two models reveals that reasoning and comparing both have a large impact on argument quality. The experimental results indicate that CL consistently improves AQ ranking performance over the vanilla RoBERTa and BERT models. For the AQ comparison task, we exclusively employ RoBERTa-based models, excluding LLMs, as they have demonstrated superior performance compared to other models. The results in Table 2 reveal that the CL method achieves the highest accuracy in discerning argument pairs with a score difference within 0.25 and above 0.5, reaching 65.45% and 89.6%, respectively. This confirms that our proposed approach effectively enhances the model's ability to distinguish subtle differences and creates a clear distinction between arguments with intermediate scores and those at the extreme ends.
Performance of LLMs. As shown by the results in Tables 1 and 2, despite the fact that LLMs have demonstrated strong ability in text generation tasks (Bang et al., 2023; Shen et al., 2023; Laskar et al., 2023), they still have significant limitations compared to our models when it comes to quantifying text quality. Indeed, both Davinci-003 and ChatGPT underperform in the task of scoring text, while ChatGPT outperforms Davinci-003 as expected. Taking a closer look at the results with 3 examples in a prompt, we find that Davinci-003 exhibits stronger randomness when predicting low to medium scores and tends to give scores similar to those in the examples, as observed from the predicted score distributions across the experiments. ChatGPT, on the other hand, provides more diverse predicted scores, indicating stronger comprehension ability than Davinci-003. When we grouped candidate arguments into sets of 10 to predict their quality, the results improved slightly, indicating that interaction between arguments is also effective for LLMs. When the number of examples in the prompt is increased to 10, performance decreases compared to using only 3 examples, which is particularly evident in the NDCG metric; the deterioration becomes even more pronounced with more examples. Although the results in Table 2 indicate that the LLMs can distinguish between good and poor argument quality when there is a significant quality gap (a difference in quality scores above 0.5), their accuracy drops to around 50% when the difference in argument quality is small (below 0.25). This accuracy is close to that of random selection, which suggests that LLMs are inclined to select randomly rather than judge based on understanding when the quality difference is small. When the difference in argument quality is between 0.25 and 0.5, the comparison capability of the LLMs improves significantly, reaching around 67%; however, there is still a 10% gap compared to the fine-tuned models. This pattern is also reflected in the LLMs' relatively poor results on the correlation metrics for the quality ranking task, whereas they achieve decent performance on the NDCG metrics, indicating that LLMs can distinguish good from poor arguments but struggle to differentiate small differences in quality.

Discussion
Argument interaction helps distinguish slight differences. As a case study, the two example arguments in Table 3 both express opinions supporting factory farming. While both arguments discuss the advantages of factory farming in increasing agricultural output, slight differences in expression lead to variations in their quality scores: one argument receives a high quality score of 0.95, while the other scores around 0.8, indicating slightly lower quality. When using RoBERTa-base without argument interaction, the model assigns a score of 0.97 to the argument scored 0.95; however, for the other argument, which has similar content but slight flaws in expression, RoBERTa-base still assigns a high score of around 0.95. After introducing contrastive learning as a means of argument interaction, the model maintains a score of 0.95 for the argument scored 0.95 and provides a score of 0.87, much closer to the true score, for the other argument. Similar results are observed for the topic of social media. These two examples demonstrate how argument interaction can enhance the model's ability to discern subtle differences in expression.
The distribution of predicted scores with argument interaction is closer to the real distribution. The histogram in Figure 4 illustrates that the fine-tuned RoBERTa-base generates quality logits that are more polarized than the actual data distribution. In the ground truth, top arguments (score ≥ 0.9) make up about 34% of the distribution, while in the prediction without argument interaction, they account for nearly 50%. This suggests that the model fails to distinguish between decent and good arguments that differ in expression skill and in the emotional responses they evoke from readers. The histogram of the model with contrastive learning exhibits a distribution closer to the real data: it identifies top arguments at approximately 36%, indicating better differentiation between decent and good arguments than the RoBERTa-base model. These findings demonstrate that incorporating contrastive learning enables the model to accurately capture subtle differences in quality and provide more precise assessments.
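The share of top arguments quoted above can be computed with a small helper; this is a sketch, with the 0.9 threshold taken from the definition in the text.

```python
import numpy as np

def top_share(scores, threshold=0.9):
    """Fraction of arguments whose quality score reaches the threshold,
    used to compare predicted vs. gold score distributions."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(scores >= threshold))
```

Applied to gold labels and to each model's predictions, this statistic exposes the polarization described above (e.g., ~34% in the ground truth vs. ~50% for the model without interaction).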
The potential of LLMs in AQ assessment tasks. LLMs have shown immense potential in argument quality (AQ) assessment, particularly in argument comparison. As evident from the results in Table 2, when there is a significant difference (above 0.5) in argument quality, large language models can achieve comparable resolution with only a few examples. However, a significant gap remains between large language models and fine-tuned models as the quality difference between arguments shrinks. When the quality gap is small (below 0.25), the LLMs' quality evaluations approach random guessing, indicating a limited ability to capture nuanced contexts among the LLMs involved in this study. Meanwhile, we also observed that once the number of examples in the prompt reaches a certain level (e.g., 20 examples), the strong memorization ability of the large model leads it to replicate text and scores that appear in the prompt in its answers. Enabling LLMs to overcome such hallucination during the evaluation process and return appropriate content could further improve their performance on AQ tasks.
To evaluate the ability of LLMs to assess the quality of arguments under the same topic, we conducted experiments on the topic "We should adopt libertarianism". Due to the limited number of arguments available for a single topic (approximately 300 to 400), we conducted the argument quality comparison experiments by selecting 100 argument pairs for each of the three score-difference ranges under the 3-example setting. Despite the relatively limited sample size, which led to some fluctuation in the results, the findings are largely consistent with the conclusions obtained in Table 2.
This observation suggests that LLMs, despite their proficiency in open-ended text generation (Arora et al., 2022), may lack the precision needed to assign exact AQ scores, leading to relatively lower performance on AQ ranking tasks.

Conclusion
In this study, we explored two alternative methods for assessing the quality of natural language arguments: supervised contrastive learning and large language models (LLMs) with in-context examples. Comprehensive evaluation highlights the importance of considering contextual interactions when evaluating argument quality. We found that supervised contrastive learning, which captures intricate interactions between arguments, outperforms state-of-the-art baselines in evaluating argument quality. However, LLMs with in-context examples showed limitations in quantifying argument quality and distinguishing between arguments with a narrow quality gap.

Limitations
While the proposed contrastive learning approach enhances the effectiveness of AQ predictions, it may introduce additional computational costs: the inference cost increases as we introduce DAGN and CL. The LLMs' failure on the AQ ranking task can be attributed to several potential factors; bias in the training data, model interpretability, and the need for domain-specific fine-tuning all require careful consideration and mitigation. We only conducted experiments on ChatGPT and Davinci-003, without testing GPT-4. Our experiments suggest that LLMs hold great promise for AQ assessment tasks; while substantial prospects exist for LLMs in quality evaluation, overcoming the aforementioned challenges remains an imperative subject for ongoing research.

Figure 1 :
Figure 1: The supervised contrastive learning for cross-argument interaction. The representations (Reps) of arguments with median quality are pulled closer (→←), but pushed apart (←→) from other arguments whose quality is at either extreme.
Discourse relations matter in the domain of argument quality assessment: some studies have investigated their impact on the quality of argumentation (Durmus et al., 2019; Li et al., 2020). In the field of reading comprehension, Huang et al. (2021) use a Discourse-Aware Graph Network (DAGN) to capture high-level discourse features that effectively represent passages for solving logical QA tasks. Herein, we adapt this method to the domain of quality evaluation for arguments, leveraging graph neural networks to learn the discourse structure of contexts. Specifically, as shown by the example in Figure 2, DAGN treats discourse units as nodes and constructs a graph structure with certain conjunctions serving as edges to generate the graph representation. Differently, we supplement common conjunctions used in arguments (listed in Appendix A.1) as edges when constructing the graph structure. Finally, the original [CLS] token of the argument and the generated graph representation are concatenated to form the representation of the argument.
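A simplified sketch of the graph construction: splitting an argument into discourse units at a handful of conjunctions and linking consecutive units with conjunction-labeled edges. The short conjunction list here is an illustrative stand-in for the fuller list in Appendix A.1, and the splitting heuristic is our own simplification.

```python
import re

def build_discourse_graph(argument, conjunctions=("because", "but", "so", "however")):
    """Split an argument into discourse units at a small set of
    conjunctions; units become nodes and consecutive units are linked
    by edges labeled with the intervening conjunction (or 'next')."""
    pattern = r"\b(" + "|".join(conjunctions) + r")\b"
    tokens = re.split(pattern, argument.lower())   # keeps the delimiters
    nodes, edges, pending = [], [], "next"
    for tok in (t.strip(" ,.") for t in tokens):
        if not tok:
            continue
        if tok in conjunctions:
            pending = tok                          # label for the next edge
        else:
            if nodes:
                edges.append((len(nodes) - 1, len(nodes), pending))
            nodes.append(tok)
            pending = "next"
    return nodes, edges
```

In the full model, a graph neural network would run over this structure, and the resulting graph representation would be concatenated with the [CLS] representation as described above.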

Figure 2 :
Figure 2: An example of reasoning using discourse relations within the context of an argument.

Figure 3 :
Figure 3: The prompt for AQ assessment and comparison, where N represents the number of examples, and S represents the number of arguments to be evaluated.
Figure 4: Histogram of the prediction from RoBERTa and RoBERTa w/ CL compared with ground truth.

Table 1 :
Results of the AQ ranking task. Our proposed models are based on contrastive learning (CL), a discourse-aware graph network (DAGN), and large language models (LLMs). The best results are marked in bold, and the results of the best baseline are underlined. † denotes that the results are cited directly from the corresponding papers.

Table 2 :
AQ comparison results. D denotes the difference in quality score between the arguments of a pair.

Table 3 :
Examples of arguments with small differences, on which RoBERTa w/ CL gives more discriminative scores than vanilla RoBERTa.

The performance deterioration becomes even more pronounced as the number of examples increases to 20 and 30. Upon observing the experimental results, one possible explanation is that, due to the strong memorization capabilities of the GPT-3.5 model, it tends to rely heavily on the scores provided in the examples (details are shown in Appendix A.2). As the number of examples in the prompt increases, large language models tend to be more inclined to repeat the content of the examples from the prompt in their output, including their scores and text. This results in a failure of the final evaluation outcome.

Roxanne El Baff, Henning Wachsmuth, Khalid Al Khatib, and Benno Stein. 2020. Analyzing the Persuasive Effect of Style in News Editorial Argumentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3154-3160, Online. Association for Computational Linguistics.
Neele Falk and Gabriella Lapesa. 2023. Bridging argument quality and deliberative quality annotations with adapters. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2469-2488, Dubrovnik, Croatia. Association for Computational Linguistics.
Charles-Olivier Favreau, Amal Zouaq, and Sameer Bhatnagar. 2022. Learning to rank with BERT for argument quality evaluation. The International FLAIRS Conference Proceedings.
Michael Fromm, Max Berrendorf, Johanna Reiml, Isabelle Mayerhofer, Siddhartha Bhargava, Evgeniy Faerman, and Thomas Seidl. 2022. Towards a holistic view on argument quality prediction. ArXiv, abs/2205.09803.
Shai Gretz, Roni Friedman, Edo Cohen-Karlik, Assaf Toledo, Dan Lahav, Ranit Aharonov, and Noam Slonim. 2020. A large-scale dataset for argument quality ranking: Construction and analysis. In AAAI Conference on Artificial Intelligence.
Timon Gurcke, Milad Alshomary, and Henning Wachsmuth. 2021a. Assessing the sufficiency of arguments through conclusion generation. arXiv preprint arXiv:2110.13495.