Cross-Domain Argument Quality Estimation



1 Introduction
The argumentation process is one of the cornerstones of society, as it allows exchanging opinions and reaching a consensus together. Fueled by advances in natural language processing, recent years have witnessed the advent of Argument Mining (AM), i.e., the field of automated discovery and organization of arguments. AM is helpful in various scenarios, ranging from legal reasoning (Wyner et al., 2010; Walker et al., 2014; Poudyal et al., 2020; Villata, 2020) to supporting the decision-making process of politicians (Haddadan et al., 2019; Duthie et al., 2016; Menini et al., 2017; Lippi and Torroni, 2016; Awadallah et al., 2012). Thus, there is a flurry of works on the identification of arguments in text (Stab et al., 2018b; Fromm et al., 2019; Trautmann et al., 2020) and their retrieval (Wachsmuth et al., 2017c; Fromm et al., 2021; Dumani and Schenkel, 2019; Dumani et al., 2020; Stab et al., 2018a). Since arguments often have to be weighed against each other, a central property of arguments is their Argument Quality (AQ) or convincingness, i.e., their (perceived) strength. While the ancient Greeks (Rapp, 2002) already discussed the constituents of strong arguments, automated estimation is a relatively uncharted field. Due to the high subjectivity of argument strength (Swanson et al., 2015; Gretz et al., 2020; Toledo et al., 2019; Habernal and Gurevych, 2016b; Stab et al., 2018b), obtaining high-quality annotations is challenging, cf. Section 1. In this light, a legitimate question is the reliability and robustness of the existing approaches for estimating AQ and their applicability in real-life scenarios. Existing AQ benchmark datasets are often restricted to a single domain (Wachsmuth et al., 2016; Persing and Ng, 2017) and/or make different assumptions about the factors impacting AQ. Thus, enabling transfer between sources and datasets appears especially appealing, but existing works (Gretz et al., 2020; Toledo et al., 2019; Swanson et al., 2015; Habernal and Gurevych, 2016b) do not provide detailed studies thereof.
In this work, we thus investigate for the first time the automatic evaluation of the quality of arguments from a holistic perspective, bringing together various aspects. First, we evaluate whether AQ models can generalize across datasets and domains, a crucial feature for deployment in the diverse environments encountered in relevant real-world applications. Next, we investigate the hypothesis of whether models for related argument mining tasks inherently learn the concept of argument strength without being explicitly trained to do so, by evaluating their zero-shot performance for estimating AQ. In summary, our contributions are as follows:
• To the best of our knowledge, we are the first to study the generalization capabilities of AQ prediction models across different datasets and AQ notions.
• Since we identify the dataset size as one of the decisive performance factors, we further investigate a zero-shot setting of transferring from related Argument Mining tasks.
2 Related Work

Argument Quality
Argument Quality (AQ), sometimes also called Argument Strength, is a sub-task of Argument Mining (AM) and one of the central research topics among argumentation scholars (Walton et al., 2008; Toulmin, 2003; Van Eemeren and Grootendorst, 1987). Due to its highly subjective nature, there is no single definition of AQ. As a result, there are various proposals for different factors that can affect the quality of an argument, such as the convincingness of an argument (Habernal and Gurevych, 2016a). There are also several ways to express the strength of an argument: some works assign an absolute continuous score, while others argue that strength estimation works better in (pairwise) relation to other arguments. To the best of our knowledge, we are the first to evaluate how AQ estimators trained on different corpora, AQ notions, and AQ tasks correlate with each other.

One of the first relatively large corpora was presented by Swanson et al. (2015). The SwanRank corpus contains over 5k arguments, where each argument is labeled with a continuous score that describes the interpretability of an argument in the context of a topic. They propose several methods based on linear regression, ordinary kriging, and SVMs as regression algorithms to automatically estimate the strength from an input text encoded by hand-crafted features. Other corpora have followed, using relative and/or absolute convincingness (Habernal and Gurevych, 2016b; Potash et al., 2019) as an annotation criterion. These works proposed AQ estimators based on SVMs or BiLSTMs combined with GloVe embeddings (Pennington et al., 2014). Gleize et al. (2019) provide a dataset, IBM-EviConv, that focuses on ranking evidence convincingness. They used a Siamese network based on a BiLSTM with attention and trainable Word2Vec embeddings. Gretz et al. (2020) and Toledo et al. (2019) created their corpora by asking annotators whether they would recommend a friend to use the argument in a speech supporting or disputing the topic, regardless of their own opinion. Both use a fine-tuned BERT (Devlin et al., 2019) model for the absolute AQ regression task.
The shared evaluation practice in previous works is to evaluate methods on each dataset independently. Gretz et al. (2020) use their newly introduced dataset for pre-training of their model. The authors then investigate the strength of their models by applying them to the two related datasets UKPConv and SwanRank. By fine-tuning the model on the training part of these two datasets, they investigate whether the pre-training is helpful for the target corpora. Our work advances this evaluation practice and advocates for a strict cross-dataset evaluation without additional fine-tuning on the evaluation dataset, in order to estimate a model's applicability in challenging real-life scenarios.
As a common understanding of AQ is still lacking, Wachsmuth et al. (2017a,b) investigated different dimensions of AQ. Based on a survey of existing argument quality theories (Wachsmuth et al., 2017a), they developed a taxonomy that aims to capture all aspects of AQ. In their work, they present a small corpus of 320 arguments annotated for 15 dimensions and explore the correlations between the different dimensions. Thus, their work presents a different view that focuses rather on argumentation theory than on multiple corpora and the generalization of AQ estimators. Lauscher et al. (2020) and Ng et al. (2020) created a cross-domain corpus (Q&A forums, debate forums, and review forums) with 5,295 arguments using the annotation scheme of Wachsmuth et al. (2017a). They conclude that, in most cases, models benefit from the inclusion of out-of-domain training data. However, they do not perform a cross-corpora study of their architectures, which limits the generalizability and impact of their experiments.

3 Generalization across Argument Quality Corpora

High-level applications such as Argument Retrieval (Wachsmuth et al., 2017c; Fromm et al., 2021; Dumani and Schenkel, 2019; Dumani et al., 2020; Stab et al., 2018a) and autonomous debating systems (Slonim et al., 2021) require reliable Argument Quality (AQ) models to select strong arguments among the relevant ones. The research community has identified this gap and proposed and evaluated different automated models for AQ estimation (Gretz et al., 2020; Toledo et al., 2019; Swanson et al., 2015; Habernal and Gurevych, 2016b). However, AQ is often captured differently due to its high subjectivity, e.g., absolutely as a continuous score or relative to other arguments by pairwise comparison. Consequently, many publications also introduced their own corpus with individual annotation schemes capturing different notions of AQ. While they have compared multiple AQ estimators against each other within a single corpus, there is a lack of cross-corpora empirical evaluations. Thus, the robustness of predictions across datasets remains largely unexplored, which poses a severe challenge for reliable real-world applications integrating diverse data sources. To assess the generalization capability of AQ estimation models, we designed a series of experiments across all four major AQ datasets to answer the following research questions:
1. How well do AQ models perform across datasets if the annotation schema and the domain of the arguments do not change?
2. How does the corpus size affect generalization?
3. How well do models generalize across different text domains?
4. How does the AQ quality notion affect generalization?
5. Does the AQ model become more robust if it is trained on a combined dataset containing data from different domains with varying labeling assumptions?

3.1 Datasets and Evaluation Setting
We briefly describe the four AQ datasets used in our empirical study, which all capture AQ on a sentence level. They are also summarized in Table 2.
1. Swanson et al. (2015) constructed the dataset SwanRank with 5,375 arguments whose quality is labeled in the range of [0, 1], where 1 indicates that an argument can be easily interpreted. It consists of four controversial topics taken from the debate portal CreateDebate.
2. Habernal and Gurevych (2016b) annotated a large corpus of 16k argument pairs and investigated which argument from each pair is more convincing. Based on the argument pair annotations, they created an argument graph and used PageRank to calculate absolute scores for the individual arguments. The resulting dataset, UKPConvArgRank (here shortly UKPConv), contains 1,052 arguments. It consists of 32 topics extracted from the debate portals CreateDebate and ProCon.
3. Toledo et al. (2019) created their corpus IBM-ArgQ of 5,300 arguments by asking (1) debate club members (from novices to experts) and (2) a broad audience of people attending the experiments whether they would recommend a friend to use the argument in a speech supporting or contesting the topic, regardless of their personal opinion. They modeled the quality of each individual argument as a real value in the range of [0, 1] by calculating the fraction of 'yes' answers.
4. Gretz et al. (2020) created their corpus of 30k arguments by asking crowd contributors the same question as Toledo et al. (2019). Gretz et al. (2020) further introduce new scoring methods that consider the annotators' credibility without removing them entirely from the labeled data, as done in Toledo et al. (2019). The new scoring functions and the broader annotator selection presumably represent the general population better than Toledo et al. (2019).

As some of the corpora did not provide official train-validation-test splits and differed in the number of topics and the formulated task (in-topic vs. cross-topic), we decided to create our own splits based on the topics of the arguments. Contrary to the original topic splits in UKPConv, IBM-ArgQ, and IBM-Rank, we treat the supporting and opposing arguments of a certain topic as one topic because they are highly similar; in the original splits, e.g., the topics "We should abandon cryptocurrency" and "We should adopt cryptocurrency" are represented as two topics. We perform 10-fold cross-topic cross-validation, where each fold is a 60%/20%/20% train-validation-test split, and we additionally ensure that no topic occurs in more than one split. By the latter requirement and the topic merge, we ensure an inductive setting where the AQ estimation cannot rely on similar arguments in the training corpus, which provides a more challenging but more realistic task.
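To make the split procedure concrete, the following is a minimal sketch under one reading of the setup described above; the function and argument names are ours, the per-fold topic shuffling is an assumption (the text does not specify how the ten folds are formed), and splitting 60%/20%/20% at the topic level only approximates the same ratio at the argument level.

```python
# Illustrative sketch of the topic-level cross-validation splits (not the
# authors' code). `topic_key` maps a topic string to a merged topic id so
# that, e.g., "We should abandon cryptocurrency" and "We should adopt
# cryptocurrency" share one key; arguments are dicts with a "topic" field.
import random
from collections import defaultdict

def cross_topic_folds(arguments, topic_key, n_folds=10, seed=0):
    by_topic = defaultdict(list)
    for arg in arguments:
        by_topic[topic_key(arg["topic"])].append(arg)
    topics = sorted(by_topic)
    rng = random.Random(seed)
    for _ in range(n_folds):
        rng.shuffle(topics)
        n = len(topics)
        train_topics = topics[: int(0.6 * n)]
        val_topics = topics[int(0.6 * n): int(0.8 * n)]
        test_topics = topics[int(0.8 * n):]     # topic sets are disjoint per fold
        collect = lambda ts: [a for t in ts for a in by_topic[t]]
        yield collect(train_topics), collect(val_topics), collect(test_topics)
```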

3.2 Model and Training
Since transfer learning achieves state-of-the-art Argument Mining (AM) results on different corpora and tasks (Reimers et al., 2019; Fromm et al., 2019; Trautmann et al., 2020), we also apply it to our AQ estimation task. We use a bert-base model, pre-trained on masked-language-modeling, and fine-tune it to predict absolute AQ scores on the respective datasets, cf. Section 3.1. As input, we use the arguments from the respective datasets and concatenate the topic information, separated by the BERT-specific [SEP] token, similar to other work in AM (Fromm et al., 2019; Reimers et al., 2019; Gretz et al., 2020). We concatenate the last four layers of the fine-tuned BERT model output (following Gretz et al. (2020) and Toledo et al. (2019)) to obtain an embedding vector of size 4 × 768 = 3,072. For the regression task, we stack a Multi-Layer Perceptron (MLP) on top, with two hidden layers: the first with 100 neurons and a ReLU activation, followed by the second hidden layer with a sigmoid activation function. We train the architecture end-to-end with SGD, a weight decay of 0.35, and a learning rate of 9.1 × 10⁻⁶. The MLP uses dropout with a rate of 10%.
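The following is a minimal PyTorch sketch of this architecture, not the exact implementation used in our experiments: class and variable names are ours, and the pooling of the last four layers at the [CLS] position, the placement of the dropout layers, and the scalar output of the second MLP layer are assumptions where the description leaves details open.

```python
# Sketch of the AQ regression model: BERT-base encoder, concatenation of the
# last four hidden layers, and a small MLP head producing a score in [0, 1].
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class AQRegressor(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name, output_hidden_states=True)
        hidden = self.bert.config.hidden_size            # 768
        self.head = nn.Sequential(                       # MLP on 4 * 768 = 3,072 features
            nn.Dropout(p=0.1),
            nn.Linear(4 * hidden, 100),
            nn.ReLU(),
            nn.Dropout(p=0.1),
            nn.Linear(100, 1),
            nn.Sigmoid(),                                # absolute AQ score in [0, 1]
        )

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Concatenate the [CLS] representation of the last four encoder layers.
        cls = torch.cat([h[:, 0, :] for h in out.hidden_states[-4:]], dim=-1)
        return self.head(cls).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = AQRegressor()
# Argument and topic are concatenated and separated by the [SEP] token.
batch = tokenizer(
    ["Plastic bags should be banned because they pollute the oceans."],  # argument
    ["We should ban plastic bags"],                                      # topic
    padding=True, truncation=True, return_tensors="pt",
)
optimizer = torch.optim.SGD(model.parameters(), lr=9.1e-6, weight_decay=0.35)
loss = nn.MSELoss()(model(batch["input_ids"], batch["attention_mask"]),
                    torch.tensor([0.8]))                 # dummy gold AQ score
```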

3.3 Results
Table 3 summarizes our results. We report the Pearson correlation between the predicted and the ground-truth absolute AQ, evaluated on a hold-out test set. Contrary to the original topic splits in UKPConv, IBM-ArgQ, and IBM-Rank, we treated the supporting and opposing topics as one topic. The task is therefore more challenging, as topic information from the contrary stance cannot be used during training. However, the task is also more realistic, as one cannot expect to have arguments from all topics in the training set.

Evaluation on Similar Datasets and Importance of Training Set Size
First, we evaluate the performance of the model on similar datasets and the dependency on the size of the training dataset. We can observe that models perform very well on other datasets from a similar domain labeled with a similar quality notion, i.e., IBM-ArgQ and IBM-Rank (both are crowd-collected and annotated based on recommendableness). Furthermore, we can notice that the size of the dataset is crucial for performance: a model trained on the largest dataset, IBM-Rank, also achieves the best score on IBM-ArgQ. This insight gives us a solid foundation for the next steps.

Generalization Across Domains and Quality Notions
Next, we investigate whether a transfer across domains is possible. Recall that the four datasets cover two different domains: the sentences from UKPConv and SwanRank have been extracted from debate portals, while IBM-Rank and IBM-ArgQ have been collected from the crowd. Compared to in-domain generalization, we observe a considerably worse generalization between domains: for example, trained on the crowd dataset IBM-ArgQ, we achieve a correlation of 38.9% on the crowd dataset IBM-Rank, while training on the debate datasets SwanRank and UKPConv results in negligibly low correlations of 8% and 3%, respectively. Conversely, when evaluating on the debate portal dataset SwanRank, we obtain a correlation of 42.5% with a model trained on the other debate portal dataset UKPConv, while the crowd-collected datasets IBM-ArgQ and IBM-Rank only achieve 27.8% and 37.0%, respectively. The smaller difference compared to the first comparison can be explained by the larger training datasets.
Surprisingly, we observe a completely different picture for generalization across quality notions. We see only a moderate drop in performance for a fixed domain but a different quality notion. For instance, the model trained on SwanRank performs relatively well on the UKPConv dataset. Vice versa, we observe a more considerable performance drop, which can be explained by the smaller size of the UKPConv dataset.

Multi-Domain and Multi-Quality Notion Training
To investigate whether a single model can grasp various dimensions of quality and work on arguments from various domains, we designed another set of "leave-one-out" experiments. We train on the training sentences of all but one AQ corpus and evaluate the performance on all test sets. Each of the four "all except" rows defines a training set consisting of the three remaining corpora, e.g., "all except UKPConv" consists of the training sets of SwanRank, IBM-ArgQ, and IBM-Rank.
The entries on the diagonal thus show how well the models perform when evaluated on an unseen corpus.
For evaluation on the unseen IBM-Rank dataset after training on the remaining ones, we obtain a correlation of 46.5%, which nearly reaches the correlation of 48.1% we obtained when training and evaluating on IBM-Rank. For SwanRank, IBM-ArgQ, and UKPConv, we can even surpass the correlation on the respective test set by training on all other training sets instead of the one from the respective corpus.

Cross-Corpora Generalization Conclusion
To summarize, our analysis indicates that the available datasets and models for AQ are reliable.
Our most important insight is that AQ notions do not contradict each other, and a single model can estimate the AQ of text from different domains. Therefore, the practical recommendation for real-life applications is to combine all available datasets across different domains and AQ notions.

4 Zero-Shot-Learning in Argument Mining
In this section, we investigate whether explicit Argument Quality (AQ) corpora are a necessity or whether the task of AQ estimation can also be solved by transferring from other related argument mining tasks such as Argument Identification (AId) or Evidence Detection (ED). In contrast to the relatively new task of automatic AQ estimation, other Argument Mining (AM) tasks already offer a broad range of large datasets that cover different domains and annotation schemes. Moreover, the agreement between the annotators is higher for the other tasks, as AQ is highly subjective (Swanson et al., 2015; Gretz et al., 2020; Toledo et al., 2019; Habernal and Gurevych, 2016b; Stab et al., 2018b). Therefore, a successful transfer from related tasks to the target task of AQ would represent a significant advance in the field. To this end, we investigate the zero-shot capability of AM models across different corpora and different AM tasks. To the best of our knowledge, we are the first to compare AM task similarity by providing a first study on how individual tasks can benefit from each other.
In particular, we aim to answer the following guiding research questions:
1. Can we achieve satisfactory performance by zero-shot transfer from related AM tasks, i.e., without fine-tuning on the target task?
2. Is there a difference in transferring from different tasks, i.e., is one task more suited than the other?
While not a primary focus of this work, for completeness, we also provide experimental results for the reverse direction of transferring from AQ estimation to the other tasks.

4.1 Datasets and Tasks
This section provides an overview of the three different AM corpora and tasks we used in our experiments. They are also summarized in Table 4.
1. UKP-Sentential (Stab et al., 2018b) contains over 25k arguments distributed across eight controversial topics. It is annotated for AId, where each sentence is labeled as either argumentative or non-argumentative in the context of a topic.
2. The IBM-Evidence (Ein-Dor et al., 2020) corpus includes nearly 30k sentences from Wikipedia articles. All sentences are annotated with a score in the range of [0, 1], denoting the confidence that the sentence is evidence (either expert or study evidence) of the article's topic.
3. IBM-Rank (Gretz et al., 2020) is the largest of the four AQ datasets and has already been used in Section 3. The arguments are annotated with a score in the range of [0, 1], where 1 indicates a strong argument and 0 indicates a weak argument.
We split all three datasets into train, validation, and test sets (70%/10%/20%). Similar to Section 3.1, we designed the splits such that no topic in the training set also occurs in the test set, which is often called the "cross-topic" scenario in AM and corresponds to a more interesting but also more challenging task that requires a sufficient degree of generalization to unseen topics.

4.2 Evaluation Setting
We use a standard BERT large model (Devlin et al., 2019), pre-trained on the masked-language-modeling task, to evaluate the zero-shot generalization capability. As input for the fine-tuning, we use the sentences from the respective datasets and concatenate the topic information, separated by the BERT-specific [SEP] token, similar to Section 3.2. We develop three different zero-shot evaluation strategies for the different transfer settings (a minimal sketch of the first two mappings follows this list):
• AId → Regression Tasks: We use the BERT encoder output as input to a linear layer with dropout that predicts the classes. Cross-entropy serves as the training loss. The probabilities between 0 and 1 indicate whether a sentence is argumentative or not. The predicted probability of the positive class, i.e., whether it is argumentative, is then directly used as a score for ED and AQ on the respective corpora. We use the Spearman rank correlation instead of the Pearson correlation as an evaluation measure to account for the difference in scale.
• Regression Tasks → AId: ED and AQ use the BERT representations in a single hidden layer that scores the sentences according to their absolute quality or the probability of containing evidence. Since we train on regression tasks, we use the Mean Squared Error loss during training. We then apply the trained models to AId. We select an optimal decision threshold α among all possible thresholds on UKP-Sentential's validation set according to Macro F1. By choosing the validation set, we avoid an unfair leakage to the model. This model is then evaluated on the UKP-Sentential test set.
• Regression Task ↔ Regression Task: For the evaluation of the two regression models, we calculate the Spearman correlation coefficient directly on their respective outputs.
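As a concrete illustration of the first two mappings, the following is a minimal sketch using SciPy and scikit-learn; the function names, the exhaustive sweep over all observed validation scores, and the array-based interface are our assumptions rather than the exact implementation.

```python
# Sketch of the two cross-task mappings: picking a decision threshold for
# AId from regression scores, and using AId class probabilities as scores.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import f1_score

def best_threshold(val_scores, val_labels):
    """Regression Tasks -> AId: choose the threshold alpha with the best
    Macro F1 on the validation set; the test set is only touched afterwards."""
    scores = np.asarray(val_scores)
    candidates = np.unique(scores)
    f1s = [f1_score(val_labels, (scores >= t).astype(int), average="macro")
           for t in candidates]
    return candidates[int(np.argmax(f1s))]

def aid_to_regression_corr(positive_probs, gold_scores):
    """AId -> Regression Tasks: use the predicted probability of the
    argumentative class directly as a score and compare by rank."""
    return spearmanr(positive_probs, gold_scores).correlation

# Usage sketch with toy arrays standing in for model outputs:
alpha = best_threshold([0.1, 0.4, 0.8, 0.9], [0, 0, 1, 1])
print(aid_to_regression_corr([0.2, 0.7, 0.9], [0.1, 0.5, 0.8]))
```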

4.3 Results
Table 5 shows the results of our experiments. We train three models with different random seeds for each source task and report the mean and standard deviation of the evaluation on all target tasks. Unsurprisingly, we generally observe that training on the same task as the one used for evaluation yields the best results, with Spearman correlations of ≈ 77.90% for ED → ED and ≈ 47.45% for AQ → AQ.
A notable exception is AId, where a model trained on ED achieves ≈ 75.16% Macro F1 and thus can slightly surpass the performance of a model directly trained on AId of ≈ 73.51%, although within the range of one standard deviation. Exceeding the in-task performance is a strong result, as the model has never explicitly been trained for the task. We generally observe almost perfect zero-shot transfer towards AId, as the model trained on AQ also achieves a performance of ≈ 71.27%, which is only 2% points behind the ≈ 73.53% from AId to AId. Thus, models capable of predicting whether a sentence provides evidence (ED) or capable of predicting the AQ of an argument inherently learn concepts that enable the detection of whether a sentence is argumentative or not (AId). To further give context to the zero-shot performance, the BiCLSTM approach trained on the AId task by Stab et al. (2018b) obtained a Macro F1 of 64.14%, i.e., worse results than the zero-shot transfer despite explicitly being trained on the task. This underlines the remarkable zero-shot performance and may indicate that AId is a simpler task than the other two, ED and AQ.
For ED, we achieve the best performance of ≈ 77.90% Spearman correlation by directly training on this task. The model trained on AId obtains the closest zero-shot transfer result with a rank correlation of ≈ 55.53%, which still represents a considerable correlation despite being ≈ 22% points behind. The model trained for AQ shows the worst transfer among the studied tasks, with a correlation of ≈ 43.50%. Overall, we note that the challenging zero-shot transfer is still possible with an acceptable loss in performance. Models trained on detecting whether a sentence is argumentative or not (AId) transfer better to the target task of predicting the confidence in whether a sentence provides evidence (ED) than those trained for predicting the argumentative strength of a sentence (AQ).
For AQ, the main focus of our paper, we achieve the best performance of ≈ 47.45% Spearman correlation by directly training on this task. When transferring from related AM tasks in a zero-shot setting, we have to tolerate decreases in performance to ≈ 28.66% for transfer from ED and ≈ 27.49% for transfer from AId, respectively. Both zero-shot models are better at the prediction of AQ than models directly trained on the same target task but on another corpus (in the previous section, UKPConv achieved 3.0% and SwanRank 8.0% on IBM-Rank). Models capable of detecting whether a sentence is argumentative (AId) are slightly less applicable to predicting the sentence's argumentative strength than the models for predicting a level of supporting evidence (ED). One factor here may be that ED is also a regression task, as opposed to the classification task of AId.
To summarize, the results suggest that the tasks of AId, i.e., classifying whether a sentence is argumentative, and ED, i.e., predicting a numeric level of supporting evidence, are closer to each other than to the more difficult task of assessing the argumentative strength, as witnessed by worse zero-shot transfer results from and to AQ. Nevertheless, in principle, a transfer in the highly challenging zero-shot setting is possible; for closely related tasks, it can even lead to similar scores as training directly on the target task.

5 Multi-Task Learning for Argument Quality
As shown in the last section, the AM tasks are sufficiently close to each other to enable successful zero-shot transfer. An interesting question arising from this observation is whether the performance in AQ estimation further improves with multi-task learning.
To this end, we developed a multi-task model that consists of a shared BERT encoder and separate linear layers for the respective tasks. We trained the architecture with weighted loss functions, ensuring that each task is weighted equally. Our results are shown in Table 6. Focusing on the right-most column first, we can see that the performance in terms of Spearman correlation only marginally improves with multi-task learning. A possible explanation is that, as observed before, the other two tasks are seemingly less challenging and more closely related to each other than to AQ. As additional supporting evidence, ED slightly and AId considerably benefit from multi-task learning with AQ.
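A minimal sketch of such a multi-task architecture is given below; it is not the exact implementation. The use of the [CLS] representation, the head dimensions, and computing all three losses on the same batch are simplifying assumptions (in practice, each batch stems from one corpus, so only the corresponding head and loss term apply).

```python
# Sketch of a multi-task model: one shared BERT encoder with a separate
# linear head per task and an equally weighted sum of the task losses.
import torch
from torch import nn
from transformers import BertModel

class MultiTaskAM(nn.Module):
    def __init__(self, model_name: str = "bert-large-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        h = self.bert.config.hidden_size
        self.aid_head = nn.Linear(h, 2)   # Argument Identification (classification)
        self.ed_head = nn.Linear(h, 1)    # Evidence Detection (regression)
        self.aq_head = nn.Linear(h, 1)    # Argument Quality (regression)

    def forward(self, input_ids, attention_mask):
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        return (self.aid_head(cls),
                self.ed_head(cls).squeeze(-1),
                self.aq_head(cls).squeeze(-1))

def multitask_loss(aid_logits, ed_pred, aq_pred, aid_labels, ed_gold, aq_gold):
    # Each task contributes with equal weight to the total loss.
    loss_aid = nn.functional.cross_entropy(aid_logits, aid_labels)
    loss_ed = nn.functional.mse_loss(ed_pred, ed_gold)
    loss_aq = nn.functional.mse_loss(aq_pred, aq_gold)
    return (loss_aid + loss_ed + loss_aq) / 3
```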

6 Conclusion
We see this work as a fundamental step towards a more holistic view of Argument Quality (AQ).
We have shown that for good generalization across individual AQ corpora, a match between the source and target domain of the arguments is essential. In contrast, diversity in AQ concepts does not hinder generalization but rather enriches it. Given a sufficiently broad coverage of different domains and adequate dataset size, the target domain has only a minor impact. This insight is directly applicable to practical applications: the compatibility of different AQ notions allows the direct integration of different data sources, which is a prerequisite for handling the input from different domains encountered, e.g., by general-purpose argument retrieval engines. Moreover, we were able to elucidate the relationship between AQ and other Argument Mining (AM) tasks, such as Evidence Detection (ED) and Argument Identification (AId). Our zero-shot transfer experiments showed that the concepts learned for one of the tasks are sufficient to solve the others to some extent without explicitly being trained for them. By comparing the obtained results, we conclude that AId and ED are more closely related to each other than to AQ and are per se also easier to transfer to. The multi-task experiment further emphasized this, as AQ could gain less from the other tasks than vice versa. Thus, an important open question is how to enable a more successful transfer to AQ, extending beyond the three tasks we studied in this work.

Limitations
1. Our investigation in the zero-shot experiments is not exhaustive; we focused on the interplay between the three main tasks that also provide datasets of similar size: argument identification, evidence detection, and argument quality. However, there are other tasks, such as stance classification (deciding whether an argument supports or opposes a particular topic) or argument structure identification (identifying argumentative discourse units, such as claims and premises). Other tasks might be better source tasks for estimating argument quality.
2. Our experiments are based on the most popular datasets in argument mining and argument quality and may not generalize to other more specialized text domains, such as law or politics.
3. Using only English datasets limits the generalizability of the results to other languages and cultures. The ability to identify and evaluate the quality of arguments may be different in other languages and cultures, and the annotators may not be able to accurately capture these differences. This may lead to a lack of robustness and reliability of the results.

Ethics Statement
The BERT architectures are pre-trained on a large corpus of text data, which may contain biases in terms of language, content, and social issues. These biases could be transferred to our work, which could lead to inaccurate or unfair results.

A Computing & Software Infrastructure
All experiments were conducted on an Ubuntu 20.04 system with an AMD Ryzen processor with 32 CPU cores and 128 GB memory. We used Python 3.7, PyTorch 1.4, and the Huggingface Transformers library (4.15.0). For the experiments in Section 3, we used four NVIDIA RTX 2080 Ti GPUs with 11 GB memory. The models in Sections 4 and 5 were trained on a single NVIDIA Tesla V100. The default parameters from the Huggingface Transformers library were used for all hyperparameters not specified in the following sections.

B Generalization across Argument Quality Corpora
In Section 3, we trained bert-base-uncased models with a batch size of 64. The learning rate was set to 9.1 × 10⁻⁶, and a weight decay of 0.31 was used. We calculated the 95th percentile of sentence lengths based on the four AQ validation sets and truncated longer sentences to that length. The losses in the multi-dataset setting were equally weighted for each of the four datasets. We used early stopping on the validation MSE loss, with a patience value of five epochs, as a regularization technique to avoid overfitting.
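These two regularization details can be sketched as follows; the function names, the tokenizer call, and the epoch-level granularity of the early-stopping counter are illustrative assumptions.

```python
# Sketch of (1) truncating inputs to the 95th percentile of tokenized lengths
# on the validation sets and (2) early stopping on the validation MSE loss.
import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def truncation_length(validation_sentences, percentile=95):
    lengths = [len(tokenizer.encode(s)) for s in validation_sentences]
    return int(np.percentile(lengths, percentile))

class EarlyStopping:
    """Stop training once the validation MSE has not improved for `patience` epochs."""
    def __init__(self, patience=5):
        self.patience, self.best, self.bad_epochs = patience, float("inf"), 0

    def step(self, val_mse):
        if val_mse < self.best:
            self.best, self.bad_epochs = val_mse, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience   # True -> stop training
```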

C Zero-Shot-Learning in Argument Mining
For Section 4, we trained bert-large-uncased architectures with a batch size of 64. The learning rate was set to 1 × 10⁻⁵, and a warm-up period was used for the first 0.1 epochs. We opt for evaluations every 0.1 epochs in our training configuration, resulting in 10 evaluations per epoch. Our train/validation/test split is based on a reasonably standard 70%/10%/20% split. Furthermore, we calculate the 99th percentile of the maximum length of all sentences inside the validation split and truncate them to that length. This further decreases the required training time due to a reduced input dimension without losing significant information. We used a dropout rate of 0.1 for the dropout layer in the AId → Regression Tasks setting. The losses in the multi-dataset and multi-task settings were equally weighted for each of the three argument mining datasets. Finally, to further reduce variance, we trained three models with different random seeds for each setting and report the mean and standard deviation of the results.

Table 1 :
Two example arguments from the studied datasets with their Argument Quality scores.

Table 2 :
Overview of the different Argument Quality (AQ) datasets with their number of arguments, the number of distinct topics, the different source domains, and the AQ notion used for annotation.

Table 3 :
Cross-dataset generalization results, measured by the Pearson correlation between ground truth and predicted Argument Quality on the respective test sets. The first row corresponds to the respective correlation values reported in the original work. For SwanRank (Swanson et al., 2015), the authors evaluated their approach with the Root Mean Squared Error (RMSE); the Pearson correlation was not measured in their work. The first four rows correspond to models trained on a single dataset, whereas for the last four rows all but one dataset have been used for training, i.e., following a leave-one-out scheme. Bold numbers indicate the best results for each column within the two groups.

Table 4 :
Overview of the different Argument Mining (AM) datasets used for the zero-shot experiments, with their size in terms of the number of sentences, the number of covered topics, the source domain, and the AM task.

Table 5 :
Zero-shot performance of the Argument Mining models. The evaluation measure is Macro F1 for Argument Identification (AId), and the Spearman correlation for Evidence Detection (ED) and Argument Quality (AQ).

Table 6 :
Performance of multi-task models trained on different Argument Mining task combinations, including Argument Identification (AId) and Evidence Detection (ED). The performance is measured by Macro F1 for AId, and by the Spearman correlation for ED and AQ.