Multitask Instruction-based Prompting for Fallacy Recognition

Fallacies are used as seemingly valid arguments to support a position and persuade the audience of its validity. Recognizing fallacies is an intrinsically difficult task both for humans and machines. Moreover, a big challenge for computational models lies in the fact that fallacies are formulated differently across datasets, with differences in input format (e.g., question-answer pair, sentence with fallacy fragment), genre (e.g., social media, dialogue, news), as well as types and number of fallacies (from 5 to 18 types per dataset). To move towards solving the fallacy recognition task, we approach these differences across datasets as multiple tasks and show how instruction-based prompting in a multitask setup based on the T5 model improves the results over approaches built for a specific dataset, such as T5, BERT, or GPT-3. We show the ability of this multitask prompting approach to recognize 28 unique fallacies across domains and genres and study the effect of model size and prompt choice by analyzing the per-class (i.e., fallacy type) results. Finally, we analyze the effect of annotation quality on model performance, and the feasibility of complementing this approach with external knowledge.


Introduction
A fallacious argument is one that seems valid but is not (Hamblin, 2022). Theoretical work in argumentation has introduced various typologies of fallacies. For example, Van Eemeren et al. (2002) consider fallacies that occur when an argument violates the ten rules of a critical discussion, while Tindale (2007) categorizes fallacies into four categories: structural fallacies, related to the number and structure of arguments; fallacies from diversion, drawing from the (un)intentional diversion of attention from the issue at hand; logical fallacies, related to the argument scheme at play; and language fallacies, related to vagueness or ambiguity. Fallacious reasoning can bring misbehaviour and be used for manipulation purposes. Thus, having a system that can recognize fallacy types across domains and genres is crucial for applications that teach humans how to identify fallacies and avoid using them in their arguments.

Table 1: Examples of fallacies from multiple datasets
Question-answering dialog moves in ARGOTARIO: "Has anyone been on the moon?" "The moon is so far away, we should focus on our society." (Fallacy: Red Herring)
Propaganda techniques in news: "The ability to build an untraceable, unregistered gun is definitely a game changer." (Fallacy: Loaded Language)
Educational website on fallacies: "She is the best because she is better than anyone else." (Fallacy: Circular Reasoning)
Fact-checked news: "Says Joe Biden has said 150 million Americans died from guns and another 120 million from COVID-19." (Fallacy: Cherry Picking)
Work in computational models for fallacy recognition is still in its infancy, with a limited set of relatively small datasets such as ARGOTARIO (Habernal et al., 2017), which consists of question-and-answer dialog moves; name-calling in social media debates (Habernal et al., 2018); fallacies as propaganda techniques in news (Da San Martino et al., 2019b); logical fallacies from educational websites (Jin et al., 2022); and fallacies used for misinformation in social media and news around Covid-19 (Musi et al., 2022). Table 1 shows some examples of fallacies from these datasets.
Previous work on fallacy recognition has tackled just one dataset at a time. For example, work on detecting propaganda techniques uses fine-tuning of different pre-trained transformers with embedding-based or handcrafted features (Da San Martino et al., 2020; Jurkiewicz et al., 2020) as well as LSTMs and transformers for sequence tagging of propaganda fragments (Da San Martino et al., 2019a; Yoosuf and Yang, 2019; Alhindi et al., 2019; Chernyavskiy et al., 2020), while Jin et al. (2022) propose a structure-aware classifier to detect logical fallacies.
Fallacy recognition is a challenging task for three main reasons: i) the number of classification labels (fallacy types) and the class imbalance in existing datasets are often very high; ii) existing datasets cover varying genres and are typically very small in size due to annotation challenges; and iii) models trained on individual datasets often show poor out-of-distribution generalization.
A recent line of work (Wei et al., 2022; Sanh et al., 2022) relies on the intuition that most natural language processing tasks can be described via natural language instructions, and models trained on these instructions in a multitask framework show strong zero-shot performance on new tasks. Based on this success, we propose a unified model based on multitask instruction-based prompting using T5 (Raffel et al., 2020) to solve the above challenges for fallacy recognition (Section 3). This approach allows us to unify all the existing datasets and a newly introduced dataset (Section 2) by converting 28 fallacy types across 5 different datasets into natural language instructions. In particular, we address the following research questions: i) Can we have a unified framework for fallacy recognition across domains, genres, and annotation schemes? ii) Are fallacy types expressed differently across datasets? iii) What are the effects of model size and prompt choice on the per-class performance for the fallacy recognition task?
Experimental evidence shows that our multitask fine-tuned models outperform task-specific models trained on a single dataset by an average margin of 16%, and beat strong few-shot and zero-shot baselines by average margins of 25% and 40%, respectively, in macro F1 scores across five datasets (Section 4.1). To further deepen our understanding of the fallacy recognition task, we analyze the performance of our models for each fallacy type across datasets, model sizes, and prompt choices (Section 4.2). We further analyze the effect of annotation quality on model performance, and the feasibility of complementing this approach with external knowledge (Section 4.3). We make all datasets, code, and models publicly available.


Fallacy Datasets
The first dataset (ARGOTARIO) was introduced by Habernal et al. (2017); given a QA pair, the task is to detect the fallacy in the answer. Its scheme includes five fallacy types: Ad Hominem, Appeal to Emotion, Red Herring, Hasty Generalization, and Irrelevant Authority.
The second dataset (PROPAGANDA) contains 18 propaganda techniques in news articles annotated at the fragment and sentence levels (Da San Martino et al., 2019b). We focus on 15 that are fallacies and frequent enough in the data: Loaded Language, Name Calling or Labeling, Exaggeration or Minimization, Doubt, Appeal to Fear/Prejudice, Flag-Waving, Causal Oversimplification, Slogans, Appeal to Authority, Black-and-White Fallacy, Thought-Terminating Cliche, Whataboutism, Reductio ad Hitlerum, Red Herring, and Strawman.
The third dataset (LOGIC) was recently released by Jin et al. (2022) and contains 13 logical fallacies (Faulty Generalization, False Causality, Circular Claim, Ad Populum, Ad Hominem, Deductive Fallacy, Appeal to Emotion, False Dilemma, Equivocation, Fallacy of Extension, Fallacy of Relevance, Fallacy of Credibility, Intentional Fallacy) from educational websites on fallacies such as Quizziz and study.com. The examples contain diverse types of text such as dialogue and short statements (e.g., the Circular Reasoning example shown in Table 1). The authors also introduce another challenge dataset, CLIMATELOGIC, that follows the same fallacy scheme. However, it contains text segments that are too long (e.g., multiple paragraphs) with no annotations of smaller fallacious fragments like the PROPAGANDA dataset. Therefore, CLIMATELOGIC is beyond the scope of this study.
The final existing fallacy dataset (COVID-19) is about fact-checked content around Covid-19 (Musi et al., 2022). The authors identify 10 fallacies (Evading the Burden of Proof, Cherry Picking, Strawman, Red Herring, False Authority, Hasty Generalization, Post Hoc, False Cause, False Analogy, Vagueness) through analysis of fact-checked social media posts and news, considering fallacies as indicators of misinformation.
A more detailed description of all datasets is given in Appendix B.
New Fallacy Dataset. Drawing from the annotation scheme developed by Musi et al. (2022), we annotate 778 segments (477 fallacious) from 92 climate change articles fact-checked by climate scientists at climatefeedback.org. Each fact-checked article is accompanied by an "annotations" section where segments from the original articles are directly followed by the reviewers' comments. Two annotators looked at both segments and comments to annotate fallacy types. They had a 0.47 Cohen's κ (Cohen, 1960), which corresponds to moderate agreement. The gold labels were then decided by an expert annotator (in argumentation and fallacy theory) who went over both cases of agreement and disagreement to determine the final label. We denote this dataset as CLIMATE; it differs from CLIMATELOGIC (Jin et al., 2022) in three ways: i) it is built using a fallacy scheme specifically developed for misinformation; ii) the fallacious segments are identified by domain experts at climatefeedback.org and are accompanied by comments which explain their fallacious aspects; iii) the segments are mostly 1-3 sentences long.
Final Labels. We unify the labels of similar fallacies (e.g., False Cause, False Causality, Causal Oversimplification → Causal Oversimplification). We also rephrase some fallacy types by removing words such as "Appeal to" (e.g., Appeal to Emotion → Emotional Language) that tend to throw off generative models, causing over-prediction of these types, as observed in our initial experiments. Some fallacies have partial or full overlap with others across the four schemes. Therefore, we merge these types and use the most frequent or most representative label of the fallacy type (e.g., Fallacy of Relevance → Red Herring). We also unify the definitions of fallacy types in prompts across datasets. We end up with 28 unique fallacy types across five datasets (ARGOTARIO: 5, PROPAGANDA: 15, LOGIC: 13, COVID-19 and CLIMATE: 9). The complete list of fallacy labels and definitions for all types is shown in Appendix B.


Approach
Following the success of multitask instruction-based prompting, we approach different formulations of fallacies across datasets as different tasks within a generic prompting framework in a single model. We use T5 (Raffel et al., 2020) as the backbone model for training on all five fallacy datasets, which have different numbers and types of fallacies. We hypothesize that when a model learns to recognize fallacy types from multiple datasets, it is more likely to learn generic traits of fallacy types rather than characteristics specific to a single dataset.
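The label-unification step from the Final Labels paragraph can be sketched as a simple lookup; this is a minimal sketch assuming a dictionary-based mapping, where only the merges mentioned in the text are shown and the helper name is ours:

```python
# Sketch of the label-unification step (assumed mapping; only the merges
# explicitly mentioned in the paper are listed here).
LABEL_MAP = {
    # merge causal fallacies under one unified label
    "False Cause": "Causal Oversimplification",
    "False Causality": "Causal Oversimplification",
    # drop "Appeal to" phrasing that generative models over-predict
    "Appeal to Emotion": "Emotional Language",
    # use the most representative label for overlapping types
    "Fallacy of Relevance": "Red Herring",
}

def unify_label(raw_label: str) -> str:
    """Map a dataset-specific fallacy label to its unified name."""
    return LABEL_MAP.get(raw_label, raw_label)
```

Labels outside the mapping pass through unchanged, so each dataset keeps its remaining scheme-specific types.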
A sample list of instructions for each dataset is shown in Figure 1 (full list in Appendix C). All instructions start with an n-gram (e.g., 'Given a text segment') followed by a list of fallacy types with or without their definitions. The complete set of fallacies and definitions is shown in Appendix B. The final component of the instruction is specific to each dataset (e.g., question-answer pair for ARGOTARIO, sentence-fragment or sentence only for PROPAGANDA). The generation target during training and test is one of the fallacy types that are permissible for each dataset. In addition, during training we ask the model to generate the fragment that contains the fallacy (PROPAGANDA dataset only) to increase the diversity of prompts and instructions. Since the overall objective of this work is to have a generic classifier for fallacies and to compare with other classification methods, evaluating the model's ability to correctly generate the fallacious fragment is beyond the scope of this paper. At inference time, we use greedy decoding and select the generated target as the prediction of the fallacy type. The evaluation is done using strict string match with the gold fallacy. Model hyperparameters are shown in Appendix A.
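The prompt assembly and strict-match evaluation described above can be sketched as follows; the instruction wording and function names are illustrative, not the exact prompts of Appendix C:

```python
# Illustrative sketch of instruction-prompt construction ("List" vs "Def"
# style) and the strict string-match evaluation. Wording is assumed, not
# the paper's exact prompts.
def build_prompt(segment, fallacies, definitions=None):
    parts = ["Given a text segment, classify its fallacy."]
    if definitions:  # "Def"-style prompt: include a definition per type
        parts += [f"{name}: {definitions[name]}" for name in fallacies]
    else:            # "List"-style prompt: fallacy names only
        parts.append("Fallacy types: " + ", ".join(fallacies))
    parts.append("Text: " + segment)
    return "\n".join(parts)

def strict_match_accuracy(predictions, gold_labels):
    """A generated target counts as correct only on exact string match."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)
```

In the multitask setup, the fallacy list passed to such a builder is the one permissible for the example's source dataset.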

Evaluation Setup and Results
Given the highly imbalanced nature of all fallacy datasets, we report both accuracy (equivalent to micro F1, as we do not include multi-label instances) and macro F1.
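The relationship between the two metrics can be made concrete with a small sketch (pure Python, assuming single-label instances as in the paper; function names are ours):

```python
def accuracy(preds, gold):
    # With single-label instances, accuracy equals micro F1.
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def macro_f1(preds, gold):
    # Macro F1 averages per-class F1, so rare fallacy types weigh
    # as much as frequent ones, which matters for imbalanced data.
    labels = sorted(set(gold) | set(preds))
    f1s = []
    for lab in labels:
        tp = sum(p == g == lab for p, g in zip(preds, gold))
        fp = sum(p == lab and g != lab for p, g in zip(preds, gold))
        fn = sum(g == lab and p != lab for p, g in zip(preds, gold))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

A model that only predicts majority classes can still score well on accuracy while its macro F1 collapses, which is why both are reported.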
Baselines. We consider the following three models as our baselines: i) zero-shot classification using UnifiedQA (Khashabi et al., 2020); ii) few-shot instruction-tuning of GPT-3 (Brown et al., 2020); and iii) full-shot fine-tuning of BERT (Devlin et al., 2019). UnifiedQA is a question-answering model that is trained on 20 question-answering datasets in different formats and has shown generalization capability to unseen data. We use its recent version, UnifiedQA-v2 (3B size) (Khashabi et al., 2022), to test the ability of such a model to recognize fallacies in zero-shot settings. We also perform few-shot instruction-tuning of GPT-3, as many fallacy datasets are of small size, which poses the need for models that can perform well using few-shot training. We set up the instructions in a similar fashion to the ones used for T5 (i.e., the List prompt in Figure 1). Additionally, we set up instructions with explanations, where each few-shot example has a text segment, a fallacy label, and a sentence explaining why the fallacy label is suitable for the text, which has been shown to improve the results of few-shot learning (Lampinen et al., 2022). Constrained by the length allowed in the prompt, we use 2 shots per class for the five fallacy types of the ARGOTARIO dataset, and 1 shot per class for the nine-to-fifteen fallacy types of the other datasets. Given the high number of fallacy types, it is not feasible to instruction-tune GPT-3 on the 28 unique fallacy types that exist in all five datasets combined. Finally, we fine-tune BERT for 3 epochs on each dataset separately to test its ability to do fallacy recognition. All model hyperparameter details are shown in Appendix A.
We also use, as a baseline, a T5-large model trained on each dataset separately with the instructions shown in Figure 1, in order to compare single-dataset with multi-dataset training.

Multitask Instruction-based Prompting vs. Baselines
Baseline Results. Looking at the results shown in Table 3, UnifiedQA struggles to produce any meaningful results and mostly predicts one or two fallacy types for all examples, which shows how difficult it is for models to perform well in zero-shot settings on a complex task such as fallacy recognition. GPT-3 is able to perform well on ARGOTARIO, even when trained with 1 shot per class, but struggles to beat any full-shot model on the other datasets, which highlights the difficulty of this task for few-shot training. Adding the explanations does not improve the performance; any benefit could have been outweighed by the low number of shots per class and the high number of fallacy classes. We notice that BERT has an acceptable performance on the ARGOTARIO dataset (Acc. 44% and F1 38%), which has the lowest number of classes (5 fallacy types) and is also the most balanced dataset compared to the other ones. However, when the number of fallacy classes increases to 9 or more, BERT struggles to achieve good performance on either of the two evaluation metrics. The T5-large model is also trained on each dataset separately using the instructions shown in Figure 1. It has a surprisingly low performance on the ARGOTARIO dataset (Acc. 25% and F1 14%) that is significantly lower than BERT and GPT-3. However, it is able to learn better on datasets with a high number of classes (13-15 classes) and large training data (e.g., PROPAGANDA and LOGIC).

Multitask Instruction-based Prompting Results
We train two sizes of the T5 model (large and 3B) on all datasets combined using the instructions mentioned in Figure 1. As shown in Table 3, this significantly improves the performance of the T5-large model on all datasets compared to its performance when trained on one dataset at a time. The numbers further improve when we increase the size of the model from T5-large to T5-3B. This shows the benefit of our unified model based on multitask instruction-based prompting (multi-dataset) for fallacy recognition, where we have limited resources and some very small datasets, and also shows the ability of larger models to generalize to the five test sets. The two multi-dataset models always have the best or second-best results on all datasets. Also, the T5-3B model is better than T5-large in all accuracy and F1 scores for all datasets except the accuracy scores for COVID-19 and CLIMATE, where T5-large is better; this could be due to having more correct predictions in the majority classes, as T5-3B is still better in macro F1 scores. To further understand the effect of the model size and prompt choice, we discuss in the next section the per-class performance of four different T5 models.
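The multi-dataset setup above amounts to pooling instruction-formatted examples from all five datasets into one shuffled training set; a minimal sketch, assuming each dataset contributes (prompt, target) pairs built with its own instructions (function and field names are ours):

```python
import random

# Sketch of the multi-dataset (multitask) training pool: each dataset keeps
# its own instruction prompts, but examples are mixed into a single set so
# one model sees all 28 fallacy types during training.
def build_multitask_pool(datasets, seed=0):
    """datasets: mapping of dataset name -> list of (prompt, target) pairs."""
    pool = [(name, prompt, target)
            for name, examples in datasets.items()
            for prompt, target in examples]
    random.Random(seed).shuffle(pool)  # interleave tasks during training
    return pool
```

Because each example carries its dataset-specific instruction and label set, the model can still be evaluated per dataset at test time.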

Performance of our Unified Model on Fallacy Types
We show the per-class (fallacy type) results of our unified model (multitask instruction-based prompting) using two model sizes (T5-large and T5-3B) and three prompt choices (Def, List, and All) in Tables 4-a to 4-e.
Model Size. In general, increasing the model size (from T5-large to T5-3B, both trained on all prompts) improves the overall results (especially macro F1) on all datasets. We notice the importance of model size in most datasets for fallacy types that involve diversion moves (e.g., Red Herring).

Prompt Choice. We compare prompts to see which prompt is more useful for this task. We mainly experiment with two prompts that include either the definitions of all fallacies or only a list of the fallacy names. In both cases, the prompt starts with an instruction, followed by either definitions or fallacy names, and ends with the segment that contains the fallacious text. Including both prompts for each training instance yields the best results in most cases, as we would expect. However, it seems that some fallacies benefit more from including the definitions in the prompt than others. In general, including the definitions (T5-3B-Def) rather than just fallacy names (T5-3B-List) gives higher accuracy and macro F1 scores in 4 out of 5 datasets, as shown in Table 4 (with some exceptions in accuracy). In particular, it seems that definitions are more useful for fallacies that are closely related to other fallacies in one scheme, where the definition helps in further clarifying the difference between the two. For example, in PROPAGANDA (Table 4-a), Thought-Terminating Cliches are defined as "words or phrases that offer short, simple and generic solutions to problems" and are mostly confused with Loaded Language by most models, especially ones not trained with definitions. Also in PROPAGANDA, T5-3B-Def has a much higher score than T5-3B-List on Whataboutism, "a discrediting technique that accuses others of hypocrisy", which includes introducing questions about other irrelevant matters; this could have caused models to confuse it with the Doubt fallacy.
Fallacy Types Across Datasets. There are two fallacies that exist in all five datasets (i.e., Irrelevant Authority and Red Herring) and two other fallacies that exist in four datasets (i.e., Causal Oversimplification and Hasty Generalization). We closely look at these fallacies to understand the challenges posed by changes in domain, genre, and annotation guidelines.
Considering the results shown in Tables 4 (a-e) for Irrelevant Authority, we can make three observations: i) T5-large is the best in PROPAGANDA, COVID-19, and CLIMATE; ii) T5-3B-All is the best in LOGIC and marginally second best (to T5-3B-Def) in ARGOTARIO; iii) similar to model size, including the definition in the prompt has inconclusive benefit across datasets. This can be mainly attributed to inconsistency in how this fallacy is defined in different schemes: for example, it strictly refers to "mention of false authority on a given matter" in COVID-19, while it additionally includes "referral to a valid authority but without supporting evidence" in PROPAGANDA (all definitions are provided in Appendix B).
Similarly, no single model is consistently better at detecting Red Herring across all datasets, as shown in Tables 4 (a-e). This, however, is more likely caused by the different forms this particular fallacy takes in different domains and genres: it consists of shorter phrases in PROPAGANDA, irrelevant or misleading questions in CLIMATE, and mentions of irrelevant entities in LOGIC.
Causal Oversimplification has more consistent results, as shown in Tables 4 (a,b,d,e), where the T5-3B-All model has the best results in three out of four datasets. This illustrates that while the notion of this fallacy might differ across datasets, it still strongly shares common generic features (e.g., the existence of a causal relation) that make it distinguishable by a single model in different settings.
Finally, the results for Hasty Generalization shown in Tables 4 (b-e) indicate that detecting this fallacy becomes more challenging when other similar fallacies exist in a fallacy scheme (e.g., Cherry Picking in COVID-19 and CLIMATE), and less challenging when the other fallacies in the scheme are further away (e.g., LOGIC and ARGOTARIO).
Nevertheless, this multitask setup provides the model with the opportunity to learn to detect specific fallacy types as they are expressed differently, and grouped with different fallacies, which consistently and significantly improves the overall results of fallacy recognition over single-scheme (or single dataset) models.

Error Analysis
In order to better understand model errors and the quality of annotations for this task, which is complex for both humans and machines, an expert looked at 70 wrongly predicted examples from the PROPAGANDA dataset (5 examples each for 14 propaganda techniques; Strawman was not included due to low counts). First, the expert looked only at the sentence and the fragment identified by the gold annotation as containing a fallacy, and she independently annotated the propaganda technique at stake. Comparing this annotation with the gold labels and model predictions (T5-3B-All), it turns out that the expert annotator agreed with the gold label in 75% of the cases and with the model prediction in 15%, while she chose a different label in 10% of the cases. Table 5 shows three examples along with gold labels, model predictions, and expert annotations.
Consider the first example in Table 5, which has Doubt as the gold label. The expert agrees that the propaganda technique used rests on questioning the credibility of the lawyer (Doubt), even though the adjective "lunatic" is a literal instance of Name Calling. Thus, the label predicted by the model is not wrong, but less relevant, since the lack of trustworthiness is the most effective feature in undermining the antagonist's stance, regardless of whether it is due to lunacy or lack of integrity.
In the second example of Table 5, the expert agrees with the model prediction of Flag-Waving for the underlined segment, rather than with the Slogans gold label. The term "last hope" can be considered a slogan; however, when we consider the full propagandistic segment, which includes the word "Christianity", it maps better to Flag-Waving as defined in the guidelines (and included in the prompt): "Playing on strong national feeling (or to any group)...".
The third example highlights even more the importance of the selected fragment in the prompt: without considering the reference to the "antichrist" threat, it is not possible to understand that the sentence is playing on a religious-based national feeling.
From the analysis of the 70 examples in the PROPAGANDA dataset, the following general observations emerge: i) some fallacious segments can map to more than one fallacy, especially when one of the two is a language fallacy (e.g., Name Calling, Exaggeration, Loaded Language); in such cases, the model tends to privilege the language fallacy type, even if it is usually not the most relevant from an argumentative perspective; ii) for some cases, the expert annotator had to read more context beyond the sentence; iii) for some cases, the expert agreed with the gold label but disagreed with the boundaries of the annotated fragment, choosing a larger or more informative one.
In light of this, improving automatic fallacy identification may entail i) considering additional context; ii) adopting a fallacy scheme with heuristics that impose an order onto fallacy recognition (structural fallacies first, followed by diversion and logical fallacies, with language fallacies last, when all the others are excluded).
Related Work
Prompting. Using prompts has emerged as a generic framework for training natural language processing models on multiple tasks, via prefix text (Raffel et al., 2020) and few-shot prompt-tuning of GPT-3 (Brown et al., 2020). This was followed by multiple studies that use prompts on smaller-size models with few and full shots on tasks such as natural language inference (Schick and Schütze, 2021b), text classification (Schick and Schütze, 2021a; Gao et al., 2021), and relation extraction (Chen et al., 2022), and that use instruction prompts for multiple tasks (Mishra et al., 2022; Sanh et al., 2022). We follow a similar setup by training a T5 model using instruction prompts for different formulations of fallacy recognition approached as multiple tasks.

Conclusion
We introduced a unified model using multitask instruction-based prompting to address the challenges faced by the fallacy recognition task. We unified all the datasets by converting 28 fallacy types across 5 different datasets into natural language instructions. We showed that our unified model is better than training on a single dataset. We analyzed the effect of model size and prompt choice on the detection of specific fallacy types that may require additional knowledge better captured by bigger models (e.g., diversion fallacies such as Red Herring), and on the distinction between similar fallacies better detected by more comprehensive prompts that include definitions of fallacy types (e.g., Doubt vs. Whataboutism). We analyzed the differences among fallacy types that appear in multiple fallacy schemes across the five datasets and showed that one fallacy type can have multiple meanings, which further increases the complexity of this task (e.g., Irrelevant Authority). We conducted a thorough error analysis and released a new fallacy dataset of fact-checked content in the climate change domain.

Limitations
In the current setup, we consider all examples as fallacious or partially fallacious and do not include a "No Fallacy" class, which some of the fallacy datasets have. Based on this assumption, the model's task is to detect the type of fallacy given a fallacious example. Including "No Fallacy" makes the datasets severely imbalanced (e.g., 70% of PROPAGANDA and 50% of COVID-19 are labeled as "No Fallacy"). We elected to remove it for this work since not all datasets have a "No Fallacy" class (e.g., LOGIC) and since this class is bigger than all 28 fallacy classes combined. Even in our initial experiments with downsampling of "No Fallacy" using BERT, the results were not promising. This setup is in line with the propaganda technique classification task (Da San Martino et al., 2020) and the logical fallacy detection task (Jin et al., 2022), neither of which includes a "No Fallacy" class. We leave further experimentation with pipeline or joint approaches to separate fallacious from non-fallacious text for future work. Other limitations include the need for external knowledge and the multi-label nature of some examples, as discussed in Section 4.3, which we also leave for future work.
We experiment with the second and third largest sizes of the T5 model, T5-3B (11GB) and T5-large (3GB), and do not run experiments with T5-11B (40GB) due to lack of resources. T5-3B is run on 2 Nvidia A100 GPUs with 40GB memory each, with a batch size of 2. These GPU requirements could pose a limitation on using such models in resource-poor settings. They could also have environmental impacts if trained (and re-trained) for longer periods of time. The training time of T5-3B on 2 GPUs for 5 epochs is on average 2-3 hours, depending on the size of the dataset.

A Model Hyperparameters
We use huggingface's implementation (Wolf et al., 2020) of the T5 model (large and 3B), where we train all models for 5 epochs, choosing the epoch with the lowest evaluation loss as the final model. The models are run with a 1e-4 learning rate, the Adam optimizer, batch size 2, gradient accumulation steps 512, maximum source length 1024, and maximum target length 64. At inference time, the target is generated using greedy decoding (beam search of size 1) with no sampling and default settings for T5. The generated target is then compared with the fallacies in the given scheme, and the prediction is counted as correct if they are the same using strict string match.
We also use huggingface's implementation of BERT (base) and fine-tune the model for 3 epochs with a 1e-5 learning rate, batch size 16, and maximum sequence length 256.
For GPT-3, we use the completion API of OpenAI (Brown et al., 2020) with their large engine that is trained with instructions (text-davinci-002), with temperature 0, a maximum of 150 generated tokens, and other parameters kept at their default values (e.g., top_p=1). The generated target is considered correct if it contains the gold fallacy (even with additional text). Since GPT-3 is trained with few shots only, it sometimes generates a generic prefix, repeats the text segment, or generates more than one fallacy.
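The more lenient matching used for GPT-3 outputs, as opposed to the strict match used for T5, can be sketched as follows (helper name ours; case-insensitive comparison is an assumption):

```python
# Sketch of the lenient scoring for GPT-3 generations: the output counts
# as correct if the gold fallacy name appears anywhere in it, since the
# model may add a generic prefix or repeat the input text.
def lenient_match(generated: str, gold_fallacy: str) -> bool:
    return gold_fallacy.lower() in generated.lower()
```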

B Fallacy Datasets
We list in Tables 6 and 7 all the definitions and fallacy labels used in our prompts. As mentioned in Section 2, we unify the definitions and labels for fallacies that fully or partially overlap. Additionally, in the same tables we show the original labels and definitions for all four fallacy schemes as released by Habernal et al. (2017) for ARGOTARIO, Da San Martino et al. (2019b) for PROPAGANDA, Jin et al. (2022) for LOGIC, and Musi et al. (2022) for MISINFORMATION, which is used by the COVID-19 and CLIMATE datasets. We also show counts of fallacy types in the training/dev/test splits for all datasets in Table 8. Below is a detailed description of the four existing fallacy datasets.
ARGOTARIO Introduced by Habernal et al. (2017), the Argotario dataset consists of five fallacies in dialogue between players in game settings.The five fallacy types are: Ad Hominem, Appeal to Emotion, Red Herring, Hasty Generalization, irrelevant authority, in addition to the No Fallacy type.These types are selected because they are: common in argumentative discourse, distinguishable from each other, and have different difficulty levels.Players in the game are presented with a topic (question), which they answer using one of the fallacy types.Other players then try to predict the fallacy type written by author of the answer.The final label is determined when at least four players agree with the author of the answer on the type of fallacy.Each instance consist of a question-answer pair and one out of five fallacy labels.PROPAGANDA Da San Martino et al. (2019b) identified 18 propaganda techniques that appear in news articles.We focus on the following 15 of them that have a fallacy and frequent enough in the data: Loaded Language, Name Calling or Labeling, Exaggeration or Minimization, Doubt, Appeal to Fear/Prejudice, Flag-Waving, Causal Oversimplification, Slogans, Appeal to Authority, Blackand-White Fallacy, Thought-Terminating Cliche, Whataboutism, Reductio ad Hitlerum, Red Herring, and Strawman.We ignore propaganda techniques that do not have an argumentative fallacy (e.g.Repetition) or not frequent enough in the data (e.g.Bandwagon, OIVC).The authors annotate the text spans and propaganda technique (fallacy type) in 451 articles from 48 news outlets allowing multiple labels and partial overlap of text spans.We frame this at the sentence level where the fallacy type becomes the label of the sentence if the fragment is included within the sentence.For sentences with multiple fragments, we consider the label of the longer fragment.We ignore propaganda fragments that span across multiple sentences.This is the biggest dataset in our experiments but it is also the most 
imbalanced one: 6 of the 18 propaganda techniques represent more than 80% of all propagandistic segments. Each training instance consists of a sentence, a fragment, and one of fifteen fallacy labels.

LOGIC Jin et al. (2022) collected examples of logical fallacies from educational websites on fallacies such as Quizziz, study.com, and ProProfs. They identified 13 types of fallacies in the dataset using Wikipedia as a reference. The fallacy types are: Faulty Generalization, False Causality, Circular Claim, Ad Populum, Ad Hominem, Deductive Fallacy, Appeal to Emotion, False Dilemma, Equivocation, Fallacy of Extension, Fallacy of Relevance, Fallacy of Credibility, and Intentional Fallacy.

Fallacy Types and Definitions (Habernal et al., 2017)

Ad Hominem
The opponent attacks a person instead of arguing against the claims that the person has put forward.

Appeal to Emotion
This fallacy tries to arouse non-rational sentiments within the intended audience (Emotional Language) in order to persuade.

Hasty Generalization
The argument uses a sample which is too small, or follows falsely from a sub-part to a composite or the other way round.

Irrelevant Authority
While the use of authorities in argumentative discourse is not fallacious inherently, appealing to authority can be fallacious if the authority is irrelevant to the discussed subject.

Red Herring
This argument distracts attention away from the thesis which is supposed to be discussed, towards irrelevant issues.

(Da San Martino et al., 2019b)

Black and White Fallacy
Presenting two alternative options as the only possibilities, when in fact more possibilities exist. As an extreme case, telling the audience exactly what actions to take, eliminating any other possible choices (Dictatorship).

Causal Oversimplification
Assuming a single cause or reason when there are actually multiple causes for an issue.

Doubt
Questioning the credibility of someone or something.

Exaggeration or Minimization
Either representing something in an excessive manner (making things larger, better, worse), or making something seem less important than it really is.

Appeal to Fear/Prejudice (Fear or Prejudice)
Seeking to build support for an idea by instilling anxiety and/or panic in the population towards an alternative. In some cases the support is based on preconceived judgements.

Flag-Waving
Playing on strong national feeling (or to any group) to justify/promote an action/idea.

Appeal to Authority (Irrelevant Authority)
Stating that a claim is true simply because a valid authority or expert on the issue said it was true, without any other supporting evidence offered. We consider the special case in which the reference is not an authority or an expert under this technique, although it is referred to as Testimonial in the literature.

Loaded Language
Using specific words and phrases with strong emotional implications (either positive or negative) to influence an audience.

Name Calling or Labeling
Labeling the object of the propaganda campaign as either something the target audience fears, hates, finds undesirable, or loves, praises.

Red Herring
Introducing irrelevant material to the issue being discussed, so that everyone's attention is diverted away from the points made.

Reductio Ad Hitlerum
Persuading an audience to disapprove of an action or idea by suggesting that the idea is popular with groups held in contempt by the target audience. It can refer to any person or concept with a negative connotation.

Slogans
A brief and striking phrase that may include labeling and stereotyping.Slogans tend to act as emotional appeals.

Strawman
When an opponent's proposition is substituted with a similar one which is then refuted in place of the original proposition.

Thought-Terminating Cliches
Words or phrases that discourage critical thought and meaningful discussion about a given topic. They are typically short, generic sentences that offer seemingly simple answers to complex questions or distract attention away from other lines of thought.

Whataboutism
A technique that attempts to discredit an opponent's position by charging them with hypocrisy without directly disproving their argument.

(Jin et al., 2022)

Ad Hominem
An irrelevant attack towards the person or some aspect of the person who is making the argument, instead of addressing the argument or position directly.

Ad Populum
A fallacious argument which is based on affirming that something is real or better because the majority thinks so.

False Dilemma (Black and White Fallacy)
A claim presenting only two options or sides when there are many options or sides.

False Causality (Causal Oversimplification)
A statement that jumps to a conclusion implying a causal relationship without supporting evidence.

Circular Reasoning
A fallacy where the end of an argument comes back to the beginning without having proven itself.

Deductive Fallacy
An error in the logical structure of an argument.

Appeal to Emotion (Emotional Language)
Manipulation of the recipient's emotions in order to win an argument.

Equivocation
An argument which uses a phrase in an ambiguous way, with one meaning in one portion of the argument and then another meaning in another portion.

Fallacy of Extension
An argument that attacks an exaggerated/caricatured version of an opponent's argument.

Faulty Generalization (Hasty Generalization)
An informal fallacy wherein a conclusion is drawn about all or many instances of a phenomenon on the basis of one or a few instances of that phenomenon; an example of jumping to conclusions.

Intentional Fallacy
Some intentional/subconscious action/choice to incorrectly support an argument.

Fallacy of Credibility (Irrelevant Authority)
An appeal is made to some form of ethics, authority, or credibility.

Fallacy of Relevance (Red Herring)
Also known as red herring, this fallacy occurs when the speaker attempts to divert attention from the primary argument by offering a point that does not suffice as counterpoint/supporting evidence (even if it is true).

(Musi et al., 2022)

Evading Burden of Proof
A position is advanced without any support as if it was self-evident.

Cherry Picking
The act of choosing among competing evidence that which supports a given position, ignoring or dismissing findings which do not support it.

Red Herring
The argument supporting the claim diverts attention to issues which are irrelevant for the claim at hand.

Strawman
The arguer misinterprets an opponent's argument for the purpose of more easily attacking it, demolishes the misinterpreted argument, and then proceeds to conclude that the opponent's real argument has been demolished.

False Authority (Irrelevant Authority)
An appeal to authority is made where the authority lacks credibility or knowledge in the discussed matter, or the authority is attributed a tweaked statement.

Hasty Generalization
A generalization is drawn from a sample which is too small, not representative, or not applicable to the situation when all the variables are taken into account.
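The parenthetical names in the definitions above indicate how dataset-specific labels are collapsed into the shared label set. A minimal sketch of that normalization step, assuming the parenthetical name is the unified one (the dictionary covers only mappings stated above; the function and variable names are ours, not from the paper's code):

```python
# Collapse dataset-specific fallacy names into the unified scheme,
# based only on the parenthetical mappings listed in the definitions.
UNIFIED_LABEL = {
    "Appeal to Authority": "Irrelevant Authority",
    "Appeal to Fear/Prejudice": "Fear or Prejudice",
    "Appeal to Emotion": "Emotional Language",
    "False Dilemma": "Black and White Fallacy",
    "False Causality": "Causal Oversimplification",
    "Faulty Generalization": "Hasty Generalization",
    "Fallacy of Credibility": "Irrelevant Authority",
    "Fallacy of Relevance": "Red Herring",
    "False Authority": "Irrelevant Authority",
    "False Cause": "Causal Oversimplification",
    "Post Hoc": "Causal Oversimplification",
}

def unify(label: str) -> str:
    """Return the unified fallacy name; unmapped labels pass through unchanged."""
    return UNIFIED_LABEL.get(label, label)
```

For example, `unify("Post Hoc")` and `unify("False Causality")` both collapse to "Causal Oversimplification", which is how the 28 unique fallacy types emerge from the datasets' original schemes.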

C Instructions
We list all instructions used during training in Table 10. ARGOTARIO, LOGIC, and COVID-19 have two instructions per example: List (fallacy types listed in the prompt) and Def (fallacy definitions included in the prompt). For PROPAGANDA, since each instance is a sentence with a marked fallacious fragment, we construct three List instructions: the fragment is included in the first instruction, removed completely in the second (no fragment in Table 10), and moved to the generation target in the third (Frag). The same three variants are constructed with Def prompts, for a total of six instructions per training example. For CLIMATE, each instance is constructed using four instructions: List and Def, with and without fact-checkers' comments (Com). These additional instructions for PROPAGANDA and CLIMATE are included during training only, to increase the diversity of prompts. Also, as discussed in Section 4.1, we use few-shot instruction-tuning of GPT-3 with and without explanations.

Given the question and answer pairs below, which of the following fallacies occur in the answers: Emotional Language, Red Herring, Hasty Generalization, Ad Hominem, or Irrelevant Authority?
------------------
1) Question: Is Christianity a peaceful religion? Answer: You are the antichrist, you want to destroy our belief in god.
Fallacy: Ad Hominem
Explanation: It is an ad hominem because the speaker is attacked for his bad intentions and not for the point she is making.
2) Question: Is television an effective tool in building the minds of children? Answer: All TV-Shows are bad. Look at "the bachelor". Children cannot learn from it.
Fallacy: Hasty Generalization
Explanation: It is a hasty generalization since the evaluation of a whole category is drawn from the evaluation of a single element of the category.
...
5) Question: Should we allow animal testing for medical purposes? Answer: No, animals are so cuuuuteeeeeeee!!!
Fallacy: Emotional Language
Explanation: It is a fallacy of emotional language since the argument appeals to positive emotions associated to animals' appearances.
6) Question: Should gorillas be held in zoos? Answer: No, I don't like gorillas.
------------------
Fallacy: Red Herring

Table 9: Example of GPT-3 few-shot instruction with explanations. The final question-answer pair is the test example, and "Fallacy: Red Herring" is the generated fallacy type.

The instructions that do not include explanations follow the same format as the List prompts shown in Table 10: they start with "Given a text segment ...", followed by a list of fallacy types and then the few-shot examples, each consisting of a text segment and a fallacy type. Additionally, for the variant with explanations, we write an explanation after each few-shot example in the instruction prompt, explaining why the given text segment is labeled with that fallacy type. The explanations follow the fallacy type labels, as shown in Table 9.
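The few-shot format of Table 9 can be assembled programmatically. A minimal sketch, assuming hypothetical example records with `question`, `answer`, `fallacy`, and optional `explanation` fields (all field and function names are our illustration, not the paper's released code):

```python
def build_prompt(shots, test_qa, fallacy_types, with_explanations=True):
    """Assemble a few-shot instruction in the Table 9 format.

    shots: list of dicts with "question", "answer", "fallacy" and,
    when with_explanations is True, an "explanation" key.
    """
    header = (
        "Given the question and answer pairs below, which of the following "
        "fallacies occur in the answers: "
        + ", ".join(fallacy_types[:-1]) + ", or " + fallacy_types[-1] + "?"
    )
    lines = [header, "-" * 18]
    for i, ex in enumerate(shots, start=1):
        lines.append(f"{i}) Question: {ex['question']} Answer: {ex['answer']}")
        lines.append(f"Fallacy: {ex['fallacy']}")
        if with_explanations:
            lines.append(f"Explanation: {ex['explanation']}")
    # The unlabeled test pair comes last; the model's completion is
    # expected to be "Fallacy: <type>".
    lines.append(f"{len(shots) + 1}) Question: {test_qa['question']} "
                 f"Answer: {test_qa['answer']}")
    lines.append("-" * 18)
    return "\n".join(lines)
```

Setting `with_explanations=False` yields the plain few-shot variant compared against in Table 3.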

Table 2: Summary of five fallacy datasets. Ex: total number of examples. F: final number of fallacy types after unifying all datasets. † Original scheme has 18 propaganda techniques. ‡ Original scheme has 10 fallacy types.
Existing Fallacy Datasets We include four existing fallacy datasets in our experiments. The first dataset is ARGOTARIO, introduced by Habernal et al. (2017).

Table 3: Accuracy and macro F1 scores on all datasets. Exp: explanations added to the few-shot examples. Numbers in bold represent the best score for each dataset, and underlined numbers are the second best.

Table 4: F1 scores for each fallacy type for two T5 model sizes (T5-Large and T5-3B) and three prompt choices (Def: fallacy definitions in prompt; List: fallacy types listed in prompt; All: both Def and List prompts), to study the effect of model size and prompt choice. All models are trained on all five datasets combined.

Table 5: Example sentences from PROPAGANDA with gold label, model prediction, and expert annotation. Underlined text highlights the propagandistic fragment.

Table 6: Fallacy names and definitions (Bold: definition of this fallacy used in all prompts across datasets).

False Cause (Causal Oversimplification)
X is identified as the cause of Y when another factor Z causes both X and Y, OR X is considered the cause of Y when actually it is the opposite.

Post Hoc (Causal Oversimplification)
It is assumed that because B happens after A, it happens because of A. In other words, a causal relation is attributed where, instead, a simple correlation is at stake.

False Analogy
Because two things [or situations] are alike in one or more respects, they are necessarily alike in some other respect.

Vagueness
A word, a concept, or a sentence structure which is ambiguous is shifted in meaning in the process of arguing, or is left vague, being potentially subject to skewed interpretations.

Table 7: Musi et al. (2022) names and definitions (Bold: definition of fallacy used in all prompts across datasets).

Each training instance consists of a text segment (e.g., dialogue, sentence) and one of thirteen fallacy labels. The authors also introduce another challenge dataset, CLIMATELOGIC, that follows the same fallacy scheme. However, it contains text segments that are too long (e.g., multiple paragraphs), with no annotations of smaller fallacious fragments like the PROPAGANDA dataset. Therefore, CLIMATELOGIC is beyond the scope of this study.

Misinformation Musi et al. (2022) identified 10 fallacies through analysis of fact-checked news (articles and social media posts) about COVID-19. They consider fallacies as indicators of misinformation, which they define as misleading news that is not necessarily false or communicated with the intention to deceive, thus making it harder to detect and fact-check. The fallacies are: Structural (Evading the Burden of Proof), Diversion (Cherry Picking, Strawman, Red Herring, False Authority), Logical (Hasty Generalization, Post Hoc, False Cause, False Analogy), and Language (Vagueness). They annotate 1,135 COVID-19 news articles and social media posts (621 fallacious) that are fact-checked by five fact-checking organizations.
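The sentence-level framing used for PROPAGANDA earlier in this section (label a sentence with the fallacy of its longest fully contained fragment, ignoring fragments that cross sentence boundaries) can be sketched as follows; the tuple layout is our assumption, not the corpus's released format:

```python
def sentence_label(sent_start, sent_end, fragments):
    """Derive a sentence-level label from character-span fallacy annotations.

    fragments: list of (start, end, fallacy_type) tuples over the article text.
    Fragments crossing the sentence boundary are ignored; if several
    fragments fall inside the sentence, the longest one wins.
    """
    contained = [(s, e, t) for (s, e, t) in fragments
                 if sent_start <= s and e <= sent_end]
    if not contained:
        return None  # no propagandistic fragment in this sentence
    _, _, label = max(contained, key=lambda f: f[1] - f[0])
    return label
```

Returning `None` for sentences without a contained fragment mirrors discarding cross-sentence fragments rather than force-labeling the sentence.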